

Bridging the Skills Gap: Insights from Agentic Coding Training
agentic workflows
software factories
theory of constraints
ai native
outcomes over outputs
dark factories
Daniel Jones and Benedikt Stemmildt dive into the technical and organizational shifts of agentic software engineering. They move past the hype to discuss the transition from manual coding to building software factories that automate feature delivery. The conversation explores critical technical hurdles, including context window reasoning degradation and the necessity of reverse engineering model defaults to steer agents effectively. By referencing DORA 2025 metrics, they illustrate why AI adoption accelerates high-performing teams while exposing systemic bottlenecks in lower-maturity organizations.
Hosted by
Deejay
Featuring
Benedikt Stemmildt
Guest Role & Company
Co-Founder & Co-CEO @ hackers&wizards
Guest Socials
Episode Transcript
Daniel Jones (00:02) Welcome to the Waves of Innovation podcast. I am DJ, your host. In this episode, I am talking to Benedikt Stemmildt from hackers&wizards, who help people with agentic coding. They deliver training to organizations to upskill their developers and help with the change that's needed around that. If you've been listening to the podcast for a while, you will know that Resync also do something similar, and I've been involved in that kind of work of developing and delivering training courses. Benedikt and I are on a shared Slack, and we found each other by often offering similar advice to people about how to adopt agentic coding practices in their organizations. So I thought it'd be really fun to get Benedikt on the podcast so we can exchange notes and share some of the lessons learned. So if you're a developer looking for advice on the kind of things you should know when adopting agentic coding, or you're an engineering manager thinking about how you might adopt this more broadly, this episode is for you. Enjoy.

Daniel Jones (01:00) Benedikt Stemmildt. Stemmildt, Stemmildt. I've been practising "Stemmildt" and not "Stemmelt" all morning, and then I go and get it wrong. Benedikt Stemmildt, welcome to the Waves of Innovation podcast. Would you like to tell people what you are doing at the moment and what you've been doing recently?

Bene | hackers&wizards (01:03) Yeah, that's correct. Yeah, hello. Thank you so much for having me. You're not the first one to fuck up the name, don't worry about it. We're actually the only family on the planet that has this name, so that's a fun fact. Yeah, what I've been doing recently is...

Daniel Jones (01:20) Hahaha, wow.

Bene | hackers&wizards (01:30) For the last three years, actually, I've been super deep into agentic software engineering. It's a rabbit hole, a little bit like an addiction you can't come out of. And yeah, I'm helping companies and developers to kind of get...
Bene | hackers&wizards (01:47) ...get their heads around the topic: what does it mean, how do we actually do it, how can we scale it. And doing trainings, and figuring out how all of this stuff actually works.

Daniel Jones (01:59) Yeah, and that's one of the reasons why I thought it would be fun for us to talk, because we ended up talking together on a Slack, and it became clear that we're both doing training with people who are getting used to agentic coding. Everything's changing so quickly, there's such a demand for this kind of stuff, and there are so many insights from so many places. So I thought it would be good to exchange notes on what you've seen, what has worked well for you, the kind of questions that people are asking, all of that kind of stuff. So with the training that you've been delivering: you said that you've been into agentic coding for a few years. Did that start off as a practitioner, and then you realized, oh, I know quite a lot of stuff, I can share this with people? Or did you come to it from a different angle?

Bene | hackers&wizards (02:44) Yeah, quite a different angle. So what you need to know about me: I'm a technologist by heart. I started at a very early age, in 1999. I was nine years old, and I always envy people that were able to start earlier, because my first one was a Pentium 4 with a turbo button and so on. But I always envy people who had a C64 or, even earlier, some Amiga and so on.

Daniel Jones (02:59) Nice.

Bene | hackers&wizards (03:10) My first one, I mean, it already had a CD drive. Yeah, it was already 750 megahertz. So I know this stuff by heart. I studied it afterwards, and I made a typical, classical career: becoming a senior engineer, then an architect and a CIO. And that, I mean...

Daniel Jones (03:14) Wow.
Bene | hackers&wizards (03:33) I always had the love for coding, but in that kind of career, it turns out that coding slowly drifts away from you, because you get into more management positions. And that was actually the angle that brought me to agentic engineering, because it was very hard for me to keep up with coding; I didn't have enough time for it. And then ChatGPT came out and so on, and I was like, that's interesting. And the first tool that I noticed was Aider. It was not really an agent yet, but I tried it out and I was like, this is very useful, because now I can be more efficient. I can do side projects with the little time that I have. And that is the angle that I took into it. And I had the greatest... I don't know how this happened, but I had the chance to be one of the first ones to try out Claude Code, way before it was released.

Daniel Jones (04:23) Nice.

Bene | hackers&wizards (04:24) Yeah, I don't know how this happened. I wrote an email to them and then I got access. I think it was a beta or even before, I don't know. And when I tried it the first time, I was like, okay, this is going to change things. And yeah, that was mid-2024 or something, right? So yeah, a few months after Boris did his first tweet about it or something. Yeah.

Daniel Jones (04:43) Wow, awesome. That's very early in; that's a great head start to have. You were talking there about being further away from the code and then being able to use these tools to work on side projects. As somebody who's taken a similar trajectory myself, it gets to a point where it's like: I know I can program, but can I remember all the syntax? Can I remember how to set up my...
bloody workstation? I've got to set up my environment, and now the dependencies are screwed, and this version doesn't work with that version. I just want to write some code, I want to build something. And the number of senior tech leaders that I've encountered in the last six months, and especially over Christmas 25 going into 26, who had a bit of free time, all suddenly rediscovering the joy of creation... It's like, I can actually get on and do stuff now. It's been quite nice to see.

Bene | hackers&wizards (05:35) Yeah. Yeah. Yeah.

Daniel Jones (05:40) So you got access to Claude Code quite early on, and kind of got eyes onto it for a lot of the rest of us. So from there, what happened next? Because you're delivering training to people now, and it sounds like things are going quite well. Before we started recording, you mentioned that you're expanding and you've got other people working with you. How did you go from being early in Claude Code to helping other people with agentic coding?

Bene | hackers&wizards (06:04) Yes, I'd say it actually is similar to how everything has happened to me until now. It just kind of fell to me, right? I listen to my gut a lot; I'm a very gut-listening person, although I'm also logical and so on. But I don't know, I just started to work on this and use it. And then, in the beginning of 2025, it became clear that... so, I was working with a consultancy at that time, and I was always looking at founding my own company, and I was always searching for what would be the thing that I do. And for a long time, I really had this internal sentence that doing consulting is not enough for your own company, that you need to build a product or something, right?
Like, products are great. And in this journey that I had with Claude in the beginning, I kind of realized that maybe this is shifting away from "product is great" to "you need to know how to use these tools", because a product is something that you can just build yourself, right? You don't really need to buy a product. And this sentence became...

Daniel Jones (07:12) Yeah.

Bene | hackers&wizards (07:16) ...became less and less loud in my mind. And then, yeah, our first conversations with our clients back then turned out that they were interested in the topic. And then the first outside people came and said, well, what is this? How do I do that? And then it started just naturally, and I was offering: yeah, I can show your team, I can do a talk or a demo of how this works. And as you mentioned, Christmas 2025 was actually a real turning point, because, as you said, people had time, and Opus 4.5 was released shortly before Christmas, so there was a bump in capability as well. And after that, yeah, people were just like, we need to do this now, right? And that was the point where this all sped up insanely fast. And I'm currently a little bit overwhelmed with all of that, but it's okay, we are managing. Yeah. It's a good problem.

Daniel Jones (08:13) That's a good problem to have. Yeah, it's a good problem to have. Yeah, I think, you know, in 2025 I remember talking to customers and saying, you know, what are you doing with the agentic coding tools? Often, the kind of work that Resync does is AI strategy type stuff. So as well as agentic coding, it might be: how are you going to build...

Bene | hackers&wizards (08:31) Mm-hmm.

Daniel Jones (08:35) ...you know, machine learning features into your value streams?
So you're delivering value, the product features are AI-based, or how are you going to upskill the people in accounts and marketing? And, you know, we had a few engagements where that was the majority of what we were doing, and how the developers were working was a minority part. And I remember talking to a couple of customers and saying, you know, what are you doing with the agentic coding? And the answer last year was really commonly...

Bene | hackers&wizards (08:39) Mm-hmm. Mm-hmm. Mm-hmm.

Daniel Jones (08:59) ...well, we let them use the tools that they want, and, you know, if people want to try it out, then they can try it out, and I'm happy with that. And I think, probably, like you say, when Opus 4.5 released and the pace of things started to pick up, there's very much been a shift, in my mind, that now leaders are thinking: we need to do this systemically, we need to do this department-wide, we need to do this in a structured fashion, rather than just having different people using different tools to different extents. It seems that people have realized the value now and really want to make it a core part of their official development process, rather than just having individuals maybe doing some if they fancy it.

Bene | hackers&wizards (09:38) Yeah, yeah, that's true. And what I find really interesting here is that this is also kind of a natural progression, right? First, a few people are individually more efficient because they use it, and then they hit a ceiling, because they cannot be more efficient than being more efficient on their own. They work in a team, right? So now you need to figure out how we as a team can be more efficient. And people are focusing only on the coding part in the beginning.
So then you hit the next ceiling: okay, now we create back pressure on the product owners, because we are very fast, and we have more questions even quicker, and we need more input even quicker, because we want to know what to develop now, because what we planned to do in the sprint is now done in, like, half the sprint. And at this point, I think, this is the point where the upper leadership realizes: okay, we need to do it systematically, right? Because otherwise they will hit this bottleneck over and over again, and this will just be a very slowly progressing adoption. And other leaders that I see are on the other end of the spectrum: they give everybody the tools, but no one is using them, right? Because people don't really know how, or they tried it, like, half a year ago and then said, well, it's not working. Yeah.

Daniel Jones (10:58) Yeah, that whole thing of "I tried it half a year ago and it wasn't very good" is probably one of the biggest bits of friction in agentic coding adoption. And, you know, it's a legitimate experience. To take an extreme case, maybe they tried GitHub Copilot auto-complete two years ago and it was rubbish. You know, I encountered someone like that in the last few weeks. And if you haven't tried the tools recently...

Bene | hackers&wizards (11:03) Yeah. Yes. Mm-hmm.

Daniel Jones (11:23) ...it's not clear how much they've come on, because the things they do are not massively different from an outsider's perspective. So you assume not much has changed. But really, the capabilities have come on massively. And like you mentioned, having that kind of diversity of experience in an organization, for those of us that are trying to train people, is a double-edged sword, because you've got some people that are really far ahead and are going to be bored by the entry-level material.
And then you've got some people who have only ever copy-and-pasted code from ChatGPT. You know, they've not even used the AI features in their IDE. So that's always a challenge as a training provider, trying to cater to both audiences. It's like, well, if you know all of this already, maybe you can help out your colleagues that don't. That would be a good thing to do.

Bene | hackers&wizards (11:49) Mm-hmm. Mm-hmm. Yeah, I have similar experiences, but what I also see is that people that are using it a lot often might not understand how this stuff is actually working internally. My experience is often that they lack the fundamentals: what is an agent, how does an agent do a tool call, how is the context actually built? And then they hit this ceiling of productivity as well, because... I mean, you probably hear this all the time as well: "I told it not to do that, and it did that." And I'm like, yeah, but your context was at 98%, and you didn't really put it in the right place in your CLAUDE.md, or it didn't pick it up at the right time. Even the people that are really far ahead often lack these fundamentals. And that is something that is helpful for the training, because I feared that in the beginning as well: there's a super diverse group, and, God, what will I do with the ones that already know all of that stuff? But I always start with a fundamental overview of what an agent actually is compared to an LLM and so on, and then even those people are like, wow, this explains a lot, and I can pick up on that. And that is interesting: using these tools doesn't mean that you understand them. That's probably the sentence that I'm looking for. Yeah.

Daniel Jones (13:35) Yes. Yeah, because they're so magic.
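The fundamentals Bene describes, that an "agent" is mostly just an LLM called in a loop with tool results appended to an ever-growing context, can be sketched in a few lines. Everything here is a hypothetical illustration: `call_model`, the reply format, and the `read_file` tool are invented for the sketch, not any vendor's actual API.

```python
# Minimal sketch of an agent loop. The "agent" does very little itself:
# the model decides which tool to call; the loop runs it and appends the
# result to the context, then asks the model again.

def run_tool(name, args):
    # Hypothetical tool registry; real agents expose shell, edit, search, etc.
    tools = {"read_file": lambda path: open(path).read()}
    return tools[name](**args)

def agent(task, system_prompt, call_model, max_steps=20):
    context = [{"role": "system", "content": system_prompt},
               {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(context)       # model is stateless: it sees only `context`
        if reply["type"] == "answer":
            return reply["text"]
        result = run_tool(reply["tool"], reply["args"])
        # Tool output is appended, so the context only ever grows:
        # instructions near the start compete with everything added since.
        context.append({"role": "assistant", "content": str(reply)})
        context.append({"role": "tool", "content": result})
    return "gave up"
```

Two things follow directly from the shape of this loop: the model re-reads the whole context on every turn, so an instruction buried under thousands of tokens of tool output competes for attention with everything added since; and the agent has already iterated several times internally before you ever see a result.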
You know, when they work, they work. I was explaining... I made the mistake, well, it was no mistake: I was on a video gaming website over Christmas, and video gamers hate gen AI, generally; like 99% of that population think it's the worst thing to ever happen. And so I was like, you know what, I'm going to reply to some of these comments as an informed person, saying it's not all bad, and this will really help make games cheaper, because it will make production cycles shorter. And I ended up explaining that I was doing some training, and somebody replied with an extremely flippant reply. I don't think they swore, but they did say, like, "lol, why do you need to teach people how to use AI assistants?" And I was like, you know what, that's a fair point, because when they work, they work. So it is that unhappy path: teaching people why they go wrong and, like you say, what's happening under the hood.

Bene | hackers&wizards (14:23) Mm-hmm. Mm-hmm.

Daniel Jones (14:30) And you mentioned things like tool calls, and the difference between what the agent does, which isn't very much, and how much responsibility lies with the model. When people are just interacting with Claude Code or GitHub Copilot, it's all just one big interface to them. They don't see all of the bits under the hood.

Bene | hackers&wizards (14:38) Yes. Yeah. To me, it often feels like: teach them, first of all, the fundamentals, then the approach, and then how to actually make it work when it does not. And making it work when it does not is, for me, very similar to reverse engineering. Because, I mean, what does it mean to "not work"? "Not work" often means: not in the way that you wanted it to. For someone else, it might be that it worked, right? It depends a little bit on what you feel the result should be.
And if it's a different result that is actually also okay, you might still not be okay with it, because it's not your opinion of how it should work. But it all comes back to the defaults of the model, right? The model has a specific default path in its training that it will take, and you need to figure out what that default path is. And then you need to steer it onto a different path if you want a different one. There was actually an interesting study going around here in Europe at the moment, I'm not sure if it's reaching you as well, where a group of scientists in Zurich found that the initial context information, generally context that has been generated by Claude Code and the model, is not very helpful. It only creates overhead, because it increases the size of the context, and it's in the model anyway. So it's just putting stuff in there that is already in there. And I think that's a fair point. A lot of people that I see are making this mistake: they generate that initial context file in the beginning and so on, and it's not game-changingly helpful, because it's putting on top of the model what is already in there. What I really see, when people really understand how to work with it, is figuring out after a conversation: where did I need to steer it? And extracting that information and putting it into the files, because that is not what is default in the model. And this kind of reverse engineering is, I think, one of the major things that we are currently trying to teach. And it's not that easy to do. I mean, I'm not sure how often you did reverse engineering before AI, right? Not that often, probably.

Daniel Jones (17:14) It's interesting you say that, because I see exactly the same thing.
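One practical consequence of the "reverse engineer the defaults" approach Bene describes is that an instruction file stays short: it records only the places where you had to steer the model away from its default path in your project, not general good-practice advice the model already knows. A hypothetical example of what such a file might look like (the project rules, type names, and paths below are invented for illustration):

```markdown
## Project-specific steering (things the model gets wrong by default here)

- Use our in-house `Result` type for error handling; do not throw exceptions.
- Database migrations live in `db/changesets/`, not the framework's default folder.
- Domain language: a "booking" in this codebase is a reservation *request*,
  not a confirmed reservation.
```

Generic advice like "write tests" or "keep functions small" is deliberately absent: per the Zurich finding Bene mentions, it mostly duplicates what is already in the model and just consumes context.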
And it's kind of interesting seeing the different trends, and how quickly things change: what people were saying you should definitely do on LinkedIn in September 25 is now recognized to be an anti-pattern. We'll try and dig out the URL. I went to a gig with a customer, I won't name the customer, and had a great time with two other people there, watching a band called Architects.

Bene | hackers&wizards (17:28) Yeah. Mm-hmm.

Daniel Jones (17:43) I'm 43. I thought, sorry, I'm going to get involved in the mosh pit, and almost immediately broke my knuckle. I just, like, banged it on the table. That was a month ago. Oh dear. Anyway. Yes. So, with putting lots of information in your initial context: we'll try and find the link to that paper and put it in the comments or the description at the end. That was a big thing; everyone was saying, whenever you find your agent doing something that it shouldn't...

Bene | hackers&wizards (17:50) I'm so sorry.

Daniel Jones (18:08) ...add it to your AGENTS.md. But that, like you say, just adds lots to the context. It's probably guidance, or knowledge, that's already in the model. It might already be in the system prompt, for all you know. And to your point about people not understanding how things work under the hood: realizing that that guidance goes to the model on every single invocation, whether it's relevant or not. Having an AGENTS.md or a CLAUDE.md full of...

Bene | hackers&wizards (18:10) Mm-hmm. Yeah.

Daniel Jones (18:36) ...every bit of advice for every single possible situation is definitely going to make things worse. You're just adding to your context. And one of the things that I think is maybe not understood very well is the difference between effective context and the maximum size of the context window. The model providers will very happily tell you it's a million tokens of context window.
And then the academic research looks into it, and it's like...

Bene | hackers&wizards (18:58) Yeah.

Daniel Jones (19:03) ...after a mere 30,000 tokens... and, you know, a token is what, there's about 1.4 tokens to a word, something like that. So after 30,000 tokens: I think in 2024 the NoLiMa paper showed that reasoning ability drops off 15% once you get over 30,000 tokens. There was another paper, I can't remember the name of it, that I looked at from 2025, looking at GPT-4.1-era models, that showed the same thing.

Bene | hackers&wizards (19:11) Mm-hmm.

Daniel Jones (19:29) So between 30,000 and 64,000 tokens, reasoning ability dropped off by double-digit percentage points. So folks don't necessarily know that there is that cliff. And the more that you stuff into your context, the more likely the model is to get confused or to do the wrong thing. And to stop it doing the wrong thing, they're stuffing more stuff into the context, which is kind of counterproductive.

Bene | hackers&wizards (19:52) Yeah. Yeah. And there's this argument of "we just need bigger context windows", right? I'm like: actually, I like small context windows, because I'm forced to manage them. And when I'm forced to manage them, the quality gets better, just because I'm forced to do that. But this is an interesting angle, I think, in regards to the question of what we should actually do in the trainings, or what we do in the trainings. Because we don't try to teach how a certain tool works, like "this is Claude Code and this is how you use Claude Code", but exactly these concepts that we are debating here, right? You need to really understand, or at least have an idea of, what this could mean. Because even for me, it's so hard to keep up with all the changes in terms of what has been figured out, right? And the studies that you are mentioning: with a new model, this is all completely different again, right?
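A rough way to make the effective-context point concrete is to estimate how many tokens your always-loaded instruction files consume before the task even starts. This sketch uses the common "about 4 characters per token" rule of thumb, which is a crude heuristic rather than a real tokenizer, and the 30,000-token figure is only illustrative of the degradation threshold discussed above; as Bene notes, it shifts from model to model:

```python
# Estimate the token cost of always-on instruction files (CLAUDE.md,
# AGENTS.md, ...) against an *effective* context budget, which research
# suggests is far smaller than the advertised maximum window.

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic: ~4 characters per token

def context_budget_report(files: dict[str, str], effective_limit: int = 30_000) -> int:
    total = 0
    for name, content in files.items():
        t = estimate_tokens(content)
        total += t
        print(f"{name}: ~{t} tokens")
    print(f"total always-on instructions: ~{total} tokens "
          f"({100 * total // effective_limit}% of effective budget)")
    return total
```

Every token spent here is paid on every single model invocation, relevant or not, which is why pruning generic advice out of these files matters more than the million-token headline number suggests.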
It could be 40,000 or 50,000 tokens for a different model, or it could be less. And the stuff that you put in your context to steer it is also completely different with a new model or another model, because the defaults in that other model are different, right? And that brings me to one thing that is currently happening that is always kind of strange to me: people are putting out a lot of these so-called frameworks, like BMAD and Spec Kit and so on. There's a lot of this stuff coming up, and people and companies are always looking for something like that, right? Like, oh, we can use that; it's easier to use something and then pick up on it. And it's so interesting that, from what I see, these things tend not to work as well as if you just figure out, in your own project, what you actually need. Starting without anything, and then just using these frameworks as inspiration, right? So when you see something not working very well, then you can maybe look into what these frameworks offer as an answer, and then you can leverage that, and only that part, and make adjustments to your own setup.

Bene | hackers&wizards (22:01) Something like that. Yeah. But this is also a super interesting debate, and not really completely sorted out yet. Do we need these frameworks? I mean, is it even a framework? I don't know.

Daniel Jones (22:11) Yeah, and, I mean, it is a bunch of markdown. Is it a framework? When I was doing a module on spec-driven development last week, I was saying to people: it's just a load of markdown files and a CLI to move some markdown files around. In terms of software, there's not much to it. We were doing some stuff with BMAD earlier this week.
And the requirements we fed into BMAD were basically: make an online agile retro board with a happy, meh and sad column.

Bene | hackers&wizards (22:30) Mm-hmm.

Daniel Jones (22:39) People should be able to add things to the board. Everybody should see whenever the state changes, so multiple sessions. And users should be able to upvote things. BMAD took three hours to write 312 spec files for that very simple set of requirements, and maxed out my colleague Michael's Claude Code usage window for five hours. It went to extreme lengths, and... yeah, I'm not sure that that's definitely a better answer than figuring it out yourself. And to your point about jumping on a framework as opposed to finding your own way there: just like with using agents in the first place, when you get to the destination and you didn't take the journey, you don't understand what you're missing. You don't understand the pros and cons. Maybe if you had never done any agentic coding and you jumped straight into using BMAD, you'd be like, oh, 312 spec files for a simple agile retro board is perfectly normal, and it's perfectly normal that it takes three hours while not even writing any code, just writing markdown. So yeah, that idea of building yourself up is important. And it just makes me think: our kind of main contact point, a dev, Tomash, pointed out that...

Bene | hackers&wizards (23:37) Yeah.

Daniel Jones (23:54) ...it's tempting for organizations to defer getting involved in this training and say, we'll wait till it's settled science, it's too early to jump on now. But then you miss all of the foundational education that you need to understand whatever it is that emerges in the next six months, or, you know, whether we're all using software factories by the end of 2026. If you don't get involved now and learn all of the stuff that is current, or that has led us to this point, you won't properly understand the tools that you're using; they abstract over all of this.
Bene | hackers&wizards (24:27) Yeah, I think the real challenge is that even if we end up having something like a software factory agent, a super multi-agent system, in my experience it will not be able to build anything for anybody. That's why I also don't like these frameworks, right? You need to figure out what you need to reverse engineer: what is currently in the model, how does the agent behave, and then adjust it to your situation. I don't want to say "context", because people get confused between context and context windows, but to your situation, your environment: how does the company work, what is your domain, what is the language of your domain, and so on. Because a lot of that stuff is not in the model, and...

Bene | hackers&wizards (25:18) ...even if there were a generalistic feature factory coming out, similar to Claude Code or Codex, which are kind of general feature-developing factories, it would need to get information or steering regarding your context that is not in there yet. And that is exactly the skill that you need to learn: how to figure out what is missing, how to reverse engineer that stuff, and how to configure it. And I think it will still be the same next year, when you have these more advanced agents. And that is the shame that I see: people are missing out based on the wrong assumption, the assumption that what they learn today is not what they will need. I mean, I never learned assembler, right? And I don't need to learn it. So it's a fair point to say: I don't need to learn every evolutionary step. You know what I mean. But I think for agentic engineering, this argument doesn't work very well. Yeah.
Daniel Jones (26:21) For a long time as an engineer, I was crap at my job, and anyone that worked with me in the early 2000s will absolutely know that. No unit testing, no CI/CD, deploying straight to prod. It was generally not very sophisticated, what we were doing. I've moved on since then, I'd like you all to know.

Bene | hackers&wizards (26:32) Ha.

Daniel Jones (26:39) But for a long time, I was like, I just want to focus on what I do. I was a Java developer, and I was like, I don't want to know how any of this works under the hood; it's a waste of my time. If I put some knowledge in one ear, then some other knowledge will fall out the other ear; my brain has only got enough space for a certain number of things. And that was a really terrible mental model to have. When I started working at OpenCredo and Pivotal, working on cloud platforms, I needed to understand Linux better, I needed to understand infrastructure better, and it made me such a better developer. Knowing just the layer of abstraction below me meant that I could reason about what I was doing so much better. And importantly, when things went wrong, I knew how to debug stuff. Whereas when I was a Java developer only doing Java, if I got some weird error message from the operating system, I was like: I don't know what to do. Stack Overflow. And I think, with agentic coding, having an idea of what's happening under the hood means that, like you were saying earlier, when it goes wrong, you can try and fix it. What are the main things that you see people misunderstanding or not being aware of? We talked about context windows and not realizing they're overloading them. From a developer's point of view, what other things do you see that people aren't aware of, that maybe need the lights turning on for them?

Bene | hackers&wizards (28:05) Mm-hmm.
Yeah, one of the major discussions that I have in most of our sessions is that people have this idea that the agent should, on the first attempt, build it the correct way. But if they just think a little bit ahead: this is not the agent's first attempt, because an agent is doing loops, doing multiple attempts, until it thinks, okay, now this is somewhat presentable, or I'm stuck, or I have a question, or anything. So it is actually doing multiple iterations already before it presents you with whatever it presents you. And then you as a developer say, well, this is not working, or this is not the result that I expected. And I always say: if you develop software, you are doing it the same way, right? You are building something, and maybe you get it to a point where it works. And if you are a senior developer, you don't stop there and say it's done. That's what juniors do, right? That's one of the main differences, like what you described: not writing the tests, not doing the stuff, just deploying it to production. What you do is another iteration. You say, okay, let's look through that again. Where can I create an abstraction? Where can I make the code simpler? And you work with these iterations. Because you are human, you are super slow doing iterations. And what you need to figure out is how to give the agents the opportunity to do more iterations before you validate their results. One phenomenon that pointed this out quite clearly was Geoffrey Huntley, when he invented the Ralph loop thing: just basically doing the same prompt over and over again. People are now creating memory for it and so on, trying to improve it, but I don't know if it's an improvement. My experience was that just doing it without any memory is also super awesome already.
And basically letting the agent do more iterations on its own work until it presents you with something and says, this is the result now. Yeah. Daniel Jones (30:15) I think the Ralph Wiggum loop experiment was really insightful, in that the most important thing about it, I think, was demonstrating that the model and agent loop avoided getting stuck in a corner. I just remember reading the blog post, and I think Geoff mentioned that for about eight hours it seemed to be not making any progress, but because of the non-determinism in the model, eventually it shook itself out of that corner. It avoided the local maximum and found another path to proceed. Because we're all used to working with deterministic systems, I think maybe the assumption we all had was, well, if it gets it wrong, it's always going to get it wrong. And that doesn't seem to be the case. That's one lesson from the Ralph Wiggum loop and, you know, the whole building up an entire Gen Z programming language over three months unattended. But there was also some research earlier this year. I can't remember who it was. It wasn't the Cooper bench stuff. I think it was Anthropic, or was it? I'm going to have to check myself to make sure I'm not telling lies now. There was some research earlier this year that showed that when the more advanced models fail, they fail in different ways. So when more advanced models do more advanced tasks, they fail in different ways on each attempt. There's no structural failure where they always fail in the same way, which is what you would expect from a deterministic system. So what this suggests is that rolling the dice more times means that, okay, if it failed this time, it will fail in this way, but if I get it to try again, or to check its work, it's not necessarily going to fail in the same way the second time. So we just give it more attempts.
and it will, you know, it's not going to end up going down the wrong path and staying there. Bene | hackers&wizards (31:59) Yeah. I mean, as I said, I think that's one of the biggest things that people miss out on, right? This judging too early that the result is not good, and not doing another loop. But also, again, it's not that easy to figure out: is it worth doing another loop? Should I do another loop or not? I mean, it was the same for developers in pre-AI times. When I was refactoring my code, I also needed to think: is it worth doing another loop? Should I refactor it one more time? Is it better afterwards if I do one more abstraction, or is it not better? And I have more of a gut feeling, because I've experienced, okay, now it's enough for this thing that we are building here. If it's a prototype, I'm not doing any refactoring at all. We kind of don't have these experiences with AI yet. We just don't know yet, because we are not using it enough and we don't have the gut feeling yet. And people that use it more generally have a better gut feeling on, now we should do another loop and that will fix it, or we should not, because it will not get fixed this way and we need to do it differently. Yeah. Daniel Jones (33:08) Yeah, just before I forget, the paper I was talking about was from Anthropic and it's called The Hot Mess of AI. It's about how, when models fail, the more advanced models are more likely to fail in different ways each time. Some people have been posting that on LinkedIn and misinterpreting it as: more advanced models are more likely to fail. That's not what the paper says. Just on your point about knowing when to stop and when to continue iterating.
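The Ralph-style loop Benedikt describes is, at its core, just re-running one prompt until some completion check passes, relying on model non-determinism to shake the agent out of dead ends. A minimal sketch in Python, assuming a `run_agent` callable that wraps whatever agent CLI or API you use — all the names here are illustrative, not any specific tool's interface:

```python
import random
from typing import Callable

def ralph_loop(
    run_agent: Callable[[str], str],   # one agent attempt: prompt in, result out
    is_done: Callable[[str], bool],    # completion check, e.g. "do the tests pass?"
    prompt: str,
    max_iters: int = 50,
) -> tuple[str, int]:
    """Re-run the same prompt until the check passes or we give up.

    Because each agent run is non-deterministic, a failed attempt does not
    imply the next attempt fails the same way -- so we just roll again.
    """
    result = ""
    for attempt in range(1, max_iters + 1):
        result = run_agent(prompt)     # each run may take a different path
        if is_done(result):
            return result, attempt
    return result, max_iters

# Toy stand-in for a flaky agent: succeeds on roughly 30% of runs.
def flaky_agent(prompt: str) -> str:
    return "tests pass" if random.random() < 0.3 else "tests fail"

random.seed(7)
result, attempts = ralph_loop(flaky_agent, lambda r: r == "tests pass", "build the feature")
print(result, attempts)
```

In a real setup `run_agent` would shell out to a coding agent in non-interactive mode and `is_done` would run the test suite, but the shape of the loop is the whole trick.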
Daniel Jones (33:33) One of the things I've noticed about Opus 4.6 is that it seems to be doing a better job, because the comment I was going to make was that models, if you ask a model, can this be improved, are trained to always say yes. And yeah, you can pick holes in anything. You know, I've been working with a colleague who, when I push them proposals for a blog post or a book, runs it through Gemini and goes, Gemini, what can be improved here? Bene | hackers&wizards (33:46) Yes! Yeah! Daniel Jones (33:59) Gemini always answers with something. Gemini is never like, it'll do. So I was going to say that models like the 4.5 series from Anthropic would typically tend to always find something to improve. But I've been working on a little tool that I think of as local CI. Maybe we're on too big a tangent here, but one of the other things that a lot of people don't know and realize about agentic coding tools, awareness gaps is how I phrase it, is that you can get Claude Code or anything else to make some amazing code that's more complicated than an idiot like me can manage, but then it will forget that it left some files lying around, or it left some dead code. So I've built a little thing, which hopefully I'll be open sourcing soon, where you make a commit and then you define a series of prompts. Bene | hackers&wizards (34:38) Mm-hmm. Daniel Jones (34:47) It watches your main branch and goes, okay, there are some changes here, I'm going to now deduplicate all of the code. And then it will pass off to the next agent in the list with a prompt you define, going, I'm going to check for dead code, and so on and so forth. And I've noticed that Opus 4.6 is doing a better job. I see the logs of this little tool saying, nah, everything's fine here. So it's encouraging to see that they know when to shut up.
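The "local CI" idea Daniel sketches — a watcher that notices new commits on main and then hands the changes through a user-defined sequence of cleanup prompts, one agent session per pass — could look something like this. This is a hedged sketch, not the actual tool: the pass names, the `run_agent` callable, and the polling approach are all assumptions for illustration:

```python
import subprocess
import time
from typing import Callable

# User-defined cleanup passes, run in order after each commit lands on main.
CLEANUP_PASSES = [
    ("deduplicate", "Find and remove duplicated code introduced by recent changes."),
    ("dead-code",   "Find and delete dead code and leftover files from recent changes."),
]

def latest_commit(repo: str) -> str:
    """Current tip of the main branch."""
    return subprocess.run(
        ["git", "-C", repo, "rev-parse", "main"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def run_pipeline(repo: str, run_agent: Callable[[str, str], None]) -> list[str]:
    """Hand the repo to each cleanup agent in turn; returns the pass names run."""
    executed = []
    for name, prompt in CLEANUP_PASSES:
        run_agent(repo, prompt)   # one fresh agent session per pass, in sequence
        executed.append(name)
    return executed

def watch(repo: str, run_agent: Callable[[str, str], None], poll_seconds: int = 30) -> None:
    """Poll main; when the tip moves, run the whole cleanup pipeline."""
    seen = latest_commit(repo)
    while True:
        time.sleep(poll_seconds)
        tip = latest_commit(repo)
        if tip != seen:
            run_pipeline(repo, run_agent)
            seen = tip
```

The design point is that each pass gets a fresh, narrow session with one job, rather than asking a single long-lived session to remember every kind of cleanup at once — which is exactly the awareness gap the tool is compensating for.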
The other thing related to iteration, and to "should I iterate more or is this enough", is the behavioral change in people adopting agentic coding: if it does the wrong thing, throw it away. Either fix forward or do a git reset. It's getting over the feeling that, when you are coding things by hand, it's a long, big investment: you've thought about it, it's your code, you spent hours doing it. When an agent does something and you don't like the result, just go, okay, I'll throw that in the bin and try again. The agent's not going to be offended, and you're still going much faster than you would be. So encouraging people to start over again and wipe the slate clean is another thing I see. It's not so much knowledge, it's behavior change. They've got to change some of their assumptions about how they go about developing software. Bene | hackers&wizards (36:01) Yeah, and this again leads us back a few topics, right? What do we do in the training? We need to focus on that behavioral change, and not on, this is the tool and this is how it works. And what you mentioned about the model is also quite interesting, because earlier we talked about how an experience I had half a year ago is completely different now. For the models, when Opus 4.5 was released in December, it was this huge bump, but it was still very much a pleaser, right? It found everything, it still said, I can improve here and improve there. And 4.6 was released like a few weeks ago, right? There are only two months between these models. You need to realize that. And it's completely different. It's also pushing back on you. Like, this is not a good idea, you should not do it like this, my little human. I wouldn't do it like this. Daniel Jones (36:39) Yeah, yeah. Bene | hackers&wizards (36:49) And Codex is even more intense than that.
The 5.3 Codex model, I really like it a lot. I'm not a huge OpenAI fan, but this 5.3 Codex model is super, it's insane. It's really pushing back on you when you have bad ideas, exploring different angles and telling you to consider a different way. So it's really nice. And what you describe with this tool, and this throwing away the code and so on, this all often leads me to: what are we actually trying to achieve here? A lot of people are currently using this as a tool in their own workshop, right? They use it by hand. And the next evolution of these things, and I see that for some of our clients already, is trying to figure out how it can run autonomously without you needing to interfere. And this is actually where the fun is now, I think, because you need to figure out how you can configure it in a way that you do not need to interact with it all the time and say yourself, throw that stuff away and start over. The agent needs to find out on its own that it made a mess here and should throw the code away and start over. And this is a fun thing that we are also trying to incorporate in our training at the moment. Because on the conceptual level it's quite easy to explain theoretically, but it's quite hard to demonstrate and to really feel it, because you need to set up this whole pipeline with multiple agent steps, doing LangGraph or n8n or anything that will run the loops for you in the cloud and so on. Because if you just push it at it, like install the GitHub app and just push work to it, it will utterly fail, because it's just one session, the same as you do manually on your computer. It's not going to work, because it needs input and guidance, right? Because it doesn't have the loops and doesn't have the...
Bene | hackers&wizards (38:48) And currently we are trying to do that with simple scripting. For the training it works quite well. I mean, it's not the end of the line if you use it in production, but we simply script some prompts, like you said for your git commits, right? It runs some prompts, and this is scripted. So we can script doing multiple iterations and doing a verification afterwards. And this abstracts away from "I am building the feature" to "I am building the factory that will build the feature", and now I need to know how to build this factory. And now we are back to the reverse engineering part, because now I need to do the reverse engineering to know how to build the factory. I mean, the Ralph Wiggum loop is one factory, a very brute-force factory, just looping over it. But you would probably want to do it in a more optimized way, using cheaper models and faster models where possible, and so on. And this opens up a whole new world for software development. I really find this fascinating. I know I'm blabbering here a little bit, but yesterday there was this new study from Anthropic that was showing where agents are actually used, in which industries or in which sections of your company. They did that half a year ago as well, and software engineering was like 8%, in place 15 or something. And if you look at the image now, we need to share it later because it's just awesome: first place is software engineering with 93% or something, like this huge bar. And then the second bar is Daniel Jones (40:15) Hahaha Bene | hackers&wizards (40:26) marketing or something, with a bar of like 10%. It has exploded, because everybody is building these factories that are then building their stuff, right? It's insane. Daniel Jones (40:36) Yeah.
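Benedikt's "simple scripting" approach — wrap the agent in a script that runs an iteration, verifies the result, and throws the work away and starts over when verification fails — can be sketched as below. The callables stand in for an agent invocation, a verification step (for example running the test suite), and a discard step (for example `git reset --hard`); all names are illustrative assumptions, not anyone's actual tooling:

```python
from typing import Callable

def build_with_verification(
    generate: Callable[[str], None],   # one agent session working on the task
    verify: Callable[[], bool],        # e.g. run tests / linters on the result
    reset: Callable[[], None],         # e.g. git reset --hard to discard the attempt
    task: str,
    max_attempts: int = 5,
) -> bool:
    """Generate, verify, and discard-and-retry until verification passes."""
    for _ in range(max_attempts):
        generate(task)
        if verify():
            return True        # keep this attempt
        reset()                # throw it in the bin and start over
    return False

# Example wiring with stubs (a real setup would shell out to an agent and git).
attempts = {"n": 0}
def fake_generate(task: str) -> None:
    attempts["n"] += 1
ok = build_with_verification(fake_generate, lambda: attempts["n"] >= 2, lambda: None, "add feature")
print(ok, attempts["n"])   # prints: True 2
```

This is the scripted version of the "don't be precious, reset and retry" behaviour discussed above: the human judgment call of "is this worth keeping?" is replaced by an automated check, which is what turns a hand tool into a factory step.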
I mean, at Odevo there's a team that are building a software factory, having gone through this training and seen the direction of travel. And you mentioned they're building the factory that builds the features. For years this has frustrated me. You know, my background was in cloud platforms and organizational transformation, so trying to help people go faster once they had good cloud platforms and automated paths to production. And I would constantly say to people: the thing that a lot of engineering managers and engineering leaders don't realize is that their job is not to make sure that features get shipped. Their job is to build the machine that ships the features. And I'm kind of both happy and sad that Bene | hackers&wizards (41:19) Yes. Daniel Jones (41:24) people just didn't realize that, and it didn't happen enough. People weren't doing things like value stream management and looking at their processes and optimizing them. But now we're going to have engineers who aren't writing low-level code anymore, who are really interested in optimizing things, tweaking them, making them go faster. And those people are going to be the ones building the factories that ship the features. So we might finally get to a point where people care about flow efficiency and throughput and all of those kinds of things. Bene | hackers&wizards (41:31) Yes. Mm-hmm. Daniel Jones (41:53) It's just that it will be the engineers doing it, because they're nerds and they like tweaking things. Hopefully the managers above them, who never realized that that was their job when they were running a department, won't cause other problems and obstructions higher up the chain. Bene | hackers&wizards (41:56) Yes. Yes.
But this is so interesting, Daniel, you know, because there's so much similarity between these two things now. And this is hard for engineers too, because now you basically need to create an environment where features are being built quickly and in great quality. Previously you needed to build an environment of an organization: people, departments, communication flow between people, and processes between the people. And now you need to do the same for an agent system, right? But what is interesting here is that for most of the clients that we work with, and I mentioned these bottlenecks before, it's the same thing. You get this bottleneck because your machine, in this case your organization, is not set up the right way to spit out features very fast, and you need to change it. So very quickly there's organizational change, organizational design involved when you do agentic engineering, because you need to reshape your organization. And often architecture is in the way too, because if you have a monolithic system and you have five teams working on this monolithic system with agentic engineering, that is not going to work. They are going to have so many merge requests and merge conflicts and so on. Deploying every single minute, if the deployment takes an hour, is also not going to work. And you can kind of see the similarity between Daniel Jones (43:33) Yes. Bene | hackers&wizards (43:39) this and then building this machine out of agents. And I feel that you are right about managers. I mean, for humans it's hard to change themselves, and shaping this human machine is not that easy, and building this agent machine will probably be much faster. So yeah, probably engineers will be able to figure it out, because agents are not like, man, I don't want to, I don't want to.
Daniel Jones (44:04) Yeah. Yeah, they're not change resistant. Their livelihoods are not going to be threatened. And we should have perfect observability. You know, we can see that Claude Code's chat history logs contain a summary of how many tokens and turns are involved. So you would be able to run an experiment through your software factory and see how much effort it took. And importantly, Bene | hackers&wizards (44:08) Yeah, yeah, yeah. Yes. Yeah. Daniel Jones (44:30) those agents don't necessarily learn, and if they do, you can reset it through Git history. So you can run experiments, and you can run the same feature through again as a control experiment and see, okay, we're going to tweak this prompt and see whether it works better. Bene | hackers&wizards (44:42) Yes, is my new organization better than my old one? Yeah. Yeah. No, no, no, you can't. Yeah. Daniel Jones (44:45) Yeah, and you can't do that with humans. Not unless you take them to some kind of CIA secret base and, you know, give them the Men in Black flashy thing to wipe their minds. Bene | hackers&wizards (44:53) Yeah. And it's also too expensive, right? You could do that like A/B testing: having two organizations doing the same thing and then figuring out which organization worked better because of the different structures. Think back to our transformation projects in companies, where they change the whole organizational structure and so on. They don't know if it's going to work, right? They can't do experiments at all. They can do, like, we do it with one team first and then we see, but one team is different than 40 teams. And now you can do that. I actually pitched that in November 2025, I was pitching that idea to one of the big five, that Daniel Jones (45:26) Yeah.
Bene | hackers&wizards (45:33) they could build a system that can do A/B testing on organizational change, because they can kind of simulate where the organizational change will end up. Like you said, you have full transparency. You can see what the result is going to be. And I think that is far ahead in the future, or maybe not that far, but we will have that. We will have these digital organization factories that can be A/B tested, figuring out which one will provide the cheaper or faster or better quality result. And then you can pick: I need a fast one today, so grab the fast one and that will do all the marketing stuff for you, very quickly. And the other one will be more expensive, but have more quality, and so on. Daniel Jones (46:16) You mentioned that kind of future with software factories, and for those of us really immersed in this scene, it is very exciting at the moment. That seems to be the current edge of discovery. But taking it back a bit, because it's very easy for us in this bubble to assume that everyone's building a software factory, to the kind of people that we end up delivering training to. We talked about Bene | hackers&wizards (46:36) Yeah, sorry. Daniel Jones (46:42) some of the misunderstandings from a developer's point of view, like not being aware of context management, or how tool calls work, or the difference between what the agent is responsible for and the model. Can you think of any things that people misunderstand at an organizational level? What kind of assumptions do engineering managers and team leads come in with that don't hold, that you find you have to correct them on or, you know, enlighten them on? Bene | hackers&wizards (47:08) Mm, that's a good question. I think, yeah. No, no, go ahead. Daniel Jones (47:10) I mean the... Go on.
I was going to say, you know, the DORA report on the state of AI-assisted software development 2025 was quite useful in this, and it definitely resonates. Developers have a reputation for being quite, the word is high maintenance. I don't want to use the word diva, but, you know, you must make sure your developers are happy. And therefore there's been a lot of conflict avoidance in software development teams. Bene | hackers&wizards (47:18) Mm-hmm. Daniel Jones (47:43) If, for instance, your team doesn't have coding standards, or worse still, they have coding standards but don't agree on them, there is no way that a coding agent is going to be able to deliver code that people universally agree is good. So there's an amount of alignment about what good engineering is. If you are aligned on that, agentic coding is going to make you go much faster. If you're not aligned, then you're going to expose a lot of problems and a lot of disagreements, and it's probably going to make you slower. And I think that's maybe an area that engineering management need to think about. You don't necessarily need to get that sorted before you start adopting these tools, but you definitely need to start fixing it as a priority as you adopt them. Otherwise you're going to have a bad time. Bene | hackers&wizards (48:29) Yeah, that actually really helps me, because I was thinking about what to mention, and I was going more into the adoption area, that adopting it is quite hard in the wrong environment. I really enjoy this report, because it shows something that is quite obvious but also quite scary if you look at it with a little bit of distance: you see that companies that are already using good practices are getting much, much faster using AI with those good practices.
But if you are not using the good practices that are described in DORA and the Accelerate book, trunk-based development and small batches and so on, we all know this stuff, and then you throw AI at that, then you are getting worse, quicker. First, that is actually logical, right? It's similar to hiring more people: if you are good and hire more people, it will be faster, and so on. But what is scary is that this means the gap between the well-performing companies and the not-so-well-performing companies is getting even bigger with AI. And that is what we actually see in reality. And this also brings me back to these engineering managers, because often, and not only engineering managers, but senior leadership or the board, even board members, investors, they say: you need to do agentic coding now, so please hire someone or find someone that will train your people to do agentic coding. And they are talking about the coding part only, right? Bene | hackers&wizards (50:07) Do coding with AI. And the first thing that I tell these people is: yeah, but is coding your bottleneck? And they say, yeah, we have so many stories, and so on. But then you look at the architecture, you look at the organization, we mentioned that before, and you immediately see: AI is not your problem here, right? Could we please fix the other stuff and then do AI? Or, as you mentioned, at least say, let's use agentic AI to accelerate ourselves toward these better practices, and not Bene | hackers&wizards (50:34) use it without changing the practices. And that is actually working. I had a hard time the last years selling organizational change, architectural change. People are very hesitant to invest in that or to pay money for that. And now, with AI, it's like, we need to change? Then yeah, it's fine. It's okay. Go ahead.
Daniel Jones (50:51) Yeah, and I think that's a really good starting point. If you've got somewhere that maybe isn't as mature in terms of methodology as it should be, a really great place for people to apply what they've learned about agentic coding is: okay, let's build some tools that allow us to aggregate logs faster so we can debug issues. Let's build some tools that speed up our path to production and get all of that cruft sorted, so that they can then start applying it to their actual production code. And you mentioned the gulf there, that low-maturity teams will get slower and high-maturity teams will get faster. The both interesting and scary thing, depending on which side of the coin you're on, is that we're not quite sure where that dividing line is. We know that bad teams will get slower and good teams will get faster, but where exactly the tipping point is, is Bene | hackers&wizards (51:17) Yes. Yeah. Yeah. Daniel Jones (51:44) yet to be defined. And you mentioned looking at people's architecture. I think that's something customers don't necessarily realize: this isn't just about training and education, it's about behavior change. It's about methodological change, process change. We need to look at the state of your systems, how you develop software at the moment, to work out how we're going to move you from where you are to where you're going, looking at things like people's stories in Jira and their test coverage. So I guess the advice to anyone considering going on this journey, whether or not you engage the likes of either of us to help you with that, is: start looking at your path to production and your development process, and work out what's not working and where you need to improve. Because if you don't do that and you just buy everyone Claude Code licenses,
Bene | hackers&wizards (52:11) Yes. Yeah. Daniel Jones (52:37) then you're going to speed up one tiny part of the system, and it's going to create bottlenecks and problems elsewhere. So, tell us, what's the name of your company? How do people reach out if they want your services? Have you got any talks coming up or anything like that? How can people engage with you? Bene | hackers&wizards (52:52) Yeah, thank you so much for giving me the opportunity. So yeah, we are hackers&wizards. We have a website, hackersandwizards.dev, and I'm on LinkedIn, and I'm going to be at a few conferences again this year. So yeah, you will be able to find us online. If you need help, I'd be super happy to follow up on some of the things we discussed and go more into detail. Yeah, I think that's it. Daniel Jones (53:17) Cool, right, well, that seems like a nice clean break point. I'm absolutely certain we could carry on talking more, but maybe let's save that for another episode, because it's been fun and I'd like to do it again. Bene | hackers&wizards (53:23) Yeah. Yeah, thank you. Me too. It was super fun. Daniel Jones (53:28) Excellent, thank you very much Benedict. Bene | hackers&wizards (53:29) Bye bye. Daniel Jones (53:32) Hopefully you enjoyed my chat with Benedict there. I certainly did. But then I always say that, don't I? So now you're going to think I'm lying, and that actually I hated talking to Benedict. I'll let you be the judge of that; I think you can tell from the energy in the conversation. We talked about a few different blog posts and studies there. There was one that came out of Zurich, a paper called Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? I referenced the Hot Mess of AI paper that came out of Anthropic. I also mentioned a paper called NoLiMa, which was about context limits, and that was by Adobe in 2024.
And then Benedict also mentioned Measuring AI Agent Autonomy in Practice, a blog post based on a paper out of Anthropic that was published in the last few days, this February, so I'm recording in 2026. And I mentioned a paper called Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs, which looks at how much of your context window is actually useful to you, and how reasoning ability drops if you go over that limit. So if you fancy doing some academic research and reading up on those papers, there you go. Hopefully we'll put those in the description as well. It would be good to get feedback on the episode, so if you would like to email wavesofinnovation@re-cinq.com, that's R-E, dash, C-I-N-Q dot com, that would be lovely. It would be interesting to hear what kind of things you'd like more of, whether you'd like to go really in depth on any particular topic, like what kind of things developers should know if they were going to adopt agentic coding, or if you'd like something completely different. Either way, it would be lovely to hear from you. Be good to each other, and you'll hear me in the next one.

Episode Highlights
Senior leaders reclaim the joy of building by using agentic tools to bypass environmental setup friction.
Effective training must prioritize the unhappy path to teach developers how and why models actually fail.
Reasoning drops sharply after 30,000 tokens, requiring engineers to prioritize aggressive context curation for accuracy.
Developers must reverse engineer model defaults to effectively steer agentic outputs toward project-specific architectural requirements.
The rise of Software Factories shifts engineering focus from manual coding to designing autonomous production machines.
AI acts as a multiplier, accelerating high-maturity teams while slowing down organizations with existing systemic bottlenecks.
Aligning on coding standards is a prerequisite for successful department-wide adoption of agentic engineering toolsets.
https://re-cinq.com/podcast/bridging-the-skills-gap-insights-from-agentic-coding-training



