
Podcast

Jan 29, 2026

Evals, reducing hallucinations, & AI-native development


agentic workflows

ai evals

documentation registry

model hallucinations

ai native

system steering

In this episode, Deejay sits down with Amy Heineike, founding AI engineer at TESSL, to explore the structural shift toward AI-native development. They discuss the necessity of machine-optimized documentation registries to eliminate agent hallucinations and the cultural transition from deterministic logic to a biological science mindset. Amy details the mechanics of building evaluation harnesses, the pitfalls of contradictory steering, and how the role of the software engineer is evolving into a high-level architect of intentional outcomes and anti-fragile systems.

Hosted by

Deejay

Featuring

Amy Heineike

Guest Role & Company

Founding AI Engineer @ Tessl


Episode Transcript

Daniel Jones (00:00) Amy Heineike, founding AI engineer at Tessl. It's great to have you with us. What are you and the folks at Tessl doing at the moment?

Amy (00:07) Hi, Daniel, great to be here. So we are building tools to help people who are using coding agents every day. We've released a registry of documentation for open source libraries, and we publish what we call evals on those to show that it makes the agents work a lot better. So they hallucinate APIs less and use libraries as they should be used, which is cool. And then we've got a growing set of tools for people, especially maybe in enterprises, who are trying to figure out how to roll out their agents and make them work better. So adding documentation for your own repos, and other stuff that we'll be releasing soon.

Daniel Jones (00:43) Avoiding hallucinations when you're coding against other people's APIs is definitely a useful one. Because the most frustrating thing is that models seem to hallucinate things that it would be really great if they did exist, but they don't. And then when you find out they don't, not only have you wasted that time, you're also really disappointed.

Amy (01:01) Yeah, you're really disappointed. I think they kind of tell you what they wish the right answer would be. So my job at Tessl: I'm on our research team, and I spend time trying to work out what on earth agents are doing. I built up a lot of our eval tooling so we can measure what they're doing, and measure what they do under different conditions, so if we give them better documentation or not, for example. And then I'm also prototyping new ideas that feed into our future-looking planning. I get to tinker with different ways of using agents and check whether that makes them better or not.

Daniel Jones (01:44) Awesome.
So that means you get to do the exciting top-secret stuff. Because, I mean, it's such a fast-moving area that, you know, I should imagine that sticking your flag in the ground of "this is our product, this is the only thing we're ever going to do" is probably not a sensible strategy this early in the AI revolution.

Amy (01:49) Yeah. I think, you know, Tessl's fairly early still in its journey. We're still very focused on learning quickly. But as you said, the space is moving so fast. It's completely wild. And I think every one of us who's working in the space feels like every two months you have a major revelation where you're like, oh wait, now I can rethink how I do everything. I've just been realizing how incredible skills are as a way of packaging up software, and playing with that a load. But yeah, it's totally crazy. And in a way, I think we're quite lucky being in a small startup, because we can just go whole hog into everything. So if there's a cool thing that's starting, we dive into it feet first and just try it. We don't have too much legacy that we need to be careful of. And so I think we're taking on some of that experimentation, because we then know, when we talk to people who are doing this in big serious companies, where they've got a lot of legacy or a lot of very high-stakes software that they're working with, that they're also wrestling with this pace of change and worrying about what they should use, what they shouldn't, and how to reason about that.

Daniel Jones (03:18) Yeah. And just on the subject of the folks in big enterprises, who are like the people most in need of these tools and this assistance: they're the ones with the least amount of time to experiment, because they're under feature pressure. So it's great having, you know, startups that are dedicated to exploring this space to do that work.
Otherwise it would just be the odd engineer in an enterprise on a Sunday afternoon, taking some extra time to look at things. With the main Tessl product, the Tessl tool as it stands at the moment, would you be able to describe what that does? And then maybe from there we can segue into why evals are important for you folks, and to get the most out of the tool.

Amy (04:03) Yeah, absolutely. So we've got a CLI tool that you can install, and once you've got that in, you can search for and install documentation very easily. And then there's an MCP server that you can start up, so your agents can get documentation via our MCP tool, where we've got a really optimized way of reading the docs, rather than the agent having to read lots and lots of text and stick it all into its context and kind of pollute it. It can go via this tool to get summary versions. You can also tell the agents to just read the full docs if you want to. So that's the thing that you can just get going on at the moment. You can also go in, if you've got your own repo, and connect that in and have it make documentation for repos that you're using internally. So there's a registry, basically, and you can have private registries on there, so you can start sharing documentation with your teammates safely. So, yeah, one thing that we've realized is that while Git works great for all of our code, it's a little bit painful for documentation, the markdown files that you would just want to give to agents, because you often want to publish those in lots of different repos at the same time. You don't necessarily want to tie a file to one repo and then need to create some mechanism for copying and pasting it all the way around.
So it's quite helpful to have a separate registry where you can stick your markdown and then pull it in for different agents. So I've talked a bit about documentation. You can also do this with things like style guides and other kinds of instructions or ways of working that you want to share with your agent. We also have a spec-driven development tile that you can install. This is basically documentation that will steer your agent to write detailed specs for any code that you're writing. And we've found that that often helps ground the agent a little bit more, so you can write down your intent nicely, make sure that that's very clear, and then you'll get more predictable behavior when the agent goes off and starts writing code. So again, this is a set of instructions you can install for your agents easily from this command-line tool. So all of that is free, and then you can also contact us if you want to talk about other kinds of things you want to roll out internally within your org, to make it easier to get agents working well for you.

Daniel Jones (06:25) Yeah, so it sounds like you folks are very much interested in figuring out what works for people, and open to new ideas. So with the Tessl registry, that's primarily around providing more of the right context to the model. So when you're asking it to make a code change, the model has documentation, or a summary of the API that you're using. I can't quite remember where I saw it, but

Amy (06:48) Yeah.

Daniel Jones (06:51) I read somewhere that the kind of sweet spot for hallucinations is anything since the model was trained, so anything that's not in the training set because it's too new, and anything that's maybe older than six years or so, because that's going to be poorly represented in the training set.
So with any cutting-edge API, if you want to use the latest version of a library, it's quite likely to hallucinate, or give you

Amy (06:58) Yeah. Yep.

Daniel Jones (07:16) old versions, or it will think about old versions of APIs and method calls that no longer exist or have changed. And so either you are not aware of that and you let it hallucinate, or, in the kind of best case, you tell it: can you do a web search on the latest docs? I want to use this library, please go and check the docs and the API docs that this is actually a thing. Which is a bit laborious and tedious. So

Amy (07:19) Yeah.

Daniel Jones (07:43) the Tessl registry kind of takes the pain out of that. Is it more of an automatic process behind the scenes, where the agent and the model are figuring out for themselves how to make use of your tool to find the right context?

Amy (07:53) Yeah. So we've automated the processes for publishing documentation, tied to the version of the packages that you're using, so you can then manage which versions. You have a tessl.json file that goes into your repo and records which version of each of these tiles you want to have installed and ready to read. So yes, we can tie it up really closely. And yeah, as you said, I think part of the value is the fact that you're avoiding the hallucinations. The other part is that sometimes we find agents are actually quite smart: as you said, they'll search the internet for the docs, or they'll go through node_modules, if you're doing something like TypeScript or JavaScript, and they'll read the code base of the thing that they're trying to use. It's just not super efficient if you do that. So if you imagine that every time you're trying to use a library, the agent is going off on this search to get to first principles about what it does, it just doesn't feel that efficient.
So if we can do that once, basically, and turn it into docs, and then you can reuse that every time you need it, that seems a lot more efficient.

Daniel Jones (08:54) Yes, I'm just thinking about the token usage of having your agent go and trawl web pages these days. I've got some friends who are old-school web developers and are constantly complaining about web page bloat. Trawling through all that HTML, probably multiple pages full of it, and sending that backwards and forwards to a model going, can you pull the useful bits out of this please? That doesn't sound like a good usage of tokens. And so with that tool, you've got your tooling, which is creating these documents, these summaries, the tiles, that represent different dependencies. But a lot of the ability to deliver value to the user there is dependent on the agent and the model, so things you don't necessarily control.

Amy (09:26) Yeah.

Daniel Jones (09:39) And so I guess that brings us on to evals, and how you folks figure out, when you've got different moving parts you're not in control of, whether the things you're doing are having a positive impact, whether they're having a negative impact, whether someone's changed something outside your control and that's derailing things for you.

Amy (09:56) Yeah, so I should first of all say this is the situation that we're all in, right? Coding agents are so powerful, but also they're changing in ways that are tricky to follow. And they're not always explicit about what they do. And we sometimes find that the docs that say how they work don't seem to totally line up with how they work anymore. The pace that the labs are going at is incredible. I mean, Claude Code has three releases a day or something. We've had major version updates within a month. And then the underlying models are switching all the time, and you can just use a picker and switch between them.
And so you've got this interaction between the harnesses and the models. So I think this is what everybody is dealing with. And part of the value we think we can bring is not magically saying that we're going to solve all of this, but making it easy for people to understand what is going on. If you're rolling this out to a team, you want to have some kind of view of what the hell is going on: where are the problems that you should be worried about, and where are the problems you don't need to be worried about. So evals are really important. The idea of an eval is that you set up synthetic scenarios. So we write down a bunch of different scenarios, and in each one of them we give a coding agent a task to do. There are a few different ways we can pick what those tasks are, depending on what type of problem we're trying to narrow down and understand. So we give the agent a task to do. We watch it do the task. We gather its logs. We gather what it did at the end. And then we have measurements, or grading criteria, that measure: did it do what we asked? Then we can get a final score. So we can say, in this scenario, this percentage of the time it matched the criteria. And because we have a harness around it, we can run it multiple times, so we can measure a bit of the variance that we see: how predictable it is. Because we have a range of scenarios, we can maybe look at whether there are certain types of scenarios where the agents fare better and certain types where they don't work as well. But we can boil it all down to a score. So then you can look: if I add the documentation or I don't, how does the score change? If I change the model I'm using, how does the score change? If I change the agent I'm using, how does the score change?
So for each bit of that, the design of which scenarios we use, how we grade them, and what kind of score is useful, we can hone in on the specific question that we're trying to answer. We've thought a lot about this, and we could talk about it for hours, but generally that's the approach. These are the elements that we need to debate: what are the scenarios, how are we grading them, and what are we scoring or comparing across?

Daniel Jones (12:41) It's just interesting that, I mean, that's quite different to what a lot of software developers are used to: hard logic, where a thing either passes or it fails, and if everything passes then we proceed to the next step. This is much more like the kind of thing that presumably data scientists do, where you're running experiments, looking at confidence intervals and the statistics involved, and trying to find thresholds above which you believe that, you know, if it works this percentage of the time then we're happy with that. Which is both a different skill set, presumably, and also a slightly different mindset.

Amy (13:16) Yeah, that's such a good point. And it feels like this is one of the hardest things about moving into agentic software development, and into building products that have AI as a big component: realizing that you're having to make this kind of mindset shift, this kind of cultural shift almost. It feels less like engineering; it feels a bit more like being a biological scientist or something. I think the reassuring thing is that in something like the biological sciences, where do you start? You start with a complicated beast that you're trying to understand. You don't start with small pieces that you build up. It's still possible to get...

Daniel Jones (13:44) Hahaha

Amy (14:02) a sense of understanding of what's happening, but it just looks very different.
So I think, you know, with unit tests, if you fix the unit test, the unit test stays fixed, which is quite nice. But for eval cases, what we can do is build up very granular measures of "in this scenario, does this thing work?", but there are lots of things that might affect it. So it's not like, if it sometimes fails, that's a bug you can just go and correct. What we're looking for is a basket where we can say, okay, across this whole selection, what's the average success, and can we make the average more effective? Or can we characterize the things that the model is just not good at yet, versus the things that we can fix and improve? So yeah, it's a different head space to be in. And I think for a lot of people, having worked with agents and AI tools for a while, it's starting to percolate in, where it's not quite so weird. You kind of get used to the idea that these models are a bit unpredictable, but it's a big shift.

Daniel Jones (15:12) Yeah, and it's making me realize that maybe us software engineers (I probably don't get my hands on code often enough to include myself as a software engineer) are the weird ones: everybody else has to deal with this non-determinism and having to look at averages. You were talking about biological sciences, and I was thinking it's also kind of the same plight for product managers, you know, they don't

Amy (15:27) Yeah. Hmm.

Daniel Jones (15:36) get to say, this feature is definitely what the market wants. It's instead: well, you know, we're seeing this metric go up and this one go down; we think in this place it works and in that place it doesn't. So maybe software engineers have just had it lucky, that we've been able to determine things with such a degree of certainty.

Amy (15:46) Mm-hmm. Yeah.
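The loop Amy describes (run each scenario several times, grade the outcome, and report a pass rate plus run-to-run variance) can be sketched in a few lines. This is an illustrative shape only, not Tessl's actual tooling: `run_agent` and the grading criteria are hypothetical placeholders for whatever launches your agent and inspects its output.

```python
import statistics

def grade(result: dict, criteria: dict) -> bool:
    # Hypothetical grader: in a real harness this would inspect the agent's
    # logs, files written, or final answer against the scenario's criteria.
    return all(result.get(k) == v for k, v in criteria.items())

def run_eval(scenarios, run_agent, n_runs=5):
    """Run each scenario n_runs times; report pass rate and variance."""
    report = {}
    for name, (task, criteria) in scenarios.items():
        scores = []
        for _ in range(n_runs):
            result = run_agent(task)          # one full agent run on the task
            scores.append(1.0 if grade(result, criteria) else 0.0)
        report[name] = {
            "pass_rate": statistics.mean(scores),
            "stdev": statistics.pstdev(scores),  # how predictable the agent is
        }
    return report
```

Comparing two reports produced with and without documentation installed, or across models, is then a matter of diffing the per-scenario pass rates, which is the "how does the score change?" question Amy poses.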
Daniel Jones (15:56) Yeah, so on the tooling there: I think when we've chatted in the past, you've mentioned that you're building your own tooling. Is that right?

Amy (16:04) Yeah, that's right. We also keep looking at what tooling is out there. So there are some projects; there's one called Terminal-Bench that we've found very useful and are using pieces of. But we have built a lot of our own harnesses as well. Yeah.

Daniel Jones (16:19) So if people were building agents themselves, rather than using coding agents, and were trying to work out how to tell whether theirs does the right thing most of the time: from what you've seen of the open source marketplace, are there go-to libraries that you'd recommend, or is it still sort of being worked out, do you think?

Amy (16:42) It's a super good question, and I think it's moving so quickly that it's very hard.

Daniel Jones (16:49) That's a good point. It takes a couple of days to edit these episodes, so by the time this one is edited it's probably going to be out of date anyway.

Amy (16:56) I think it's worth making the point that what we're grappling with is moving quickly too. A year ago, if we were talking about evals, we would be talking about evaluating individual model calls. So we'd be thinking about having a set of example prompts that we feed to a model, and then we grade the output of the model: text in, text out. And the grading we'd be doing for those outputs would probably be a combination of LLM-as-judge, so asking another model whether the output matches some criteria, or maybe something like similarity to a golden answer that we've written somewhere. Maybe there's some other deterministic check, that the answer includes certain words or something. So we'd be looking at a system like that. And I think those have gotten quite mature.
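The two non-LLM grading signals Amy mentions, a deterministic keyword check and similarity to a hand-written golden answer, are simple enough to sketch with the standard library. The function names and the 0.6 threshold here are illustrative assumptions, and the LLM-as-judge signal is deliberately omitted since its prompt and API are deployment-specific.

```python
from difflib import SequenceMatcher

def keyword_check(output: str, required: list[str]) -> bool:
    """Deterministic check: does the output mention every required term?"""
    low = output.lower()
    return all(term.lower() in low for term in required)

def golden_similarity(output: str, golden: str) -> float:
    """Rough string similarity to a hand-written 'golden' answer, in [0, 1]."""
    return SequenceMatcher(None, output.lower(), golden.lower()).ratio()

def grade_output(output: str, required_terms: list[str],
                 golden: str, threshold: float = 0.6) -> bool:
    # An LLM-as-judge call would slot in here as a third signal.
    return (keyword_check(output, required_terms)
            and golden_similarity(output, golden) >= threshold)
```

In practice the golden-answer similarity is usually done with embeddings rather than character-level matching; `SequenceMatcher` just keeps the sketch dependency-free.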
For example, inside the OpenAI platform they've got a whole evals product where you can capture traces, capture those inputs and outputs, and I think you can label them and mash them around. So there are ones inside the labs. Braintrust, I think, is one that's popular. We've used an observability platform called Langfuse, which is open source, and that has an evals feature built into it, where you can monitor traces, capture them, and then put them into a dataset that you evaluate. So I think it's getting a lot more mature for these individual prompts and seeing what they do. We've moved up the tiers of difficulty now that we're talking about agents. Depending on what you mean when you say "agent", we might be asking an agent to do multiple steps of reasoning. We might be giving it a bunch of tools. We might be giving it a load of different files. And it might be running for quite a long time.

Amy (19:13) Yeah, we might be interested in just what it does if you give it a prompt and leave it on its own. Or we might be interested in what happens in a session where you start it running and it comes back to the user, the user responds, and you go back and forth. So this is the level of difficulty: we've jumped way up here. We might have something that's long-running and expensive, and there might be different parts of it that we care about. So we are seeing more tools coming out around this. Terminal-Bench, I think, is really cool. They've released benchmarks, and they've built this harness where you can run big, different benchmarks: you're basically setting up an agent, giving it tasks to run in a terminal, and then marking what it does at the end. We've built some tools internally too.
So I've worked on one that we called AutoCode for a while, which has been fun, where we have a simulated user as well as the coding agent. With the simulated user, you tell it how you want it to talk to the agent, then you set up your agent, and then you have them chat to each other. And it's super cool: the simulated user tells the agent what to do, and then when the agent has questions, it comes back to the simulated user and the user responds. You watch them go back and forth, and then the user decides when the task is complete.

Amy (20:03) And so that tickles the agents in a very different way. So we could give the agent an instruction to say, you know, ask the user lots of questions before you start the task, and then we need to be able to see whether it does in fact ask questions before it starts the task. So we need to have this kind of automated user as part of the eval. And recently Anthropic released a tool, I think it was called Bloom, which was looking at something very similar. I think they have a simulated user in there, but they're looking at how well aligned an agent is with its overall steering: does it do harmful things, is it rude? You know, they have these good-behavior kinds of metrics that they grade. So I think we'll see more tools that have that kind of thing built in. So yeah, if you're starting out wanting to grade what an agent is doing, the first thing is to think really carefully about what you want to measure, and how focused you can be on measuring just that thing. Then you can work out what tooling fits: can we simplify it all the way down to just looking at a prompt, in which case the tooling is really easy, or do you want to measure a system that might be more complicated? And if it's very custom to your whole system, maybe the best thing is to make your own harness to run it, and use little tools as part of that to make it easier.
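The simulated-user setup Amy describes reduces to two models taking turns until the "user" declares the task done. A minimal sketch of that loop, assuming hypothetical `user_model(persona, transcript)` and `agent_model(transcript)` callables that wrap whatever LLM API you use (both names are placeholders, and the "DONE" sentinel is an assumption):

```python
def simulate_session(persona, user_model, agent_model, max_turns=10):
    """Let a simulated user and a coding agent chat until the user is done."""
    transcript = []
    message = user_model(persona, transcript)     # user opens with the task
    for _ in range(max_turns):
        transcript.append(("user", message))
        reply = agent_model(transcript)           # agent answers or asks back
        transcript.append(("agent", reply))
        message = user_model(persona, transcript) # user reacts to the reply
        if message == "DONE":                     # user decides task is complete
            break
    return transcript
```

The returned transcript is what you grade: for instance, the "asks clarifying questions first" behavior Amy mentions could be checked by counting agent turns that end in a question mark before the first code-writing action.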
Daniel Jones (21:22) It's interesting how the degree of complexity explodes, really. When you think about the interaction between just an agent and a model, and you're looking at that one prompt and whether you get the right thing back, that seems within the realms of what a lot of folks are used to testing, certainly in enterprise software.

Amy (21:26) Yeah. Yeah. Yeah.

Daniel Jones (21:45) Then when you go to the interaction, whether the agent is doing the right thing, that's another layer of complexity. And then you add in the vagaries of what the user might end up doing. All of a sudden, that reminds me of how testing in video games is somewhat hard, because of the inputs and the responses; certainly in the pre-AI world, that was very hard to automate, and you end up with people just doing weird and random stuff. I'm going to go off on a tangent now. I was lucky enough to do some work experience at a video games company called Codemasters, and I was in the testing department for a little while. And testing video games is nowhere near as much fun as you might think. It was great for me as a 15-year-old, though. We were play-testing Pete Sampras Tennis, I think, on the Mega Drive, a very old 16-bit game.

Amy (22:11) Yeah.

Daniel Jones (22:30) And one of the testers, who had been going mad playing tennis constantly, all day every day for weeks, on this console, decided, in a fit of, you know, mania, that he was just going to walk up to the umpire and constantly swing his racket at the umpire. Found a bug, you know: the game crashed, and presumably some value overflowed or something like that. How would you think to even test for that? So I think we're kind of in this world where

Amy (22:35) Yeah.

Daniel Jones (22:57) you have an unbounded number of possible inputs, and the idea of protecting against all of them is probably impossible.

Amy (22:57) Yeah. Yeah.
Yeah, absolutely. And there are two ways I think about that. One of them is that maybe we can learn some principles that we can apply and understand. One thing we noticed that was very interesting: if you give an agent a bunch of instructions, system-prompt kinds of instructions, like your Cursor rules or your CLAUDE.md, and you put a lot of instructions in there, and you then ask the model to do a task with just a very brief description of the task, it will tend to follow all of the instructions. But if you give it a really detailed discussion of the task that you want done, it will tend to ignore most of the instructions.

Daniel Jones (23:46) Which is kind of the opposite of what you want. You're going to the effort of like...

Amy (23:50) Yeah, because, you know, the good behavior is to give the detailed task instructions, but when you're doing that... I think that one of the key tasks of an agent is weighing its context and deciding what to ignore and what to put weight on. And you kind of want the agent to be making these decisions, right? Being like: oh, the user was very prescriptive about this task, so probably that's more important than thinking really hard about all of these other rules they gave me. Maybe. But it makes it very tough. So the first thing is, if there are a few of these principles that you're aware of, that can help you reason a little bit about what's going on. I think the second thing ties into observability a little bit. And I think it's a really great call-out you made about this being a bit like the product management world, right? Realizing that users can do all kinds of crazy stuff, and we give them an AI interface and they can do all kinds of crazy stuff.
Daniel Jones (24:47) Hahaha

Amy (24:48) And I think, you know, in the ideal world, you have some observability over what is actually important to your users and what they're trying to do. And then you can see how agents are responding to them. You can watch that flow, and then you can go and dig into it and discover where it's going well for people and where it's not going well for people, and those are the things that you anchor on. You don't worry about every edge case in the world, because you can't possibly map it all, but you can think about the kinds of behaviors that we care about, and whether, in those behaviors, the whole thing is doing what we wanted. I do think, and I think we might have talked about this before, there is this interesting thing where, in that case, building the eval doesn't feel like a separate technical task. It feels like defining the product. So it feels like a task where you want to have this product-design hat on, where you're thinking: what are the behaviors that we care about? What are the expectations? And it's almost like, in the past, maybe we used to spec out user journeys: you could show the pages on the website, and, you know, the user clicks here, they click here, they click here, and this should all be smooth. Or maybe your QA team is going to check that that all works. And now we're encapsulating a similar thing as an eval. If a user comes in and asks this kind of question, and they're in this kind of state, here are some expectations around what a good outcome for them would look like. And then if we can capture that as an eval scenario, we can start measuring: is that happening for them? Or maybe we can measure in logs and observability whether that thing is happening for our users generally. And then we can know whether this is something we need to worry about. If it's not happening, does that mean we need better models in the loop?
Does it mean we need better guardrails? Does it mean we need to change the user experience, because we are making life far too hard for the agents, and maybe if we asked the user to put more information in, it would give them a better outcome? But we need to worry about it. But this idea that you're kind of flinging back into...

Daniel Jones (26:54) Yeah.

Amy (26:58) product land, into design land. I think that, you know, challenges to some extent how our teams often work, and how we expect to have different groups of people in different rooms, and what tasks people are supposed to do. Because suddenly you've got this task to do that doesn't feel like product management, right? But it's actually a very big control lever over what your experience will be like.

Daniel Jones (27:19) Yeah, there are so many interesting things to unpack there. First off, before I forget: eval-driven development. If somebody hasn't coined the term already, get the Tessl trademark people onto that nice and quickly. Because you used the word "outcome". And I think that's one of the fundamental challenges I've faced when working with organizations, trying to help them be more effective and doing transformation of organizations:

Amy (27:25) Yeah.

Daniel Jones (27:44) getting them to focus on outcomes rather than deliverables. What do you want to actually happen? What is the change you want to see in the world? Not: what is the feature you want to deliver? And, you know, maybe if we zoomed out a little, to thinking evals-first about what a good outcome here is, rather than the specifics of the journey, then that would help people focus more on delivering value rather than delivering features.

Amy (28:02) Yeah. Mm.

Daniel Jones (28:10) And you also mentioned the fact that there's a kind of rejigging of responsibilities.
And when we look at how AI is adopted in organizations — whenever any change comes in, the first thing people do, quite understandably, is they want to use the new technology within the boxes on the org chart. Like, we're not going to change anyone's roles. We're not going to change any interactions between units in the organization. We're just going to use the new tech inside this box. Amy (28:16) Mm. Daniel Jones (28:35) And then slowly the boxes start to merge and people shift between them. And maybe with, you know, all the crazy stuff that's happening with things like Gastown and agent factories, and being able to give a whole product requirements document to a set of agents and have it fully implemented — and at the same time, you've got product managers who can just do stuff in Lovable without having to involve engineers — maybe all of these functions that used to be separate are kind of... Amy (28:39) Hmm. Daniel Jones (29:02) ...kind of coalescing into some sort of product engineer, outcome-focused person, who's going to be thinking about: what are we actually trying to achieve here for users? What would a good outcome look like? And then, am I going to use tooling to achieve that — whether it's, you know, I'm going to get Claude Code to do it, or whether I'm going to use a low-code, no-code prototyping tool? It feels like there's an opportunity there for things to shift and boxes to merge. Amy (29:28) Yeah, absolutely. And yeah, maybe smaller teams that have kind of these mixed skill sets, as people are learning them, will be able to ship a lot more together. So you're changing the length of the iteration loop and who the iteration goes through. Yeah, that's really such a good point. Daniel Jones (29:45) It's an exciting time to be in software development. Amy (29:47) It's really crazy. Yeah, it's super crazy. I think, yeah, so thinking about outcomes. I think the reason — well, I think outcomes are really hard, aren't they?
And you know, what struck me — so I've been working in machine learning for... Daniel Jones (30:02) since before it was cool. Amy (30:04) Way before it was cool. So I trained as a mathematician, and I remember my first startup roles in 2008. And I remember there being the recruiter — I think she forgot I could overhear her while she was on a call — and she mentioned, oh yeah, we've got a mathematician on the team. We don't know why. I don't know what they're doing here. Amy (30:22) Yeah. So before data science was even coined. So yeah, a very long time. In this time, I've gotten to work with enterprises who are going to use AI to automate something, through all the generations of it. The hardest thing about automation is realizing that you need to be precise about what you're doing. And I think what you realize is that there are a lot of business processes where we're not very precise about what we're doing, and we're relying on people being given conflicting instructions and just doing something. And so we used to measure — we had some tasks where we had to label documents and say which ones were of a certain type that needed to get flagged. And we'd have experts from the enterprise go label the documents, and they would label them. We would measure the inter-annotator agreement, and it'd be like less than 50%, which I think tells you that they actually had different definitions in their minds. They were actively labeling the documents in completely different ways. I think in a lot of businesses, it's very hard to know exactly what you're doing. There's a lot of ambiguity there, and people are kind of fiddling around, you know, fudging their way through. As soon as you automate a chunk of this, it's a bit scary, because you're encoding this into a system that — maybe it's not deterministic, but it's maybe principled, where it's going to then be repetitive in some way.
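The inter-annotator agreement Amy mentions is a standard measurement, not something specific to Tessl. A common refinement on raw percent agreement is Cohen's kappa, which corrects for how often two annotators would agree by chance; the labels below are made-up examples in the spirit of her flag-these-documents task.

```python
# Inter-annotator agreement on a labeling task: raw percent agreement,
# plus Cohen's kappa, which discounts agreement expected by chance.
def percent_agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    po = percent_agreement(a, b)  # observed agreement
    labels = set(a) | set(b)
    # Expected chance agreement from each annotator's label frequencies
    pe = sum((a.count(l) / len(a)) * (b.count(l) / len(b)) for l in labels)
    return (po - pe) / (1 - pe)

ann1 = ["flag", "flag", "ok", "ok", "flag", "ok"]
ann2 = ["flag", "ok", "ok", "flag", "flag", "ok"]
print(percent_agreement(ann1, ann2))  # 0.666...
print(cohens_kappa(ann1, ann2))       # 0.333... — much weaker than it looks
```

A raw agreement of 67% here shrinks to a kappa of 0.33 once chance is accounted for — which is why sub-50% agreement on a two-label task, as Amy describes, means the annotators effectively had different definitions in mind.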
And so you want to be really careful about it. And in a way it's good, right? Like, we're saying a load of the busy work we are going to push to the agent, but we're going to have to spend more time grappling with: what are these outcomes? How do we define them? And because we're going to be able to measure them more precisely, maybe more of our lives is going to be at this level of defining outcomes, making them so, and then realizing you get what you asked for — maybe that's not quite what we meant — and then iterating on these outcomes. So defining them, measuring them, updating them, rolling them out, changing them, seeing if that feels better, and that kind of loop. There's something very interesting there. I think as well, what's kind of nice on the outcome side is that because we also have agents that can help us evaluate the outcomes, we don't have to be super precise. You know, we can use models as judges, looking at text and outputs and agent traces and coding things. And they can measure things that were hard to automate before — coding style, for example. We've had linting rules forever, but now you can measure whether coding style is followed, and encode it if that's something you care about, but in a softer way than we used to be able to do. So yeah, we're going to put our heads into this outcomes world, and we can measure things that we didn't used to be able to measure. We need to be iterating on those as we watch them roll out. Daniel Jones (33:03) And maybe that desire for precision has always been misguided. So when you were explaining there about the agreement levels between different evaluators being less than 50%, it reminds me a little bit of — before we started recording, we were chatting a little bit about Gastown, which, for listeners who are unfamiliar with it, is Steve Yegge's new...
It's got all the most ridiculous names in the world, but can achieve a great amount of work by having lots of agents spinning up trying different tasks, having different responsibilities, and occasionally they do the wrong thing. And occasionally they need kicking and occasionally the work needs to be redone. Similarly, with Jeff Huntley's ⁓ cursed language where he coined the term Ralph Wiggum loop of having Claude code in a loop for three months, trying to implement a new programming language in a compiler. The reason that those things have worked is because of the non-determinism. If they were deterministic, there is a chance they would code themselves into a corner that they couldn't be nudged out of. And in the Ralph Wiggum loop, Although I remember reading that Jeff saw it kind of get stuck for a number of hours, it would eventually nudge itself out that local maxima. And so I wonder if the fact that in human systems and businesses that operate well, there is some amount of variability in diversity of like how decisions are made and whether people do the right thing or not. maybe that's a feature of, you know, of anti fragility and things generally working most of the time. And then when you compare that to like, I don't know, business process system, where, okay, we're to change the rules. And if it breaks, it breaks everyone 100 % of the time. So production outage, quick, we got to go and fix that. Maybe we just need to recognize that trying to get everything working 100 % of the time, exactly the same way is not a feature of Amy (34:35) Yeah. Daniel Jones (35:01) you biological systems or human systems or complex systems. And we just need to become more accustomed to that. And it would be interesting to see whether like with the gas, gas towns and the Ralph Wiggum loops, whether like what we're seeing is that being proven kind of experimentally that the non-terministic means are the way to achieve largely consistent outcomes. Sorry, I've gone on. 
Amy (35:26) I love it though. I love it. It's so good. Yeah. And I think, you know, when you look at the large-scale social systems that we've kind of invented — well, not invented, but are part of — you look at something like evolution, something like capitalism, like market economies, maybe. Sometimes these are systems where there's a bunch of trying things, and then the thing that seems to be working, you pile onto. Daniel Jones (35:26) philosophical tangent there. Amy (35:49) And that works quite well. It's like genetic algorithms back in the day — I don't know if you ever tried those out. So, this idea where you kind of randomly modify an algorithm, and then you pick whichever one's working better, and then you randomly modify that. You pick whichever version works better, and you just evolve an algorithm — a surprisingly effective way of creating an effective algorithm. So yeah, I love it — like, embrace a bit of the chaos if it fits... I think there's another piece here as well, where I think maybe we don't know what makes humans so great. We like to think of ourselves as these rational beings that are, you know, careful and thoughtful, and it comes back to some kind of first principles, and then you realize when you try and build systems that actually sometimes that wasn't what was making us great at all. You realize that the bar that you were trying to meet is in a completely different place than you realized. Like, maybe it's much easier to replicate some human behavior than you thought possible — it feels like the bar is way lower — but then somehow the system is more stable, and so the bar is way higher than you thought, in a different way. So here, some of this is grappling with this kind of: what is good behavior? And yes, to your point, maybe not-rule-following is a big part of it. Daniel Jones (37:00) And...
I'm thinking back to my teenage self, and yeah, not-rule-following was the thing I was particularly good at, as any of my sixth form teachers — for that very short period before I got asked to leave — would attest to. I mean, just continuing to think philosophically, one of the things I find fascinating is I think we're learning more about ourselves with this, you know, fairly limited form of AI. Two things. One — Amy (37:07) Yeah. Daniel Jones (37:27) I realized that I often dream about reading, and I have been dreaming lucidly enough to realize that basically my brain's just doing exactly what an LLM does — like, imagining sentences one word after the other. But more applicably to agents and software development: you know, there are lots of thoughts that pop into your head, and then there's a central governor that goes, no, that wouldn't be appropriate to say, you shouldn't do that thing. Amy (37:44) Hmm. Daniel Jones (37:53) And so when we're looking at coding agents and other agents, the idea that the first thing that comes out of a model should definitely be the thing that you do, and shouldn't be subject to additional checks — you know, maybe another model kind of going, is this appropriate to respond or return to the user? Is it inappropriate? It seems like we have those kinds of feedback loops in our psychology, so it would follow that it would be a good thing for agents to have those as well. Amy (38:18) Yeah, absolutely. And I wonder whether what we'll find is that, over time, what becomes more and more important is how we verify. So, what kind of checks we put at the end that say, is this done? Is it good enough? Yeah, and the tighter we are — again, this is the outcomes point — but the tighter we are about that, the more freedom we can give to the process that gets there. So we can have more chaotic systems and more Ralph Wiggum head-banging going on, or whatever it is.
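This try-things-and-keep-what-passes-the-check loop is essentially the genetic-algorithm idea Amy mentioned earlier. A toy version fits in a few lines; the bit-counting objective is a deliberately trivial stand-in for any "did we get what we wanted?" check, and the specific parameters are arbitrary.

```python
import random

# Toy version of the loop: randomly modify, keep whichever variant scores
# at least as well, repeat. The fitness check is the only "intent" supplied.
def fitness(bits):
    return sum(bits)  # stand-in objective: count of 1-bits

def mutate(bits, rate=0.1):
    # Randomly flip bits — the "randomly modify an algorithm" step
    return [b ^ 1 if random.random() < rate else b for b in bits]

def evolve(length=32, generations=300, seed=0):
    random.seed(seed)
    best = [random.randint(0, 1) for _ in range(length)]
    for _ in range(generations):
        child = mutate(best)
        if fitness(child) >= fitness(best):  # keep the one that works better
            best = child
    return best

print(fitness(evolve()))  # climbs well above the ~16 a random start gives
```

Nothing in the loop knows *how* to improve the bitstring; all the steering comes from the check at the end — which is exactly the point about tight verification buying freedom for a chaotic process.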
We can try out these really kind of crazy things, because we've got something that we can come back to, where we're like, okay, we've got some way of measuring: did we get what we wanted? That becomes more and more the thing. And those things becoming... Daniel Jones (38:58) Yeah. Amy (39:02) you know, maybe not worrying about the unit tests throughout the system as much, because they depend on how we've designed the system — so it's hard to work those out ahead of time — but more high-level ways of measuring what the software is doing, what it would be like to work with it, that we could turn back into things that we measure to see if it's done. Or maybe they even become tools that we give to the agents themselves to measure: is this done? Is it good enough? So they can kind of keep going until we hit that. Daniel Jones (39:30) Yeah, an ex-colleague of mine, Chisa Noabara, always used to talk about intentionality — she was sort of head or director of product, I should really know the answer to that, sorry Chisa — but she was always insistent on intentionality being important. We need to know what outcome we're driving for and how we're going to know when we've done the right thing, which is a conversation that is shockingly lacking in a lot of enterprises. Amy (39:36) Yeah. Daniel Jones (39:54) But if you had that, then that makes proving that your agentic, non-deterministic system is generally behaving well a lot easier. And to your point, you know, once you've got that data signal, then presumably you could start automating part of the tuning process, and having agents that are tweaking the prompts of other agents to make them better over time. Amy (39:56) Yeah. Yeah, that's right. I think the other thing — maybe one of the things that this makes me think about is, you know, it's interesting comparing big enterprises and the different ways in which they've chosen to work.
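The "model as judge" checks Amy described — softer than linting, scored over a basket of samples rather than pass/fail on one case — might be wired up roughly like this. This is a sketch, not Tessl's tooling: `call_model` is a stub standing in for a real LLM API call, and the rubric and PASS/FAIL protocol are illustrative choices.

```python
# Sketch of a rubric-based "model as judge" check.
# In a real harness, call_model would send the prompt to an LLM;
# here a crude heuristic stands in so the sketch is self-contained.
RUBRIC = (
    "Does this code follow snake_case naming for functions? "
    "Answer PASS or FAIL.\n\nCODE:\n"
)

def call_model(prompt: str) -> str:
    # Stub judge: inspect the first function name in the submitted code.
    code = prompt.split("CODE:\n", 1)[1]
    name = code.split("def ", 1)[1].split("(")[0]
    return "PASS" if name == name.lower() else "FAIL"

def judge(code: str) -> bool:
    return call_model(RUBRIC + code).strip() == "PASS"

def run_suite(samples: list) -> float:
    # Evals report a rate over a basket of samples, not a single verdict.
    return sum(judge(s) for s in samples) / len(samples)

score = run_suite(["def parse_rules(doc): ...", "def ParseRules(doc): ..."])
print(score)  # 0.5
```

The shape is what matters: a natural-language rubric plus an aggregate score is what lets you measure "soft" qualities like style adherence that never fit neatly into deterministic lint rules.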
So do you remember that picture from a long time ago where they had the diagram of Microsoft, and it was like all the departments were pointing at each other's heads? And then the Google one was all these different boxes, and the IBM one... Like, these companies work in very, very different ways. And I think in most companies you kind of make these choices, where you pick your values, or you pick some of your ways of working. For example, do you move fast and break things, or do you check really thoroughly because, you know, you're working in a high-stakes environment and you need a very low-risk product? What are the decisions that you make, that you roll out into your org, about how you want to work? And some of those are leaps of faith, right? You're sat there and you're reasoning: this is how I want my organization to work, because I think if we work in this way, it would be better for our customers and we'll be more successful overall. And so it might be that there are times when the outcomes that you're measuring are the very high-level ones, where you're going to say, this is the final thing — we make money, eventually. Maybe this is the only metric that matters, or, you know, retention rates or whatever — that you have these very high-level measures. But you could also be guessing things; you could be making bets on things that are much smaller. So: we think we are gonna be successful if our coding agents make very nice modular code that has very high test coverage. Or we use this microservice pattern, or we use this style of engineering. So the outcomes could be very high-level ones, but they could be these ones that are kind of a bet. And then you measure, you roll those out, you give these tighter guardrails. I think deciding what level of resolution you want to make your guardrails at — that's going to be a measure of how good the agents are.
So right now it might be that those outcomes have to be very tight and close to the task, and we're going to spend a lot of time crafting very narrow ones, because the agents can't go that far. And then as the agents get better, maybe we're going to be crafting ones that are more organizational or more system-level, and we'll let the agents fill in more of the details. Who knows? It feels a bit like this. Again, reasoning about the outcomes — which ones are bets that we're making, do we want to track how they're working, which ones are high-level — I feel like this is going to be the core of the game. Daniel Jones (42:31) Cool. So earlier you mentioned that one of the things you discovered in the kind of evaling — is that a verb? — the testing, the kind of assessment of, you know, your own products: that having lots of guidance, lots of steering for an agent, combined with lots of very detailed instructions in a particular prompt — like, do this thing in this particular way, or here's the spec for what I want you to build — led to poorer results. Are there any other kind of things that stick in your memory that you found through that process, that maybe would be more widely applicable to your average developer, either writing an agent or using one? Amy (43:02) Hmm. Yeah, so I think one thing that we found is that often what happens is, over time, you accumulate more and more rules in your agents.md or whatever that you're giving to the agent, and often they become contradictory to each other. And so when we've had the chance to go and look at different enterprises' code bases, we often find there's a bunch of contradictory rules in there. There are out-of-date rules. There's stuff that made sense at one point and doesn't make sense anymore. And so I think you sometimes want to put yourself in the position of the agent and say: how easy would it be for a person given these instructions to follow them? And does it make sense?
So contradictions within steering make it much harder for an agent to know what to do. You're putting it in the position of kind of weighing and deciding what to follow. Sometimes we find contradictions can be quite subtle ones. So it might be that one document says there are two acceptable ways to throw errors, and another one says, only use this error pattern. We found this with style docs, where we've got kind of global style documentation and we've got repo-specific style documentation, and it's not obvious to the agent which one it should be following — and it makes it less likely it will follow any of it. So that's definitely a pattern. We've found that even if you're spending all your time writing documentation for agents, you need to be self-reflective about what you've ended up with, what the accumulation has ended up being. Make sure that makes sense. Daniel Jones (44:36) And I can't remember if it was in the conversation as recorded, or when we were chatting beforehand, but you mentioned skills — skills in Claude Code — which have this property of progressive disclosure: a lot of the guidance is only loaded once the model has decided it wants to use that skill. And presumably folks would be better off limiting the amount of guidance that they provide unless it's Amy (44:44) Yeah. Daniel Jones (45:02) genuinely universal. If you've got guidance that only applies, you know, half the time, or in one activity out of ten, it sounds like — certainly if you're using Claude Code, or any other agent that supports skills — it would be better carved off and put into a skill, so it's only considered when it's applicable. Amy (45:22) Yeah, I think that's a really good point. And one thing that we found with that is the progressive disclosure — so at the top of the skills file, it has the word description, and you're meant to write out the description. Description is kind of a misnomer.
In a way, what that is, is the pitch to the agent of when to read it. It's kind of the trigger. What is the trigger for the agent to know that it should read that doc? So yes, putting it into kind of a progressive disclosure is nice, because you preserve the context of the agent and make it less likely it's going to contradict other stuff and get confused. The risk is that the agent never bothers reading it. So we've found a lot of cases where we've written rules and clearly the agent is not even bothering to read the rules. Amy (46:06) So I think that description is almost like an SEO play. Like, you want to really pitch carefully: when should the agent realize it does need to read it, so it gets into memory? So yeah, I think in a finely tuned system — totally agree with you — you would have thought very carefully not only about what kinds of instructions you want to give the agent, but when you want it to read them. And then you'll have measured whether you've kind of sold it to the agent in a way that it does actually choose to read it at the points you want. We've found pretty big differences between models in how likely they are to follow instructions to use different tools, or, yeah, choose to open up and read documents. And that makes it quite tricky to build, actually, because you're trying to reason around: okay, if I hide it in this way, can I activate it? Can I get the agent to reliably actually do this thing? Daniel Jones (46:56) Yeah. And I can imagine that Anthropic probably have a bit of an advantage there, in that, with the widely popular coding tool, most of the coding use cases are going to be theirs, and therefore they have the most data to train on. What was I going to say? There were two things: one was mildly intelligent, and the other one was banal. The banal one — have you heard of the whole Van Halen
Daniel Jones (47:22) brown M&Ms idea in your agent steering? Amy (47:24) Yeah. You know what, we were talking about the Van Halen thing yesterday. So we realized, with Cursor, it's very hard to know if a model in Cursor read a Cursor rule. It's not in the logs. It might not be possible to know whether your model read your rule or not. And so we were talking about M&Ms, because we were thinking — well, you tell me what you were thinking — but it kind of struck us as this: like, if you can't tell if an agent has read it, what could you do to try and get that signal? Daniel Jones (47:49) So, cool — a bit of background for listeners. In the 1980s, the rock band Van Halen had, in their 50-page contract, one clause that was to do with their rider, which is the food and drink you receive as hospitality when you're a touring band. And that clause said that they should have a bowl of M&Ms, and all of the brown ones should be removed. On the face of it, this seems like rock stars being divas, but it was actually a fast feedback mechanism, so that they could arrive and determine whether somebody had actually read the contract or not. And there was one case where they turned up, and there were brown M&Ms in the bowl when there should not have been. At the same time, the road crew were starting to assemble the stage. The floor was not strong enough to hold the stage, because the weight ratings were wrong. $75,000 worth of damage was done, but the band could say to the venue: this is your liability — we can prove you didn't read the contract and follow all of the instructions. So, what's this got to do with agents? One of the ideas I saw floating around social media, around autumn last year, was that you should give your agent an instruction to always start every message with this emoji. And then when you see it starting to respond without that emoji, you have a sense that the Amy (48:40) Mm.
Daniel Jones (49:07) model's context was getting overloaded and it was no longer consistently following all instructions. So then you've got a kind of fast warning mechanism: if it's not following this instruction, it might not follow the ones saying don't comment out the tests, and so on and so forth. So I don't know how well that actually works in practice, and I don't know whether you've done any scientific analysis of it. Daniel Jones (49:29) There is an argument that giving it frivolous instructions might nudge it down a slightly different path, and you're also just wasting context telling it to do something that's a bit daft. Amy (49:35) So another thing we've found is we've taken instruction files and turned every single instruction into a different thing that we track adherence to. So it might be that, yeah, in your file there are effectively 10 rules; we track all 10 of them. And you can find that in one file you'll have certain rules that are consistently followed and other ones that are consistently ignored. So I think one problem with this would be: if you stuck that in there and it was followed, then you would know that at least the agent read the doc, but you wouldn't know that it therefore also decided to follow the other rules. So it gives you a bit of a signal, but not a lot. And it wouldn't necessarily tell you that if it didn't follow the rule, it hadn't read the doc — to your point, it might've ignored it. Yeah, there are definitely some rules that the agents are more inclined to follow than others, and I think we're still trying to work out why that is. Now, I think you could imagine that there's going to be some hierarchy, of like: some things are more difficult to do. Maybe some take a lot more cognitive load.
So if you're trying to do thing A and at the same time do thing B — like, it might not be hard to stick an emoji in, but it might be very hard to say, follow test-driven development while building something where it's not obvious how to write tests. And so maybe the model is going to decide, ah, forget that, you know, I'm going to start the other way around. Sometimes it might be that there's something that's so trained into the model that, even if the only thing you tell it to do is this one thing that is kind of against its training, it's going to completely refuse. So I had a moment before Christmas where I was building a tool that was going to call OpenAI. And I think I had Claude building this tool, and it put 4o as the model to use. I was like, don't use 4o, use 5.2. And it was like, oh no, 5.2 does not exist, I will use 4o. And I was like, 5.2 does exist, use it. And so it said, okay, fine, I'll use 5.2. And then it went and Daniel Jones (51:32) Yeah. Amy (51:43) put in 4o. Like, it pretended, and it didn't do it. So there are some things where you realize, hang on, we need to play a different game here. This is not a casual thing. I need to really go back and work out where to insert this instruction and how to explain it to it so that, you know, it's really hard-wired in there. Daniel Jones (52:05) Yeah. I mean, that also says something about you — you know about these systems and could spot that and be aware of it. I think a lot of the frustration that people feel is when they... over — what's the word? There's a word, but I can't remember what it is — but they imagine that these systems have more self-awareness than they do. Now, an agent in a loop, you know, via chain of thought, can go back and forwards to the model, and then the model might pick up that in a previous message it said something that wasn't true or was inaccurate.
But just in one turn, there's nothing checking that what it said was accurate; it's not reflecting upon itself — you know, when it said, I'll use 5.2, and then didn't. When people get frustrated with that, expecting it to have the kind of reflection mechanisms of a human, I think Amy (52:37) Mm-hmm. Mm-hmm. Daniel Jones (52:52) they end up on, you know, a downward spiral. I remembered what the mildly intelligent thing I was going to say was. And also, to your point about following different instructions, and having contradictory ones, and them being difficult to follow at the same time: I think that with coding agents in particular, we give them too tough a task in trying to consider all of the concerns at all of the times. Like, as a raging simpleton and a developer myself — you know, there are lots of things you have to do. The first one is discovery: how do I go from these vague requirements to the irreducible computation that is minimally complex that actually needs to happen? And that's discovery — you can't tell how long that's going to take, which is why all productivity metrics for developers are a waste of time, because it's discovery. If you knew how to do it, you'd probably have done it before. Side thing. But when you get to writing the code, it's like: I've got to make it work. I've got to make it work reliably. I've got to make it work maintainably. I've got to make it work readably. I've got to make it work securely. I've got to make it work performantly. And trying to get an agent — if you stick all of those things into your steering documents — in one conversation, one conversation history, to think of all of those different concerns in one go... I think that's too much.
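One way to see which of those steering concerns actually land is the per-rule adherence tracking Amy described earlier: split the instructions file into individually tracked rules and record, run by run, whether each was followed. A rough sketch — the file format, class, and rule texts are illustrative, and in practice the followed/ignored verdict would come from a judge model or log inspection rather than being passed in directly.

```python
# Hypothetical per-rule adherence tracker: each bullet in an agents.md-style
# file becomes a separately tracked rule, and each run records whether the
# agent followed it.
def parse_rules(doc: str) -> list:
    return [
        line.lstrip("- ").strip()
        for line in doc.splitlines()
        if line.strip().startswith("-")
    ]

class AdherenceTracker:
    def __init__(self, rules):
        self.counts = {r: {"followed": 0, "total": 0} for r in rules}

    def record(self, rule: str, followed: bool):
        self.counts[rule]["total"] += 1
        self.counts[rule]["followed"] += followed

    def rates(self) -> dict:
        return {
            r: c["followed"] / c["total"]
            for r, c in self.counts.items()
            if c["total"]
        }

doc = "- start every reply with the canary emoji\n- never comment out failing tests"
tracker = AdherenceTracker(parse_rules(doc))
tracker.record("start every reply with the canary emoji", True)
tracker.record("start every reply with the canary emoji", False)
tracker.record("never comment out failing tests", True)
print(tracker.rates())
```

Per-rule rates like these are what surface the pattern Amy mentions — one file containing both consistently followed and consistently ignored rules — which an all-or-nothing check would hide.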
And what we probably need are things a bit more like Gastown, and maybe even smaller, finer grains than that: multiple passes of, like, write the code, make it work, then hand off to another agent — or the same agent, but with a clean context, so it's like a fresh set of eyes. Now make it readable. Now make it performant. And having all of those concerns at one time — I mean, it's far too much for my simple human brain, and evidently it's possibly too much for some models. Amy (54:40) Yeah. Yeah, I think that's right. You know what's cool about this? I think we worry sometimes — I think some people are worried when they start using agents. They're like, I'm a good software engineer, why am I letting the agent do all the interesting stuff? And it feels like, you know, I don't want the agent to get to do this instead of me. I think when we talk about these problems, you realize there is a whole new level of engineering to do that's still engineering, where we're going to be engineering systems where coding agents can be successful at building the things that we wanted. And there's a whole lot of problems to solve around: what is the design of the system that works well for my organization, given that we care a lot about security, or given that we're refactoring a big code base right now and we've got a mix of patterns and we want to steer everything towards using this new pattern, or given that — you know, whatever it is, whatever it is that we care about — figuring out how you engineer it. So I think it's quite an exciting time to be an engineer, actually. I think there's this whole slew of new kinds of problems coming, and I think that makes me quite excited. Like, I think some of these things do not have easy answers, because they're about, yeah, doing what engineers always do, right? Which is: ideally you've got this kind of user story in mind, you've got kind of an understanding of what your business is trying to achieve.
And then somebody said, wouldn't it be nice if this kind of thing happened? And then you're translating that, as you said, into a concrete thing that I can build right now. And the hardest bit was connecting it back: is this really the right — am I solving the right thing? Am I building it in a way that we'll be able to build on top of in future? And this kind of progresses that in some ways. There's a whole new set of problems to start learning how to navigate, and tools that mean that some of the problems we used to spend a lot of time on, we don't have to worry about as much, because the model will get there — you know what I mean? Remembering what the syntax is, and stuff. But I think it's just staggering how quickly this is going. And I think you want to get yourself onto the train, where you're playing with it — and kind of go carefully — and be talking to people, and seeing what's there, and seeing what works for you, and kind of evolving it. Daniel Jones (56:56) Absolutely. We've done just over an hour. I am aware of the fact that you mentioned being on call, and we've taken up a lot of your employer's time. That's good. So if people wanted to check out Tessl and the Tessl registry, what should they do? And are there any — have you got any speaking engagements coming up? Have you written any books, anything like that? Amy (57:05) Luckily I have not got engagements. So yeah — tessl.io, it's T-E-S-S-L, we are low on vowels. If you check it out: we have an incredible team that's making blogs and news articles, and we run our own conference. We have a number of people on the team who are just outstanding public speakers, and who spend time not only describing what Tessl is doing, but also just trying to describe what it's like to be an AI-native developer. And so they use all kinds of different tools, and write and think about them. So there's a load of content.
If you go on there, we've got a Discord you can engage with; you can ask questions, share problems. We're super curious to learn about what problems people are having and what would be useful to them. Any feedback, like if you use the registry and you hate it, or it doesn't work for you, we'd love to hear it. We're jumping on things very fast. The team's very responsive. So yeah, there's lots of ways to follow up. Daniel Jones (58:20) Cool, excellent. Yes, and the AI Native Dev podcast is hosted by one of Amy's colleagues, so you should definitely check that out. I've listened to quite a few episodes of that myself. But, go on. Amy (58:30) And one last thing: we also have an event space in London, on Pentonville Road. We've hosted a huge number of community events, something crazy, like a hundred. There are at least one or two a week in the office, not just ones that we're organizing, but events that other people are organizing, generally people thinking about AI. So there are opportunities to come into the building and say hi. Daniel Jones (59:04) Excellent. Right. Amy, thank you very much. It's been a lot of fun chatting, and hopefully I will speak to you again soon. Amy (59:10) Sounds wonderful, thank you so much. Daniel Jones (59:12) Cool, cheers.

Episode Highlights

Tessl builds documentation registries to ground coding agents and stop API hallucinations in enterprise environments.

Moving to AI-native development requires shifting from deterministic logic to biological science and probabilistic experimentation.

Evaluations measure agent success across baskets of scenarios rather than traditional binary pass-fail unit tests.
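The scenario-basket style of evaluation described in the episode can be sketched roughly as follows. This is an illustrative Python sketch, not Tessl's actual harness: `run_scenario`, the scenario names, and the stubbed success odds are all assumptions standing in for a real agent invocation plus an automated check of its output.

```python
import random

# Rather than a single binary pass/fail, each scenario in the basket is run
# several times (agent output is non-deterministic) and scored as a pass
# *rate*. The success odds below are stubs standing in for real agent runs.
OUTCOMES = {
    "uses-real-api": 0.9,
    "avoids-hallucinated-method": 0.6,
}

def run_scenario(name: str, rng: random.Random) -> bool:
    """Hypothetical stand-in for one agent run plus an automated check."""
    return rng.random() < OUTCOMES[name]

def evaluate(scenarios, runs: int = 50, seed: int = 0) -> dict:
    """Return per-scenario pass rates across repeated runs."""
    rng = random.Random(seed)
    return {
        name: sum(run_scenario(name, rng) for _ in range(runs)) / runs
        for name in scenarios
    }

rates = evaluate(OUTCOMES)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%} pass rate")
```

Comparing these pass rates under different conditions, say, with and without registry documentation in the agent's context, is what turns "the agent seems better" into a measurable claim.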

Hyper-detailed task prompts paradoxically trigger models to ignore broader system instructions and core steering rules.

The software engineering role is evolving into a Product Engineer focused on high-level intentional outcomes.

System non-determinism acts as a feature enabling anti-fragility and escapes from logical local maxima.

Multi-pass agentic loops manage distinct concerns like security and performance more effectively than single prompts.
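The multi-pass idea discussed in the conversation, one concern per pass with a fresh context, might look something like this minimal sketch. `run_agent` is a hypothetical stand-in for a real coding-agent call, stubbed here so the control flow is runnable; the pass list mirrors the "make it work, make it readable, make it performant" sequence from the episode.

```python
# Minimal sketch of a multi-pass agentic loop: each pass hands the artifact
# to a fresh agent context with exactly one concern, rather than asking a
# single prompt to juggle correctness, readability, and performance at once.

def run_agent(artifact: str, concern: str) -> str:
    """Hypothetical stand-in for one coding-agent invocation."""
    return f"{artifact}\n# pass applied: {concern}"

# Ordered concerns; each pass is a "fresh set of eyes" with no prompt history.
PASSES = [
    "write the code and make it work",
    "make it readable",
    "make it performant",
]

def multi_pass(spec: str) -> str:
    artifact = spec
    for concern in PASSES:
        artifact = run_agent(artifact, concern)  # fresh context per concern
    return artifact

print(multi_pass("# spec: parse a CSV of orders"))
```

The design choice is that each invocation sees only the current artifact and one instruction, so no single context has to hold every concern at once.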

Share This Episode

https://re-cinq.com/podcast/evals-reducing-hallucinations-and-ai-native-development

Free Resource

Master the AI Native Transformation

Get the complete 422-page playbook with frameworks, patterns, and real-world strategies from technology leaders building production AI systems.

Get the Book

The Community

Stay Connected to the Voices Shaping the Next Wave

Join a community of engineers, founders, and innovators exploring the future of AI-Native systems. Get monthly insights, expert conversations, and frameworks to stay ahead.
