Deep Engineering #42: Building Reliable Multi-Agent Systems with Fitz Nowlan
How to preserve facts across agent handoffs, when to use MCP, and why bounding execution is non-negotiable
Building an AI-Powered Internal Developer Platform from Scratch
Hone your skills to design, implement, and ship an AI-powered internal developer platform from scratch with Ajay Chankramath.
📅 25th and 26th April 2026 | 11:00 AM to 3:00 PM ET
Use code DEEPENG50 for 50% off, or grab the 2-for-1 deal
Includes access to The Platform Engineer’s Handbook by Ajay Chankramath upon release (eBook worth $35.99)
✍️ From the editor’s desk,
Welcome to the 42nd issue of Deep Engineering!
Multi-agent AI is being widely adopted in production, but the engineering reality is messier than the demos suggest. GitHub Engineering published a post by Gwen Davis laying out the most common ways multi-agent workflows fail in practice. Most failures come down to missing structure, not model capability.
Agents make implicit assumptions about state, ordering, and validation at the points where they hand off work to one another, and without explicit data formats, typed interfaces, and defined action schemas at those boundaries, the system breaks in ways that are hard to reproduce and harder to debug.
The architectural decisions you make around the model matter more than the model itself. That is precisely what today’s issue covers, with Fitz Nowlan, VP of AI and Architecture at SmartBear, who has spent the past two years building agentic systems in production.
Let’s get started.
Featured Newsletter: The Cloud Playbook
If you own reliability, cost, and compliance in production on AWS, The Cloud Playbook newsletter is worth your attention.
It is a weekly read for engineering leaders building platforms where incidents, audits, and surprise AWS bills are real problems. It covers AWS architecture, platform engineering, FinOps, security, observability, and compliance frameworks like FedRAMP, HIPAA, and ISO 27001.
Subscribe to The Cloud Playbook
Multi-Agent Systems Need Rules to Stay Reliable in Production
by Saqib Jan with Fitz Nowlan
As agents pass tasks to one another, reason autonomously, and compose solutions on the fly, it is easy to construct a version of multi-agent AI that looks technically sound on a whiteboard. In practice, however, that version tends to fall apart when context gets lost between handoffs, when loops run without terminating, and when open-ended systems become progressively harder to test, debug, or trust.
Fitz Nowlan, VP of AI and Architecture at SmartBear, has spent the past two years building agentic systems in production, and his view, drawn from that engineering experience, is that the reliability of a multi-agent system is determined less by the model you choose and more by the architectural decisions you make around it.
Preserving facts across the handoff boundary
The first place multi-agent systems break down is at the handoff. When agent A passes context to agent B, the question is not just what to pass but how to pass it. Raw string prompts are the path of least resistance, but Nowlan argues they are also the path most likely to introduce errors.
“We are almost always using some form of structured data,” he points out. “It’s generally going to be JSON. What we’ll often try to do on top of that JSON is define, effectively, a domain-specific language, or DSL, or our own schema, to cache, kind of, to preserve truth that we’ve already determined.”
The distinction matters because of what happens when you do not preserve that truth. Suppose an agent is working within the context of a browser and identifies that two elements appear inside the same container on a page. That relationship has been syntactically proven through a screenshot, the HTML, or the DOM. If you pass that context as plain text to the next agent, that agent has to decompose the text and then probabilistically reconstruct whether the relationship exists at all.
“That other agent has to effectively decompose and then probabilistically, i.e., potentially hallucinate, [infer that] that relationship is there,” Nowlan notes. “So what we try to do is communicate over JSON and lock in to the JSON in our custom schema the facts that we’ve identified or extracted from one context or world to another.”
The practical principle is simple enough. Anything that has been established as a fact should be encoded as a fact, not as prose. Structured contracts at the handoff boundary are not just a formatting preference. They are a mechanism for preventing the downstream agent from having to guess at things the upstream agent already knew.
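As an illustration of this principle, here is a minimal sketch of a structured handoff payload. The `ContainmentFact` schema and its field names are hypothetical, not SmartBear’s actual DSL; the point is that a verified relationship travels between agents as data, not prose.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical fact schema -- an illustration of "locking in" proven facts,
# not SmartBear's actual schema.
@dataclass(frozen=True)
class ContainmentFact:
    """A relationship the upstream agent has already verified against the page."""
    parent_selector: str
    child_selector: str
    evidence: str  # how it was proven: "dom", "html", "screenshot"

def build_handoff(task: str, facts: list[ContainmentFact]) -> str:
    """Serialize the handoff so downstream agents treat facts as givens."""
    return json.dumps({
        "task": task,
        "facts": [asdict(f) for f in facts],  # established truths, not free text
    })

payload = build_handoff(
    "fill the login form",
    [ContainmentFact("#login-card", "input[name=email]", "dom")],
)
received = json.loads(payload)
# The downstream agent reads the relationship directly instead of
# probabilistically re-deriving it from prose.
```

The downstream agent consumes `received["facts"]` as ground truth rather than re-parsing a paragraph that merely mentions the relationship.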
When MCP outperforms a static API wrapper
Model Context Protocol (MCP) has become a standard topic in agentic architecture discussions, and anyone planning to use it should understand where it actually adds value and where it does not.
“If you knew all five workflows that your customers ever did in your application, well, then you should just statically code them up,” Nowlan reasons. “Make it a single API endpoint, take the inputs at the outset, chain them all together, and spit the outputs back out to your user. That would be highly efficient. You wouldn’t then need MCP or agents at all.”
The benefit of MCP is the emergent use case. That is the workflow your users need that you did not anticipate, in the sequence they need it, composed on the fly from the tools you have exposed. By giving the model a set of tools rather than a fixed pipeline, you allow the system to compose solutions for problems you never explicitly programmed for.
“If you have an application that’s very diverse, with a diverse set of data, with a wide-ranging type or class of data, and maybe even different user roles that are coming in and using your product, that’s where MCP can really be a massive unlock, because the composition artifacts are effectively infinite,” Nowlan observes.
The corollary is equally important. If your application is relatively rigid and your users reliably need the same five things, MCP introduces complexity without adding value. The decision to use it should follow directly from the diversity of your use cases, not from the fact that it is available.
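The trade-off can be reduced to a small sketch. This is not the MCP wire protocol; `fetch_order`, `format_summary`, and the plain-dict tool registry are illustrative stand-ins for the architectural contrast between a statically coded workflow and tools a model composes at runtime.

```python
from typing import Callable

# Stand-ins for two capabilities of a hypothetical application.
def fetch_order(order_id: str) -> dict:
    return {"id": order_id, "status": "shipped"}

def format_summary(order: dict) -> str:
    return f"Order {order['id']} is {order['status']}"

# Known, fixed workflow: chain the steps statically behind one endpoint.
# Highly efficient -- no agent or protocol needed.
def order_status_endpoint(order_id: str) -> str:
    return format_summary(fetch_order(order_id))

# Diverse, unanticipated workflows: expose the same capabilities as named
# tools and let a planner (an LLM in a real system) choose the sequence.
TOOLS: dict[str, Callable] = {
    "fetch_order": fetch_order,
    "format_summary": format_summary,
}

def run_plan(plan: list[str], initial):
    """Execute a runtime-chosen sequence of tools, piping each output forward."""
    value = initial
    for tool_name in plan:
        value = TOOLS[tool_name](value)
    return value
```

When the plan happens to match a known workflow, both paths produce the same answer; the registry only earns its complexity when plans you never coded for start appearing.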
The 80-20 rule for DAG versus autonomous orchestration
One of the more practically useful frameworks Nowlan describes is how his team decides when to use a fixed Directed Acyclic Graph (DAG)-based workflow versus an open-ended MCP-style loop, and how that decision evolves over time.
The starting point is openness. When a new feature or product area is introduced, the team tends to begin with the full MCP loop. The premise is straightforward. Here is the set of tools. Let the system compose whatever the user needs. This preserves maximum flexibility while the team learns how customers actually use the system.
Over time, patterns emerge. Certain workflows appear repeatedly. Those are the candidates for promotion into a DAG.
“There’s kind of an 80-20 rule here, where 80% of the time your customers are looking to solve their problems with 20% of these key workflows that you’ve identified,” Nowlan notes. “Those should then be translated into DAGs, into workflows, into a little bit stricter information flow architecture.”
The payoff is significant, Nowlan affirms. Once a workflow is on a DAG, you can optimise aggressively. You can use a cheaper model for nodes that do not require frontier-level reasoning. You can shed context that is irrelevant to a particular step rather than packing everything into the window. You can bound latency because the execution path is known in advance.
“When you get on that track, that predefined workflow, that’s where you can save cost, you can potentially use a cheaper model for a particular node that you know doesn’t need the most expensive model. And you can also shed information from context in that workflow when you know it’s not necessary for the outcome.”
The open-ended MCP loop remains available for everything else, for the 20% of use cases that do not fit a known pattern. The architecture supports both, and the team actively monitors usage to identify when a new workflow has crossed the threshold and is ready to be promoted. There is no firm cutoff, just the observation that a workflow is appearing often enough to justify the investment in making it faster and cheaper.
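A promoted workflow can be sketched as a list of typed nodes. The `Node` and `execute_dag` names are hypothetical, but they show the two optimisations Nowlan describes: pinning a cheaper model per step and shedding context a step does not need.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical promoted-workflow sketch: once a path is a DAG, each node can
# pin a model tier and declare exactly which context keys it needs.
@dataclass
class Node:
    name: str
    model: str                 # e.g. a cheaper model for non-frontier steps
    needs: tuple[str, ...]     # the only context this step receives
    run: Callable[[dict], str]

def execute_dag(nodes: list[Node], context: dict) -> dict:
    for node in nodes:
        # Shed everything the step does not need before the (stand-in) model call.
        scoped = {k: context[k] for k in node.needs}
        context[node.name] = node.run(scoped)
    return context

# Stand-ins for model calls at each node.
def extract(ctx: dict) -> str:
    return ctx["html"].upper()

def summarize(ctx: dict) -> str:
    return f"summary:{ctx['extract']}"

nodes = [
    Node("extract", model="small", needs=("html",), run=extract),
    Node("summarize", model="small", needs=("extract",), run=summarize),
]
result = execute_dag(nodes, {"html": "ok", "irrelevant": "x" * 10_000})
```

Because the path is fixed, latency and cost are bounded in advance: the `irrelevant` ten-kilobyte blob never reaches either node.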
The infinite loop problem
Bounding execution is one of the more fundamental challenges in agentic systems, and it is one that the industry is still working through. The early solutions were basic by necessity. At the most fundamental level, Nowlan explains, it comes down to two questions: has the system hit a timeout, and has it exceeded a set number of attempts.
The deeper issue is that the problem did not originate with modern agents. In the early days of GPT-3-era pipelines, a common approach was to ask the model for one action at a time, execute it, and return for the next. The loop would never terminate because the model had no reliable sense of when it was done.
“We quickly realized they would literally go on forever. The loop would just never terminate, would never know that it was done, or it would know that it was done way too early.”
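The two baseline bounds, a wall-clock timeout and a hard cap on attempts, can be sketched in a few lines. `next_action` and `is_done` are stand-ins for the model call and the completion check.

```python
import time

# Minimal sketch of the two baseline bounds: a wall-clock timeout and a
# hard cap on attempts.
def run_bounded(next_action, is_done, timeout_s=30.0, max_attempts=10):
    deadline = time.monotonic() + timeout_s
    for attempt in range(1, max_attempts + 1):
        if time.monotonic() > deadline:
            return {"status": "timeout", "attempts": attempt - 1}
        result = next_action()
        if is_done(result):
            return {"status": "done", "attempts": attempt}
    return {"status": "attempts_exhausted", "attempts": max_attempts}

actions = iter(["step", "step", "FINISH"])
outcome = run_bounded(lambda: next(actions), lambda r: r == "FINISH")
```

Crude as they are, these two checks guarantee the loop terminates even when the model never signals completion.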
The response was to shift from reactive step-by-step prompting to upfront scoping. Rather than asking the model what to do next after each action, you ask it at the outset to decompose the overall task into a bounded set of subtasks, and then you execute those subtasks in sequence and stop.
“We pivoted from the early days toward putting more bounds on the problem space. What are the things you think I should do? Break this down into a set of tasks, a set of subtasks, and then I’ll take each of those subtasks in turn, but when I get to that last subtask, I’m done. I’m not going back and asking you for more.”
As models have improved, this has relaxed somewhat. The team now allows a final evaluation pass, a double or triple check at the end, before closing the loop. But the underlying principle has not changed. Get the model to scope the work before execution begins, and treat that scope as a constraint rather than a suggestion.
“You want to get the AI to put bounds and scope around the overall work that you’ll be doing to solve or complete some task, and try to stick to that as a guide so that you don’t run off into the infinite space of querying forever.”
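The shift from reactive prompting to upfront scoping can be sketched as follows. `plan_fn` and `do_fn` stand in for model calls; the optional `final_check` reflects the evaluation pass newer models are allowed before the loop closes.

```python
# Sketch of upfront scoping: one planning call produces a bounded list of
# subtasks, and execution sticks to that list instead of re-prompting for
# "what next" after every step.
def scoped_execute(task, plan_fn, do_fn, final_check=None):
    subtasks = plan_fn(task)                 # decompose once, up front
    results = [do_fn(s) for s in subtasks]   # execute in order; no re-planning
    if final_check is not None:
        final_check(results)                 # optional closing evaluation pass
    return results                           # last subtask done -> loop ends

plan = lambda task: [f"{task}:step{i}" for i in range(3)]
results = scoped_execute("login-flow", plan, str.upper)
```

The plan is a constraint, not a suggestion: once the last subtask runs, execution stops rather than going back for more work.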
Testing non-deterministic flows
Testing a system where the execution path changes based on the model’s output requires a different approach from conventional integration testing. Nowlan and his team lean toward trace-based evaluation, grounded in a close understanding of realistic user inputs.
“All of our evaluations are, we think, we hope, reasonably close to the reality of those inputs that we’re going to get from our end users.”
The logic is that if you understand the domain well enough, you can anticipate the shape of the tasks users will bring to the system even if you cannot predict the exact inputs. In a web testing context, for example, that means understanding that users are going to log in, fill out forms, scroll through lists, and navigate between pages. Those logical actions become the basis for evaluations, and the evaluations can then check whether the agent followed a reasonable path to completion rather than whether it followed an exact predetermined path.
At the same time, Nowlan acknowledges that no evaluation suite fully covers the space of real-world inputs. The complement to pre-built evaluations is comprehensive logging and tracing of every prompt and response the system exchanges with the model.
“We log and trace all of our inputs and outputs that we exchange to the LLMs, and then we can go back and debug those, and examine those, and we can obviously use AI to probabilistically parse and understand those inputs and outputs.”
This creates a feedback loop that pre-built evaluations cannot replicate. When a user reports a bad experience or churns, the team can go back to the traces for that user, examine the prompt and response sequences, and use a model to evaluate whether the quality of the AI outputs degraded at any point. The evaluation happens after the fact, using the actual production inputs rather than synthetic ones.
Nowlan’s broader point is that evaluation in non-deterministic systems is not a gate you run before deployment. It is an ongoing process that runs in parallel with production, using real data to surface quality issues that no pre-deployment test suite would have caught.
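The tracing side of this can be sketched with a JSON-lines log. `log_exchange` and `replay_session` are hypothetical names; the pattern is simply recording every prompt/response pair so one user’s session can be reconstructed and evaluated after the fact.

```python
import io
import json
import time

# Sketch of trace logging: append every prompt/response exchange as one
# JSON line, keyed by session, so production traces can be re-examined later.
def log_exchange(sink, session_id, prompt, response):
    sink.write(json.dumps({
        "ts": time.time(),
        "session": session_id,
        "prompt": prompt,
        "response": response,
    }) + "\n")

def replay_session(log_text, session_id):
    """Recover the exchange sequence for one user to debug a reported issue."""
    return [
        rec for rec in map(json.loads, log_text.splitlines())
        if rec["session"] == session_id
    ]

# In production the sink would be a file or tracing backend; a buffer
# keeps the sketch self-contained.
buf = io.StringIO()
log_exchange(buf, "user-42", "click login", "clicked #login")
log_exchange(buf, "user-7", "scroll list", "scrolled")
trace = replay_session(buf.getvalue(), "user-42")
```

When a user churns, the replayed trace becomes the input to an after-the-fact model evaluation, using real production prompts rather than synthetic ones.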
🛠️ Tool of the Week
smolagents — Hugging Face’s open-source library for building agents that think in code
Highlights:
Code-first agents: Generates and executes Python code instead of structured tool calls, which can reduce back-and-forth with the model in some workflows.
Model-agnostic: Works with local Transformers models, Hugging Face Inference API, and providers via LiteLLM.
Supports sandboxing: Can run code in environments like Docker, E2B, or Pyodide, but requires proper setup for safety.
Hub integration: Tools and agents can be shared and reused via the Hugging Face Hub.
📎 Tech Briefs
Anthropic moves Claude Mythos into controlled early access under Project Glasswing - Limited rollout to partners including Amazon, Apple, and Microsoft for defensive security work, with early reports pointing to strong capability in identifying software vulnerabilities.
Microsoft Copilot Studio ships multi-agent orchestration to general availability - Now supports A2A protocol for agent-to-agent delegation and cross-app agent reuse via the Microsoft 365 Agents SDK.
Google releases Gemma 4 under Apache 2.0 - Ranges from lightweight edge models to a 31B-parameter variant with strong performance on reasoning, agentic workflows, and multilingual tasks.
LangGraph ships async subagents in latest Deep Agents update - Adds non-blocking background subagents alongside type-safe streaming and Pydantic coercion in the new v2 API.
Microsoft releases open-source Agent Governance Toolkit - A seven-package system that intercepts every agent action before execution at sub-millisecond latency, with native integrations for LangChain, CrewAI, Google ADK, and Microsoft Agent Framework.
That’s all for today. Thank you for reading this issue of Deep Engineering.
We’ll be back next week with more expert-led content.
Stay awesome,
Saqib Jan
Editor-in-Chief, Deep Engineering
If your company is interested in reaching an audience of senior developers, software engineers, and technical decision-makers, you may want to advertise with us.