Deep Engineering #49: David Knickerbocker on Open Source Intelligence and Real-World AI Systems
Why messy, contradictory data changes how engineers should think about retrieval, judgment, and production AI
✍️ From the editor’s desk,
Welcome to the 49th issue of Deep Engineering!
Earlier this month, CISA and its international cybersecurity partners released Careful Adoption of Agentic AI Services, a guide for organisations adopting AI systems that can plan, use tools, access data, and act across digital environments. Those capabilities change the risk model: AI systems operating inside real workflows inherit risk from the surrounding data, permissions, tools, and context.
That risk is becoming easier to understand in practice. On 23 May 2026, Rohan Pandey of DigitalOcean and Archit Bhujang of Arizona State University published Poisoning the Watchtower, which shows how logs, alerts, URLs, payloads, DNS queries, and usernames can carry attacker-written instructions into LLM-assisted security workflows.
AI systems do not only consume clean prompts from users. They consume context from operational systems, open web sources, documents, logs, tools, and knowledge bases that the model does not control. Once that context includes contradiction, deception, malicious text, or attacker-controlled content, relevance alone becomes an unsafe target for retrieval and summarisation.
David Knickerbocker, founder of Verdant Intelligence and author of Network Science with Python (Packt), builds systems for Open Source Intelligence (OSINT) environments where messy and adversarial data is normal. His perspective matters in this issue because the systems he builds separate observing from judging, treat claims as claims rather than facts, and preserve minority signals that simpler retrieval pipelines often discard.
In issue 43, we looked at Knickerbocker’s work on real-time knowledge graphs and AI systems that treat knowledge as a live stream of claims. Today’s issue continues that conversation with what OSINT teaches engineers about messy, adversarial data. You can also watch our interview or read the full Q&A here.
Let’s get started.
Expert Insights
Building AI Systems That Handle Contradiction at Scale
by Saqib Jan with David Knickerbocker
Most engineers building AI systems have never had to question whether their data source is working against them. The data arrives, is processed and retrieved, and the system responds. The assumption underneath all of that is that the source is cooperative: that it was created to convey information accurately, stored in a format designed for retrieval, and that what comes back when you query it is at least an honest attempt at an answer. The problem is that this assumption is so embedded in how most AI systems are designed that it never gets examined.
David Knickerbocker, founder of Verdant Intelligence and author of Network Science with Python (Packt), builds systems for environments where data is not clean, settled, or cooperative. His AI systems ingest from the open web, across sources that contradict each other, where some information may be misleading, incomplete, or adversarial. The engineering challenge is not only making retrieval accurate. It is making the system useful when the real world refuses to behave like a clean dataset.
The assumption that data is helpful
Engineers who have worked primarily with internal databases, structured APIs, or carefully assembled training sets carry a baseline assumption that data is cooperative. It was created to convey information accurately, stored in a format designed for retrieval, and accessed through interfaces that return what was asked for. The job of the retrieval system is to find the right thing efficiently.
Open-source intelligence does not work this way. When ingesting from the open web at scale, some fraction of what arrives is wrong, some is deliberately misleading, and some represents one side of a contested claim. For Knickerbocker, the ingestion layer is not the right place to decide what is true. “You can have two different groups that are in opposition from each other,” he says. “One group will say this is the truth, and another group will say this is the truth, and they will be in direct conflict with each other.” The system’s job, in that moment, is to capture what is being claimed and preserve enough context for judgment to happen later.
“The real world is a messy space. It is not just that websites disagree with each other. Websites also have malware. If you point your servers at websites and you just download everything that is on them, then you need to be prepared for the consequences of downloading malware.”
The practical design response is to treat the system as an observer rather than an adjudicator. Knickerbocker draws that line clearly. “My systems do not care who is right or wrong,” he explains. “They just do not. My systems are observers.” The point is not neutrality as a value statement. It is an architectural boundary. The system captures what is being said, keeps competing claims visible, and avoids collapsing observation into judgment too early.
This distinction matters far beyond open-source intelligence. Any AI system that draws on user-generated content, social media, news, or unstructured enterprise data is working with material that was not created to be machine-readable and was not vetted before ingestion. The assumption that the data is trying to help is not just wrong in those environments. It is a liability.
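To make the observer boundary concrete, here is a minimal sketch of an ingestion record that stores what was claimed and by whom, and nothing about whether it is true. It is an illustration only, not Knickerbocker's implementation: the `Claim` and `ClaimLog` names, their fields, and the example sources are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Claim:
    """One observed statement. Deliberately has no 'is_true' field."""
    source: str            # who said it (hypothetical identifier)
    subject: str           # what the claim is about
    statement: str         # what was claimed, recorded as stated
    observed_at: datetime  # when the system saw it


@dataclass
class ClaimLog:
    """Append-only store: ingestion records claims and never resolves them."""
    claims: list[Claim] = field(default_factory=list)

    def observe(self, source: str, subject: str, statement: str) -> Claim:
        claim = Claim(source, subject, statement, datetime.now(timezone.utc))
        self.claims.append(claim)
        return claim

    def about(self, subject: str) -> list[Claim]:
        # Return every claim on the subject, including ones that conflict.
        return [c for c in self.claims if c.subject == subject]


log = ClaimLog()
log.observe("group-a.example", "bridge-closure", "The bridge is closed.")
log.observe("group-b.example", "bridge-closure", "The bridge is open.")
print(len(log.about("bridge-closure")))  # 2
```

Because the record has nowhere to store a verdict, any judgment about which group is right has to live in a later, separate step.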
Bigger clusters are not more important than smaller ones
One of the quieter failures in production NLP systems is the treatment of minority signals as noise. A similarity-based retrieval system returns the most representative results, which in practice means the most common results. A clustering pipeline that surfaces the largest groups first will consistently deprioritize small but significant signals. In a world where the interesting thing is often the outlier, that is a serious problem.
In open-source intelligence specifically, this failure mode has consequences. A small cluster of claims pointing toward something dangerous is not less important because it is small. A single source saying something that contradicts the majority view is not less worth capturing because it is in the minority. “Bigger clusters are not more important than smaller clusters,” Knickerbocker observes. “In open source intelligence, everything matters, top to bottom.”
Drawing from his engineering experience building these systems, Knickerbocker ensures his APIs return full context rather than a ranked shortlist. “If you use a tool to do a search to find out something, you are getting a snapshot of time,” he says. His systems are designed to capture what he calls the heartbeat of the internet. “If I use my API... it is going to come back with 10,000 things. My APIs do not return 10. They return full context.” That creates a harder downstream problem because the question is no longer how to retrieve the best few results. It is how to make a large, shifting body of claims usable without discarding the signals that do not look dominant at first.
The parallel for general AI systems is specific and direct. Any retrieval or summarisation pipeline that privileges majority signal is making a judgment call that the most common view is the most relevant one. That judgment call is often wrong, and it is invisible because the discarded minority signal never surfaces.
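The effect described in this section is easy to reproduce. The sketch below uses made-up cluster labels and posts (all hypothetical, standing in for the output of an upstream clustering step) to show how ranking by cluster size silently drops a one-item cluster, while a full-context response keeps it and reports size as metadata rather than using it as a filter.

```python
from collections import defaultdict

# Hypothetical (cluster_label, text) pairs from an upstream clustering step.
observations = (
    [("product-launch", f"post {i} about the launch") for i in range(5)]
    + [("pricing", f"post {i} about pricing") for i in range(3)]
    + [("credential-leak", "one post claiming leaked credentials")]
)


def group_by_cluster(obs):
    clusters = defaultdict(list)
    for label, text in obs:
        clusters[label].append(text)
    return dict(clusters)


def top_k_by_size(clusters, k):
    # Majority-first ranking: small clusters fall off the end.
    ranked = sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)
    return dict(ranked[:k])


def full_context(clusters):
    # Every cluster is returned; size is reported, not used to discard.
    return [
        {"cluster": label, "size": len(items), "claims": items}
        for label, items in clusters.items()
    ]


clusters = group_by_cluster(observations)
print(list(top_k_by_size(clusters, 2)))  # ['product-launch', 'pricing']
print(len(full_context(clusters)))       # 3
```

The single `credential-leak` post never appears in the top-2 view, and nothing in that view signals that it was dropped.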
The difference between a claim and a fact
Engineers trained on factual datasets tend to build systems that treat retrieved content as facts to be combined and presented. The underlying assumption is that if the source is credible and the retrieval is accurate, what comes back is true. In a contested information environment, that assumption collapses, and the design has to change with it.
Knickerbocker’s approach separates the task of capturing claims from the task of evaluating them. What a source says is observable. Whether it is correct is a judgment that depends on context, corroboration, and often human expertise the system does not have. That evaluation belongs to a different layer, and Knickerbocker is careful not to build it into the first act of ingestion. “I do not make that decision, and I do not allow my AI to make the decision what is true or what is false either,” he says. “I am more interested in what people are claiming is what is going on in the world.”
This design choice has a significant downstream consequence. It means the system can handle contradiction without breaking. Two sources saying opposite things about the same event are not a problem to resolve at the retrieval layer. They are two data points, both of which belong in the response. Knickerbocker simply logs these varied claims as parallel ribbons of information. The human or the downstream system that receives them can then apply judgment about which to act on, in what context, and with what confidence.
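One way to picture claims kept side by side is a response that groups them by event and marks the event as contested instead of choosing a winner. This is a rough sketch under stated assumptions: the dictionary shape, the `ribbons` function name, and the rule that more than one distinct statement means "contested" are all illustrative simplifications, not a description of Knickerbocker's system.

```python
from collections import defaultdict

# Hypothetical claims; 'event' and 'source' values are placeholders.
claims = [
    {"event": "event-1", "source": "source-a", "statement": "The shipment arrived."},
    {"event": "event-1", "source": "source-b", "statement": "The shipment never arrived."},
    {"event": "event-2", "source": "source-a", "statement": "The office reopened."},
]


def ribbons(claims):
    """Group claims per event and keep all of them, conflicting or not."""
    by_event = defaultdict(list)
    for claim in claims:
        by_event[claim["event"]].append(claim)

    response = {}
    for event, group in by_event.items():
        # Crude assumption: differing statements mean the event is contested.
        distinct = {c["statement"] for c in group}
        response[event] = {"contested": len(distinct) > 1, "claims": group}
    return response


result = ribbons(claims)
print(result["event-1"]["contested"])    # True
print(len(result["event-1"]["claims"]))  # 2
```

The `contested` flag tells the consumer that judgment is needed; it does not say which source to believe.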
The verification boundary
One of the hardest design decisions in any AI system that works with real-world data is where to draw the line between surfacing an insight and making an actionable claim. The two feel similar at the output layer but require very different things from the system that produces them.
In our live interview, Knickerbocker was specific about where that line sits. “Everything that I do is intentional,” he shares. His real-time intelligence layer is built for awareness. It captures what is happening and surfaces it without making the final judgment on what should be done next. If a piece of intelligence looks actionable, the system does not automatically act on it. It surfaces the signal so a human or downstream process can decide whether it matters, who should see it, and what level of confidence is appropriate.
“There are still certain parts that I like being a human being. Some things you just need to be aware of. Like, you do not need to respond to everybody. But it is good to know what is on the radar.”
In practice, that leaves routing as a problem of its own. Getting surfaced intelligence to the right person or the right downstream process is a separate engineering and organisational problem, and conflating it with the retrieval problem produces systems that are either too conservative to be useful or too confident to be trusted.
This is a principle with broad application. AI systems that are asked to be both the observer and the actor tend to perform neither role well. Keeping the observation layer and the action layer separate, with a clear boundary between them, is one of the most reliable ways to build something that stays trustworthy as it scales.
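A small sketch of that boundary, with the observation layer and the routing layer as separate functions. Everything here is hypothetical: the `Signal` type, the keyword check that stands in for real detection logic, and the `notify` callback that stands in for whatever delivers a signal to a person or downstream process.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Signal:
    summary: str
    looks_actionable: bool


def observe(raw_items: list[str]) -> list[Signal]:
    """Observation layer: describes and flags, but has no way to act."""
    return [
        # Placeholder heuristic; a real system would use its own detection.
        Signal(summary=item, looks_actionable="urgent" in item.lower())
        for item in raw_items
    ]


def route(signals: list[Signal], notify: Callable[[Signal], None]) -> int:
    """Routing layer: decides who sees a signal. It still takes no action."""
    sent = 0
    for signal in signals:
        if signal.looks_actionable:
            notify(signal)  # hand off to a human or downstream process
            sent += 1
    return sent


inbox: list[Signal] = []
signals = observe(["Urgent: new phishing domain seen", "Routine forum chatter"])
print(len(signals))                  # 2
print(route(signals, inbox.append))  # 1
```

Both items remain visible as signals; only one is pushed to someone's attention, and what happens next is decided outside this code.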
Entity extraction gets easier but never clean
Entity extraction from clean text is comparatively well understood. The models are good, the cleanup is manageable, and the output is reliable enough for most downstream uses. Entity extraction from the open web at scale is a different challenge, not because the models are worse but because the data has properties that laboratory text does not.
Knickerbocker began this work in 2015, starting with part-of-speech tagging before NER models were mature, moving to spaCy as those models improved, and more recently using LLMs for extraction. The trajectory is one of improving reliability rather than changing fundamentals. “Entity extraction has improved a lot since 2015,” he notes. “I mostly have to just throw away less. I have less cleaning to do, and it gets things right a lot easier.”
What has not changed is the messiness. Natural language processing at scale on real-world text always produces noise. The question is how much noise is acceptable for the downstream use case and how to handle the cleaning efficiently. At the scale he describes from previous work, including entity extraction across internet-scale datasets, the cleaning cannot be purely manual. It has to be part of the pipeline rather than an editorial step applied after the fact.
He also flags a risk in the current extraction approach that is worth understanding. Older NLP models produced visible noise that engineers learned to catch and correct. LLM-based extraction produces outputs that look clean even when they are wrong, because the model is good at generating confident-looking text regardless of underlying accuracy.
“LLMs are a little bit dangerous because the messiness goes away. People are a little bit more trusting of LLMs than older NLP. When you are using LLMs, everything just looks perfect. And that is kind of a dangerous downside too.”
The implication for engineers is that moving to LLMs for extraction does not reduce the need for validation. It makes validation harder to remember because the outputs no longer look like they need it.
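One cheap validation step, sketched here as an assumption rather than anything the interview prescribes, is to check that each extracted entity actually appears in the source text before trusting it. The `validate_entities` function and the sample text are hypothetical, and a literal substring match is only a first filter: it will not catch an entity that is present in the text but mislabelled.

```python
def validate_entities(source_text: str, extracted: list[str]) -> dict[str, list[str]]:
    """Split extracted entities into those found in the text and those not."""
    haystack = source_text.casefold()
    grounded: list[str] = []
    ungrounded: list[str] = []
    for entity in extracted:
        needle = entity.strip().casefold()
        # An empty string or a name absent from the text is treated as ungrounded.
        if needle and needle in haystack:
            grounded.append(entity)
        else:
            ungrounded.append(entity)
    return {"grounded": grounded, "ungrounded": ungrounded}


text = "Acme Corp opened an office in Lisbon."
# Pretend this list came back from an extraction model.
model_output = ["Acme Corp", "Lisbon", "Globex"]

report = validate_entities(text, model_output)
print(report["grounded"])    # ['Acme Corp', 'Lisbon']
print(report["ungrounded"])  # ['Globex']
```

Running a check like this inside the pipeline, rather than as an afterthought, restores some of the visible noise that older NLP tooling produced on its own.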
Building for the world that actually exists
The thread running through Knickerbocker’s work is a commitment to grounding. He builds systems for the world as it is, not a cleaned version of it. That leads to a specific set of design choices: treat data as claims rather than facts, preserve minority signals, separate awareness from judgment, and let the system observe before any person or downstream workflow decides what to do next.
Those principles come from the kinds of environments Knickerbocker has worked in: data operations, cybersecurity, open-source intelligence, and production systems where the cost of getting something wrong is real. “The real world is a messy space,” he says. “Natural language processing is just messy. I have not seen it get really cleaned up yet.”
For engineers who have worked mostly with clean internal systems, that might sound like a warning about a narrow class of hard problems. It is broader than that. Any AI system that deals with content created by people, pulled from the web, generated by users, or routed through operational systems eventually has to confront the same condition. Real-world data is messy by default. The systems that handle it well are intentionally designed for that mess before it becomes a production failure.
In case you missed it
Deep Engineering #43: David Knickerbocker on Building AI That Sees the World as It Is, Not as It Was
Real-time knowledge graphs, awareness before truth, and why an empty dataset is better than a hallucination
Knowledge Graphs, GraphRAG, and Real-Time AI in Production with David Knickerbocker
This conversation with David Knickerbocker keeps returning to a single conviction: the best engineering starts with intentional problem definition, and most AI failures happen when teams rush to use a tool before understanding what they are actually trying to build.
🛠️ Tool of the Week
GraphRAG — A graph-based retrieval pipeline for unstructured text
GraphRAG helps teams preserve relationships across messy documents, conflicting claims, and large text collections before asking an LLM to answer.
Highlights:
Builds knowledge graphs from unstructured text instead of relying only on isolated chunks.
Links entities, claims, and topics so retrieval can use structure, not just similarity.
Supports local and global search for both narrow evidence lookup and corpus-level synthesis.
Gives engineers a practical starting point for testing graph-based RAG patterns.
📎 Tech Briefs
Claude Compliance API Integrations - Compliance API integrations help IT and security teams govern Claude across connected enterprise workflows.
MCP Events Working Groups - Gateway, transport, registry, and agents groups advanced protocol work around tool-connected AI systems.
RAGFlow v0.25.6 - Browser agents and RAPTOR AHC mode expand RAGFlow from document retrieval into web-aware ingestion workflows.
Qdrant v1.18.1 - Vector dimension validation before WAL writes reduces ingestion failure risk during async upserts.
Weaviate v1.38.0-rc.0 - Nested object filtering and namespace support improve retrieval precision for structured, multi-tenant corpora.
That’s all for today. Thank you for reading this issue of Deep Engineering.
We’ll be back next week with more expert-led content.
Keep building,
Saqib Jan
Editor-in-Chief, Deep Engineering