How to Test Non-Deterministic AI Agents (Sponsored)

Same input. Same prompt. Different output. That's the reality of testing AI agents that write code, and most teams are shipping without solving it.
The post covers how to test against real project structures, score output that's different every time, and catch when your agent makes up methods that don't exist.

Every growing engineering organization eventually discovers the same problem. When something breaks in production, engineers debug it. When it breaks again, they debug it again. Hundreds of teams, thousands of incidents, each one investigated mostly from scratch. The experienced engineer knows where to look and what patterns to check, but that knowledge lives in their head, not in the system. Over time, runbooks go stale, and scripts that one person wrote become tribal knowledge.

Meta hit this wall years ago. Their answer was DrP, a platform that lets engineers turn investigation expertise into actual code: a software component that runs automatically, gets tested through code review, and improves over time. It now runs across 300 teams and executes 50,000 automated analyses daily. Meta's specific tool is interesting to learn about, but the underlying principle is even more insightful: debugging itself can be engineered. In this article, we will look at how DrP works at a high level and the design choices Meta made while building it.

Disclaimer: This post is based on publicly shared details from the Meta Engineering Team. Please comment if you notice any inaccuracies.

Why Manual Investigation Breaks Down

The way most teams investigate incidents has a predictable failure mode, and writing better documentation doesn't fix it.

Knowledge is trapped in people. Your best debugger carries mental models nobody else has: which services are flaky, which metrics actually matter, and which dashboards lie under certain conditions. When that person is asleep, on vacation, or has left the company, the knowledge is gone. If you've ever been paged at 2 AM and wished someone had already figured this out the last time it happened, you've felt this problem.
Systems change constantly, sometimes dozens of times a day. The runbook that was accurate last month now references a dashboard that was renamed and a service that was refactored. Modern software moves too fast for static documentation to keep up.

Teams often write one-off scripts to automate their own checks, and those are better than nothing. But the scripts can't cross service boundaries, and they aren't tested systematically. Ultimately, they become their own form of tribal knowledge: useful to the author and opaque to everyone else.

These problems aren't unique to Meta. Every organization at a certain scale hits the same wall, and the industry has approached it in different ways. Some companies focus on coordinating people better during incidents (Netflix built and open-sourced Dispatch for exactly this), others focus on automating the investigation itself (Meta's approach), and still others are leaning into AI-driven diagnostics. In other words, there are at least three distinct layers to incident response: coordination, investigation, and diagnosis.
Meta invested deepest in the investigation layer, and its approach is worth studying because it has been in production for over five years at massive scale. You can also think of this as a maturity progression: from tribal knowledge, to wiki runbooks, to ad-hoc scripts, to testable analyzers, to a composable platform. Most teams are stuck around step two or three; DrP represents step five.

4 engineering workflows where AI agents have more to offer (Sponsored)

AI has changed how engineers write code. But 70% of engineering time isn't spent writing code; it's spent running it. Most teams still rely on manual work to triage alerts, investigate incidents, debug systems, and ship with full production context. A new wave of engineering orgs, including those at Coinbase, Zscaler, and DoorDash, are deploying AI agents specifically in their production systems. This practical guide covers the four workflows where leading teams are seeing measurable impact, and what the human-agent handoff actually looks like in each one.

Treating Investigation as Software