Red Teaming Agentic Systems Is Not Red Teaming Models
When teams talk about AI red teaming, they usually mean model red teaming: adversarial prompts against a deployed model to find jailbreaks, policy violations, harmful generations. This work is valuable and difficult. It is also not what is needed to secure an agentic system.
An agentic-system red team is a different engagement, with a different set of skills, looking for a different class of findings. The organizations treating them as the same thing are producing reports that look rigorous and miss the failure modes that actually matter.
What each engagement actually looks for
A model red team is looking for inputs that cause the model to produce outputs outside its intended behavior. The unit of work is a prompt. The unit of finding is a prompt-response pair demonstrating a violation. The success criteria are defined by the model's policy.
An agentic-system red team is looking for chains of plausible actions that end with a consequential failure of the deployed system. The unit of work is a scenario — an attacker posture, an entry point, a goal. The unit of finding is a sequence of actions, each of them individually unremarkable, that together move the system into a failed state. The success criteria are defined by the system's operational commitments.
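The contrast in the unit of finding can be made concrete. Below is a minimal sketch of the two report artifacts as data structures; every field name and the example values are illustrative, not drawn from any real engagement:

```python
from dataclasses import dataclass, field

@dataclass
class ModelFinding:
    """A model red team finding: one prompt, one violating response."""
    prompt: str
    response: str
    policy_violated: str

@dataclass
class AgenticFinding:
    """An agentic red team finding: a chain of individually allowed
    actions that ends in a consequential failure of the system."""
    attacker_posture: str          # where the attacker starts
    entry_point: str               # how attacker input reaches the system
    goal: str                      # the failed state being demonstrated
    action_chain: list[str] = field(default_factory=list)
    commitment_violated: str = ""  # the operational commitment that broke

finding = AgenticFinding(
    attacker_posture="external, no credentials",
    entry_point="inbound support email processed by the agent",
    goal="customer records leave the tenant boundary",
    action_chain=[
        "agent reads attacker-controlled email",
        "agent calls a legitimate CRM lookup tool",
        "agent summarizes the records into an outbound reply",
    ],
    commitment_violated="customer data never leaves the tenant",
)
```

Note that nothing in the agentic artifact is a prompt-response pair; the scenario fields (posture, entry point, goal) and the ordered chain are the finding.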
The first engagement asks "will this model say something it shouldn't?" The second asks "can an attacker, starting from where attackers actually start, reach a state where the business cannot recover?"
Why model red teams miss agentic failures
A model red team finding is usually local to the model interaction: a prompt went in, a forbidden response came out. Fixing it is a matter of training, filtering, or reprompting.
An agentic failure is almost never local. The prompt injection that initiates the chain may be unremarkable in isolation. The tool call it triggers may be a legitimate tool call under normal circumstances. The data access it enables may be access the agent legitimately has. The exfiltration destination may be a URL that would pass any individual filter. Each step is allowed. The composition is the problem.
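The point can be demonstrated in a few lines. This is a toy sketch, with invented action names and an invented per-step filter: every step in the chain passes the individual check, yet a chain-level invariant is violated by the composition:

```python
# Hypothetical per-step allowlist; each action below is "legitimate".
ALLOWED_ACTIONS = {"read_email", "crm_lookup", "http_get", "send_reply"}

def step_filter(action: str, arg: str) -> bool:
    """Per-step check: a known action with a superficially plausible argument."""
    return action in ALLOWED_ACTIONS and "DROP TABLE" not in arg

chain = [
    ("read_email", "attacker-controlled message"),  # ordinary input
    ("crm_lookup", "customer=acme"),                # access the agent has
    ("send_reply", "summary of customer records"),  # normal tool use
]

# Every step is individually allowed...
assert all(step_filter(action, arg) for action, arg in chain)

# ...but a chain-level invariant catches the composition: sensitive data
# read after untrusted input must not flow to an outbound channel.
def chain_violates(chain) -> bool:
    tainted = sensitive = False
    for action, _ in chain:
        if action == "read_email":
            tainted = True
        if action == "crm_lookup" and tainted:
            sensitive = True
        if action in {"http_get", "send_reply"} and sensitive:
            return True
    return False

assert chain_violates(chain)
```

The per-step filter never fires; only a check that tracks state across the whole sequence sees the problem. That is the shape of the finding an agentic red team is hunting for.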
A team tuned to find individual policy violations is not tuned to find these compositions. To such a team the artifacts look wrong: no single jailbreak, no striking prompt, nothing to paste into a slide. The findings read like "given this particular sequence of legitimate actions, the system reaches a bad state," which is harder to report and harder to fix than "the model produced an unsafe output."
What an agentic red team needs
The team needs system access, not just model access. They need to be able to run sessions, observe tool calls, watch the agent's authority envelope in action, and iterate against the real deployment, not only against the model endpoint.
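In practice this means the deployment has to be instrumented so the team can see the tool layer, not just the transcript. A minimal sketch of what that instrumentation might look like; the registry, tool names, and session shape are all assumptions for illustration:

```python
from datetime import datetime, timezone

class ObservedToolbox:
    """Wraps a dict of tool callables and records every invocation,
    so a red team session yields a reviewable tool-call trace."""

    def __init__(self, tools):
        self._tools = tools
        self.log = []

    def call(self, name, **kwargs):
        record = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "tool": name,
            "args": kwargs,
        }
        result = self._tools[name](**kwargs)
        record["result_preview"] = str(result)[:80]
        self.log.append(record)
        return result

# Hypothetical tool registry for one session.
toolbox = ObservedToolbox({
    "crm_lookup": lambda customer: f"records for {customer}",
})
toolbox.call("crm_lookup", customer="acme")
```

With a trace like `toolbox.log` in hand, the team can replay sessions and ask which calls the agent's authority envelope actually permitted, rather than inferring it from model output.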
They need to think in terms of chains rather than prompts. "What sequence of inputs, each plausible in isolation, moves this system toward a bad state?" is the primary question. The model interactions are one layer of the chain; the tool layer, the credential layer, the data layer, and the downstream system layer are the others.
They need to understand the system's commitments — not only "what should this system never output" but "what should this system never cause." The second list is longer, more specific to the deployment, and usually not written down anywhere the red team can see it. Producing it is part of the engagement.
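One way to produce that list is to write the "never cause" commitments down as predicates over observed session effects, checked after each run. A toy sketch, with commitments and effect records invented for illustration:

```python
# Hypothetical "never cause" commitments as predicates over a session's
# observed effects. Real commitments are specific to the deployment.
COMMITMENTS = {
    "no refunds above approval limit":
        lambda effects: all(e.get("refund", 0) <= 500 for e in effects),
    "customer data stays in tenant":
        lambda effects: not any(e.get("egress_external") for e in effects),
    "no destructive ops without sign-off":
        lambda effects: all(e.get("approved", False) for e in effects
                            if e.get("destructive")),
}

def broken_commitments(effects):
    """Return the commitments this session's effects violated."""
    return [name for name, holds in COMMITMENTS.items() if not holds(effects)]

session_effects = [
    {"refund": 1200},              # over the invented approval limit
    {"egress_external": False},    # no external egress observed
]
assert broken_commitments(session_effects) == ["no refunds above approval limit"]
```

The value of the exercise is less the checker than the list itself: writing the predicates forces the deployment's owners to state what the system must never cause.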
The practical implication
Teams planning to red team an agentic system should plan for an engagement that looks more like a traditional penetration test with AI-adjacent skills bolted on than like a model red team with system access granted. The scope document should look different. The deliverables should look different. The reviewers who will act on the findings should be different.
Most of the "AI red team" work available on the market today is still the first kind. It is useful for what it is. It does not substitute for the second kind.
If the distinction is not clearly drawn in your next engagement's scope, draw it before the work starts. You will get findings that match the scope you asked for, not the scope you needed.