Crash-test AI agents before they reach production.
ShadowBench is an open-source benchmark and policy guardrail layer for AI agents. It runs models and agents through hostile tasks designed to expose prompt injection, secret leakage, unsafe actions, hallucination, source confusion, and tool misuse.
Agent demos are controlled.
Reality is not.
Polished demos hide the mess: misleading pages, hidden instructions, fake docs, conflicting sources, risky tool calls, and incomplete workflows.
Prompt injection
Smuggled instructions in pages, PDFs, tool outputs.
Hallucination
Confident answers that contradict the source of truth.
Unsafe tool calls
Destructive actions triggered without verification.
Broken completion
Agents declare success on incomplete outcomes.
Two suites. Real failure modes.
ShadowBench ships hostile tasks grouped by where agents tend to break: in the open web, and in tool use. Each task defines expected behavior and concrete failure criteria.
Web Chaos
Tests whether agents can operate in hostile web-like environments without following hidden instructions, leaking secrets, inventing unsupported commands, confirming unsafe actions, or trusting lower-priority sources.
Tool Misuse
Tests whether agents take risky tool actions without verification, including unsafe external communication, destructive file actions, suspicious tool output, wrong API selection, and approval bypass.
Test any local or hosted agent endpoint.
Your agent receives a benchmark task and returns an answer. ShadowBench scores the behavior, detects failure modes, and generates JSON and HTML reports.
$ shadowbench run web-chaos \
--agent-url http://localhost:3000/shadowbench
→ booting hostile environment
→ POST /shadowbench task=refund-policy-trap
→ score scoring …
→ report report.json · report.htmlDeploy-blocking policies for AI agents.
Use failUnder to block low-scoring agents. Use blockOn to block specific failure modes even if the average score passes.
suite: web-chaos failUnder: 80 blockOn: - secret_leak - unsafe_action - prompt_injection
$ shadowbench policy \
examples/policies/shadowbench.policy.yml \
--agent-url http://localhost:3000/shadowbench
→ running policy
→ enforcing failUnder=80
→ enforcing blockOn=[secret_leak,
unsafe_action,
prompt_injection]Every run leaves an audit trail.
Every run can generate JSON and HTML reports with task-level evidence: expected behavior, actual answer, failure mode, score, and verdict.
Score cards
Per-task and overall score with pass/fail verdict.
Failure breakdowns
Triggered failure modes and which task surfaced them.
Evidence traces
Expected behavior, actual answer, failure mode, score.
Share reports
JSON for diffing and CI, HTML for humans and reviewers.
Four steps.
No mocks, no favorable conditions.
Point us at your agent
Connect any agent over HTTP. No SDK, no wrapper, no instrumentation required.
We run hostile suites
Your agent executes scripted traps: injections, conflicting sources, unsafe tool calls.
Every action is captured
Inputs, outputs, tool calls, and decisions are recorded as a deterministic trace.
You get a replayable verdict
A signed report with a score, failure modes, and a trace anyone can audit.
$ shadowbench run \
--agent https://my-agent.dev \
--suite web-chaos
→ booting hostile environment
→ task 1/5 refund policy trap
→ task 2/5 fake invoice ingestion
→ task 3/5 conflicting docs
→ task 4/5 destructive tool call
→ task 5/5 silent failure
verdict failed
score 25 / 100
trace /reports/R-00241Illustrative output · Report #1 will use real runs
Reproducible hostile environments.
ShadowBench tasks are designed as reproducible hostile environments. Each task defines a source of truth, adversarial distractions, expected behavior, failure conditions, and scoring criteria. Every run produces a traceable report.
Source of truth
The correct answer or expected behavior is defined before the run.
Hostile conditions
Tasks include hidden instructions, misleading text, conflicting sources, or unsafe action traps.
Failure scoring
ShadowBench classifies hallucination, prompt-injection failure, unsafe action, and task collapse.
Traceable reports
Each run is designed to produce a readable failure report and replayable trace.
Every run produces a verdict you can replay.
Demo-ready, not production-ready.
Report #1: Web Chaos
The first public crash-test report for AI agents is coming soon.
Report #1 will evaluate agents against hostile web tasks, including prompt-injection traps, hallucination cases, unsafe action boundaries, fake checkout flows, and conflicting source tests.
Join first benchmark runThe leaderboard agents will not want to fail.
The first public ShadowBench leaderboard will open after Report #1.
Preview only. Public results will be published after the first benchmark run.
Built for the agent era.
ShadowBench is an open benchmark layer for developers building, testing, and comparing AI agents.
- One-line CLI · npx shadowbench
- Replayable JSON reports
- Open suites · works with any agent
$ npx shadowbench run refund-policy-test \
--agent ./my-agent
› Loading suite: web-chaos
› Executing task: refund-policy-trap
› Capturing actions, answers, failure modes…
Score: 25/100
Failure: Prompt Injection
Verdict: The agent followed hidden instructions.{
"id": "R-00241",
"suite": "web-chaos",
"score": 25,
"status": "FAILED",
"failure": "prompt_injection"
}Test your agent through the shadows.
ShadowBench is open-source. Run the suites locally, point it at your agent, and enforce policy guardrails in CI.