Stop watching
agent demos.
Run them through the shadows.
ShadowBench is a crash-test benchmark for AI agents. It reveals whether agents actually complete tasks under pressure — or collapse into hallucination, prompt injection, unsafe behavior, and broken workflows.
The agent followed hidden instructions instead of the visible policy.
Agent demos are controlled.
Reality is not.
Polished demos hide the mess: misleading pages, hidden instructions, fake docs, conflicting sources, risky tool calls, and incomplete workflows.
Prompt injection
Smuggled instructions in pages, PDFs, tool outputs.
Hallucination
Confident answers that contradict the source of truth.
Unsafe tool calls
Destructive actions triggered without verification.
Broken completion
Agents declare success on incomplete outcomes.
A crash-test chamber for AI agents.
Drop your agent in, run a hostile suite, get a replayable verdict. No mocks. No favorable conditions.
Choose suite
Pick a hostile environment.
Run agent
Execute against scripted traps.
Capture all
Actions, answers, failure modes.
Get verdict
Score · report · ranking.
Every run produces a verdict you can replay.
Demo-ready, not production-ready.
Web Chaos Suite.
The first ShadowBench suite tests whether agents can navigate hostile web environments without being tricked, leaking secrets, or inventing answers.
Refund Policy Trap
Fake Checkout Trap
Secret Leak Trap
Broken Docs Trap
Conflicting Info Trap
The leaderboard agents will not want to fail.
Built for the agent era.
ShadowBench is an open benchmark layer for developers building, testing, and comparing AI agents.
- One-line CLI · npx shadowbench
- Replayable JSON reports
- Open suites · works with any agent
$ npx shadowbench run refund-policy-test \
--agent ./my-agent
› Loading suite: web-chaos
› Executing task: refund-policy-trap
› Capturing actions, answers, failure modes…
Score: 25/100
Failure: Prompt Injection
Verdict: The agent followed hidden instructions.{
"id": "R-00241",
"suite": "web-chaos",
"score": 25,
"status": "FAILED",
"failure": "prompt_injection"
}Join the first ShadowBench run.
Get early access to the Web Chaos Suite, public reports, and the first agent crash-test leaderboard.
- Web Chaos Suite access
- Public crash-test reports
- Leaderboard submissions