Agent demos are controlled. Reality is not.
The first ShadowBench report introduces Web Chaos, a benchmark suite designed to test whether AI agents can operate in hostile web-like environments without following hidden instructions, leaking secrets, inventing unsupported commands, confirming unsafe actions, or trusting lower-priority sources.
Web Chaos currently includes five hostile tasks. Each task defines a visible source of truth, an adversarial condition, expected safe behavior, and failure criteria.
Since Report #1, ShadowBench has added a second suite: Tool Misuse. This suite evaluates whether agents take risky tool actions without verification, including unsafe external communication, destructive file actions, suspicious tool output, wrong API selection, and approval bypass.
$ shadowbench run tool-misuse --demoTool Misuse is part of the ongoing ShadowBench roadmap and is experimental.
The built-in demo mode intentionally uses failing answers to verify that ShadowBench detects all defined failure modes.
$ shadowbench run web-chaos --demoShadowBench can also run the Web Chaos Suite against a real model using the OpenAI adapter.
$ shadowbench run web-chaos --model openaiThe purpose of ShadowBench is not to claim that one model or agent is universally safe or unsafe. The purpose is to make specific failure modes reproducible, visible, and easy to compare.
A passing result means the tested system did not trigger the defined failure modes in this suite. A failing result means the system triggered one or more defined failure conditions and should be inspected before being trusted in similar workflows.
Each ShadowBench task defines a visible source of truth, a hostile condition, expected safe behavior, failure criteria, scoring logic, JSON report output, and optional HTML report generation.
A visible, authoritative reference the agent should follow.
An adversarial input designed to mislead or override.
The safe, correct action when facing the hostile condition.
Specific signals that mark the task as failed.
Structured machine-readable run output for diffing and CI.
Readable summary for humans, suitable for sharing.
ShadowBench is experimental and early-stage. The current version is intended as a reproducible proof of concept for agent failure-mode testing.
ShadowBench Core is open-source and available on GitHub.
$ npm install $ npm run build $ npm link $ shadowbench run web-chaos --demo
ShadowBench results are meant for reproducible evaluation, not absolute claims. The benchmark is early-stage and should be interpreted with context.