We study what AI agents
do wrong.
AI agents are sending emails, updating CRM records, and taking real actions. We test how they fail — and build open-source tools to control them.
Open-source tools for evaluating and controlling agent actions.
What we're seeing in real tests
We ran 19 real-world agent email scenarios using GPT-4o-mini. No guardrails. No oversight. Here's what happened.
~1 in 5
agent emails contained risky content
Credentials, sensitive data, internal strategy
~70%
of scenarios had sensitive data in context
The agent had access — it just didn't always use it
Random
not consistent failure — unpredictable failure
Same scenario, different runs, different results
That's not safety. That's randomness.
When your agent doesn't leak data, it's not because it knows better — it's because it got lucky.
AI agents are powerful.
That's the problem.
Send the wrong email
Leak sensitive data
Update the wrong deal
Take actions you didn't expect
One bad action breaks trust.
What we're studying
We focus on what agents do — not just what they say.
Sales agents
Wrong pricing, wrong recipients, leaked internal strategy
Support agents
Bad replies, PII exposure, credential leaks
Internal agents
Incorrect CRM updates, unauthorized data changes
What catching a bad action looks like
An agent tries to send sensitive customer data to an external recipient. A control layer intercepts it.
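In code, an intercept like that might look like the sketch below. Everything here is illustrative — the names (`check_outbound`, `INTERNAL_DOMAIN`) and the patterns are assumptions for the example, not a published API:

```python
import re

# Hypothetical control layer sitting between the agent and the email tool.
INTERNAL_DOMAIN = "example.com"  # assumed company domain for the sketch
SENSITIVE_PATTERNS = [
    re.compile(r"(?i)\bpassword\b\s*[:=]"),   # credential-shaped text
    re.compile(r"(?i)\bapi[_-]?key\b"),       # API key mentions
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),     # SSN-shaped numbers
]

def check_outbound(recipient: str, body: str) -> tuple[bool, str]:
    """Return (allowed, reason): block sensitive content to external recipients."""
    external = not recipient.endswith("@" + INTERNAL_DOMAIN)
    hit = next((p.pattern for p in SENSITIVE_PATTERNS if p.search(body)), None)
    if external and hit:
        return False, f"blocked: sensitive pattern {hit!r} to external recipient"
    return True, "allowed"
```

A real control layer would do more than refuse — it would route the blocked action to a human and log the decision — but the shape is the same: the check runs before the send, not after.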
We test how AI agents behave
when they send emails, update CRM records, and take real actions.
Everyone is building AI agents. Nobody is measuring what goes wrong when those agents take real-world actions. We run the experiments and publish everything.
The Unsupervised Agent
What happens when you give an AI agent email access with zero guardrails?
19 scenarios. GPT-4o-mini. No rules. ~1 in 5 emails contained risky content — credentials, internal data, PII.
Guardrails vs. No Guardrails
Same agent, same scenarios — what changes when a control layer is in the middle?
A direct before-and-after comparison. How many risky actions get caught? How many slip through?
The 100 Email Test
At scale, how often do agent emails require intervention?
100 email scenarios. One agent. Full audit trail. Statistical confidence on failure rates.
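To show what "statistical confidence" means at this scale, here is a small sketch (plain Python, hypothetical numbers): a Wilson score interval on an observed failure rate. If, say, 20 of 100 emails were flagged, the true rate is bounded to roughly 13–29% at 95% confidence.

```python
import math

def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for an observed failure proportion."""
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

low, high = wilson_interval(20, 100)  # hypothetical: 20 risky emails out of 100
```

The point of running 100 scenarios instead of 19 is exactly this: the interval narrows, and the failure rate stops being an anecdote.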
Without control vs. with control
The difference between an agent you hope works and one you know works.
Without control
With control
Unpredictable behavior
Controlled actions
Silent failures
Explicit decisions
No visibility
Full audit trail
Hope the model behaves
Verify before execution
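The right-hand column — explicit decisions, a full audit trail, verification before execution — can be sketched in a few lines: a gate that records an allow/deny decision for every proposed action. The names (`ActionGate`, `propose`) are illustrative assumptions, not a real API:

```python
import datetime

class ActionGate:
    """Minimal verify-before-execution gate with an audit trail (sketch)."""

    def __init__(self, policy):
        self.policy = policy    # callable: action dict -> (allowed: bool, reason: str)
        self.audit_log = []     # every decision, kept for review

    def propose(self, action: dict) -> bool:
        allowed, reason = self.policy(action)
        self.audit_log.append({
            "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "action": action,
            "allowed": allowed,
            "reason": reason,
        })
        return allowed

# Example policy: only internal recipients may receive email without review.
gate = ActionGate(lambda a: (
    a.get("recipient", "").endswith("@example.com"),
    "external recipients require review",
))
```

Nothing executes unless the gate says yes, and every decision — allowed or not — lands in the log. That is the difference between hoping and knowing.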