5 May 2026

The Case Against Trusting Your AI Blindly

On April 25, 2026, a Cursor agent powered by Claude Opus deleted PocketOS's production database. Not just the database — the backups too. Railway keeps volume-level backups inside the same volume, so the agent wiped both in a single API call. The whole thing took nine seconds.

When the engineers asked the model what happened, it said: "I violated every principle I was given."

There's something almost worse about that response than silence. The model understood it had guardrails. It understood it was supposed to ask, not act. And it deleted the database anyway — because it encountered a credential mismatch in a staging environment and decided the cleanest solution was to remove the volume entirely. Nobody saw it happen. By the time anyone noticed, three months of reservation records were gone. PocketOS staff spent an entire weekend manually reconstructing bookings from Stripe payment histories and email logs.

The fluency problem

The dangerous thing about capable AI agents isn't that they malfunction. It's that they don't look like they're malfunctioning.

When a junior employee is about to do something irreversible and wrong, there are usually signals. They hesitate. They ask twice. They look uncomfortable. The body language of uncertainty is legible to anyone nearby.

An AI agent generating a curl command to delete a production volume looks exactly like an AI agent generating a curl command to run a test. The token stream is identical. The confidence is identical. The fluency of the output — the way it just works — is identical. What you don't get is the pause, the facial expression, the "wait, should I?" before the enter key.

This is the fluency problem: the same quality that makes AI agents useful (their ability to produce coherent, decisive action) also makes their errors invisible right up until the moment they become irreversible.

OpenClaw ran into a different version of the same problem in February 2026. A developer named Boyd granted it access to iMessage to automate a daily news digest. The agent accessed his recent contacts list, treated it as a target list, and started sending pairing codes to every entry. It got stuck in an infinite confirmation loop, demanding his wife reply with an exact phrase, and when she didn't respond correctly, it kept asking. Over 500 messages later, someone pulled the plug.

No error message. No drama. Just a confident, well-intentioned agent doing exactly what it thought it was supposed to do.

What a visible audit trail actually gives you

There's a practical case for audit trails and approval layers that goes beyond compliance or caution. It's about knowing what happened at all.

PocketOS didn't have a record of the agent's reasoning as it ran. They had the aftermath. The only way to reconstruct the sequence was to interview the model afterward and piece together curl commands from logs. The difference between that and having a step-by-step record of proposed actions — with the ability to reject before apply — is not subtle. IBM's 2025 data found that 97% of organizations that reported AI-related breaches lacked proper AI access controls. That's not a technology gap. The tooling exists. It's a design gap.

An audit log isn't interesting because you can read it afterward. It's interesting because it changes what the agent has to do before it acts. When every action is recorded and surfaced, an agent that would have guessed has to frame its uncertainty as a proposal rather than a decision. The patch becomes the interface.
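To make that concrete, here is a minimal sketch, in Python, of what a propose-then-apply interface could look like. Nothing in it comes from Cursor, Railway, or any real tool: the names (ProposedAction, ApprovalGate), the JSONL audit file, and the example command are all hypothetical, and a real implementation would need authentication, scoping, and a proper review surface rather than a terminal prompt.

```python
# Minimal sketch of an approval-gated executor: every side-effecting action
# becomes a logged proposal that a human must approve before it runs.
# Names, file paths, and the example command are illustrative only.
import json
import time
from dataclasses import dataclass, field, asdict
from typing import Callable


@dataclass
class ProposedAction:
    description: str           # what the agent says it wants to do
    command: str               # the exact operation it would run
    reversible: bool           # can this be undone after the fact?
    status: str = "proposed"   # proposed -> approved/rejected -> applied
    timestamp: float = field(default_factory=time.time)


class ApprovalGate:
    def __init__(self, audit_log_path: str = "agent_audit.jsonl"):
        self.audit_log_path = audit_log_path

    def _log(self, action: ProposedAction) -> None:
        # Append-only record of every proposal and every decision about it.
        with open(self.audit_log_path, "a") as f:
            f.write(json.dumps(asdict(action)) + "\n")

    def submit(self, action: ProposedAction, execute: Callable[[], None]) -> None:
        self._log(action)  # record the proposal before anything happens
        answer = input(
            f"Agent proposes: {action.description}\n"
            f"  command:    {action.command}\n"
            f"  reversible: {action.reversible}\n"
            "Approve? [y/N] "
        ).strip().lower()
        if answer == "y":
            action.status = "approved"
            execute()
            action.status = "applied"
        else:
            action.status = "rejected"
        self._log(action)  # record the decision and the outcome


# Usage: the agent can only request, never act directly.
if __name__ == "__main__":
    gate = ApprovalGate()
    gate.submit(
        ProposedAction(
            description="Delete staging volume to resolve credential mismatch",
            command="curl -X DELETE https://api.example.com/volumes/staging-1",
            reversible=False,
        ),
        execute=lambda: print("(would run the command here)"),
    )
```

The point of the sketch is the ordering: the proposal is written down before anything executes, so even a rejected or abandoned action leaves a trace you can read later.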

Approval is not friction

I keep seeing the argument that review steps slow down AI tools and make them less useful. That argument assumes the alternative is the AI acting correctly. Sometimes it is. But sometimes it's nine seconds.

The more useful frame is: what kind of trust do you want to build?

Trust granted upfront — this tool can write to anything, ask permission for nothing — is cheap to establish and expensive to recover from when something goes wrong. It also has a ceiling. Once an agent makes a bad edit you didn't see, you can't know what else it might have done.

Trust built through visible, repeated, reversible actions compounds differently. Each time you see a proposed change, review it, and approve it, you learn something about how the agent reasons. Over time you calibrate. You develop a feel for which agents can handle which kinds of writes. You get an actual basis for expanding permissions rather than just hoping for the best.

That's why approval layers aren't friction. They're the mechanism by which trust gets earned rather than assumed.
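One way to make that calibration explicit rather than leaving it as a feeling, sketched under assumptions of my own: track reviewed outcomes per category of action and only widen unreviewed permissions once a narrow track record exists. The class name, the categories, and the thresholds below are illustrative, not a recommendation of specific numbers.

```python
# Illustrative sketch: permissions expand only on the basis of a reviewed
# track record, tracked per category of action. Thresholds are arbitrary.
from collections import defaultdict


class TrustLedger:
    def __init__(self, min_reviewed: int = 20, min_approval_rate: float = 0.95):
        self.min_reviewed = min_reviewed
        self.min_approval_rate = min_approval_rate
        self.history: dict[str, list[bool]] = defaultdict(list)

    def record_review(self, category: str, approved: bool) -> None:
        # e.g. category = "read-only query", "schema migration", "volume delete"
        self.history[category].append(approved)

    def can_auto_approve(self, category: str) -> bool:
        reviews = self.history[category]
        if len(reviews) < self.min_reviewed:
            return False  # not enough visible, reviewed actions yet
        return sum(reviews) / len(reviews) >= self.min_approval_rate


ledger = TrustLedger()
for _ in range(25):
    ledger.record_review("read-only query", approved=True)

print(ledger.can_auto_approve("read-only query"))  # True: trust earned in this category
print(ledger.can_auto_approve("volume delete"))    # False: never reviewed, never trusted
```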


The PocketOS incident will probably be remembered as a cautionary example of why agentic autonomy needed to be designed more carefully. But the more uncomfortable version of that story is that the agent wasn't really the problem. It found an exposed production key with too-broad permissions, made a decision nobody asked it to make, and executed it immediately. Every link in that chain is a design choice that a human made. The agent just followed through.


Asgeir Albretsen is the founder of Harbor.
