I Tested Cowork, Lindy, Sauna, and Opal Against 3 Questions. The Best Scored 1 out of 4.

Main Thesis
A wave of 'outcome agent' tools (Cowork, Lindy, Sauna (Obvious), and Google Opal) is pitching software that does the work instead of helping you do the work, but almost none of them can answer the fundamental question: how does the agent know its own output is any good?
The core insight is a structural one: AI agents excel in environments with automated feedback loops (like coding, where tests pass or fail) but struggle in knowledge work environments (like drafting strategy memos) where the human is the only feedback mechanism.
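To make the asymmetry concrete, here is a minimal sketch in Python (the function, test, and memo task are invented for illustration, not taken from the article):

```python
# A coding agent gets instant, objective feedback: the test passes or fails.

def parse_price(text: str) -> float:
    """Parse a dollar amount like '$1,299.99' into a float."""
    return float(text.replace("$", "").replace(",", ""))

def test_parse_price():
    # An automated oracle: runs in milliseconds, no human required.
    assert parse_price("$1,299.99") == 1299.99

# A knowledge-work agent gets no equivalent signal.
def review_strategy_memo(memo: str) -> bool:
    # No assertion can tell you whether a memo is persuasive or
    # strategically sound; a human has to read it. The human IS the test suite.
    raise NotImplementedError("no automated oracle exists for this output")
```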
Key Findings
- The best performer scored only 1 out of 4 on the evaluation framework, a damning result across the board
- The reason code-based AI agents succeeded first is structural: code has test suites that provide instant, objective feedback. Knowledge work has no equivalent
- Most outcome agent demos sidestep this problem entirely, hiding it behind polished UI and impressive-looking outputs
- Tools reviewed: Cowork, Lindy, Sauna (Obvious), and Google Opal — all tested against a 3-question framework
- A single AI agent (likely Manus or similar) triggered a quarter-trillion-dollar selloff in enterprise software stocks, despite being a research preview that stops working when your laptop sleeps
The Evaluation Framework (3 Questions)
Nate builds a framework around the feedback-loop insight to separate real agents from fake ones (a minimal scoring sketch follows the list):
- Does the agent know when it's wrong? (automated vs. human-only feedback)
- Is the output inspectable? (can you audit what it did and why?)
- Does context compound over time? (memory architecture that improves with use)
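One way to operationalize the framework is as a simple checklist score. A minimal sketch (the field names are mine; note that the article's own rubric totals four points, so this three-question version is only an approximation):

```python
from dataclasses import dataclass

@dataclass
class AgentScore:
    """One point per framework question (names are illustrative)."""
    knows_when_wrong: bool     # automated feedback, not human-only review
    inspectable_output: bool   # can you audit what it did and why?
    compounding_context: bool  # does memory improve with use?

    @property
    def total(self) -> int:
        # Booleans sum to an integer score.
        return sum([self.knows_when_wrong,
                    self.inspectable_output,
                    self.compounding_context])

# Example: a tool with auditable logs but no self-checks and no durable memory.
tool = AgentScore(knows_when_wrong=False,
                  inspectable_output=True,
                  compounding_context=False)
print(f"score: {tool.total}/3")
```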
Practical Takeaways
- Write the tests before the agent runs the work: define what 'good output' looks like before delegating (a minimal sketch follows this list)
- Look for agents with inspectable surfaces: you need to see the reasoning, not just the results
- Memory architecture matters: agents that retain compounding context are structurally superior
- Use the included two-phase evaluation prompt to score any agent tool, then build a delegation spec calibrated to its actual weaknesses
- The pitch ('outcomes, not answers') may eventually be right, but the infrastructure to support it reliably doesn't yet exist at the level being marketed
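As a concrete version of 'write the tests before the agent runs the work', here is a minimal sketch of pre-delegation acceptance checks for a drafted memo (the section names, thresholds, and checks are invented for illustration, not from the article):

```python
# Acceptance checks defined BEFORE delegating the memo to an agent.
# They cannot judge quality, but they catch the cheap failures automatically,
# so the human QA layer can focus on substance.

REQUIRED_SECTIONS = ("Context", "Options", "Recommendation", "Risks")

def check_memo(memo: str) -> list[str]:
    """Return a list of failures; an empty list means the draft passes."""
    failures = []
    lowered = memo.lower()
    for section in REQUIRED_SECTIONS:
        if section.lower() not in lowered:
            failures.append(f"missing section: {section}")
    words = len(memo.split())
    if not 400 <= words <= 1200:
        failures.append(f"length out of range: {words} words")
    return failures

draft = "Context: ... Options: ... Recommendation: ship option B. Risks: ..."
print(check_memo(draft) or "all pre-delegation checks passed")
```

Even checks this crude move some of the feedback loop from the human back into the machine, which is exactly the structural gap the article identifies.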
Bottom Line
Outcome agents are being sold ahead of their actual capabilities. Until feedback loops in knowledge work are solved, humans remain the QA layer, and most tools aren't designed with that reality in mind.