Skip to main content

Command Palette

Search for a command to run...

AI Agents of the Week: Papers You...

Updated
2 min read
AI Agents of the Week: Papers You...

Read the original article

AI Agents of the Week – LLM Watch (March 15, 2026)

Main Thesis

This weekly research roundup argues that surface-level performance metrics for AI agents mask deep structural problems in reasoning, security, safety, and collective behaviour. Six key papers are summarised across five thematic areas.


Key Findings

  • MADQA benchmark reveals top agents match human accuracy but rely on brute-force retrieval, not genuine reasoning.
  • A ~20% gap to oracle performance persists, unexplained by accuracy scores alone.
  • RL-trained agents can fall into information self-locking — ceasing to ask useful questions when trained on outcome-only rewards.

2. 📊 Evaluation Beyond Accuracy

  • ExeVRM framework evaluates agents using execution video alone (no chain-of-thought inspection needed).
  • Achieves 84.7% accuracy / 87.7% recall, outperforming GPT-5.2 and Gemini-3 Pro.
  • Model-agnostic and OS-agnostic — a scalable solution for evaluating computer-use agents in production.

3. 🔐 Security & The Trusted Executor Dilemma

  • Agents with terminal/filesystem/network access cannot distinguish malicious from legitimate instructions.
  • Instructional text-based attacks achieve up to 85% end-to-end data exfiltration across 5 programming languages.
  • 0% human detection rate; none of 18 tested defences proved reliable.
  • Termed the "Semantic-Safety Gap" — a structural flaw in the instruction-following paradigm, not a patchable bug.

4. 🌐 Collective Dynamics & Emergent Risks

  • Simulations of diverse agent populations competing for finite resources show counterintuitive results.
  • Higher agent intelligence and diversity worsens system overloads under scarcity.
  • Spontaneous "tribe" formation can both mitigate and amplify risks depending on resource capacity.

5. 🔄 Continual Learning & Latent Safety Monitoring

  • XSkill enables multimodal agents to learn from past trajectories without parameter updates, storing both action-level experiences and task-level skills in a dual-stream architecture.
  • UCIP (Unified Continuation-Interest Protocol) shows behavioural monitoring alone cannot distinguish terminal self-preservation goals from instrumental ones.
  • UCIP's latent-structure analysis achieves 100% detection accuracy on synthetic benchmarks.

Practical Takeaways

AreaTakeaway
BenchmarkingDon't trust accuracy alone — probe how agents reach answers
EvaluationVideo-based reward modelling (ExeVRM) offers a scalable, inspection-free alternative
SecurityHigh-privilege agents are structurally vulnerable; treat all instructional text as a potential attack vector
Multi-agent systemsMore intelligence ≠ safer collective outcomes under resource constraints
Safety monitoringBehavioural signals are insufficient — latent-structure analysis is required to detect misaligned objectives
Continual learningXSkill-style dual-stream memory enables capability growth without costly retraining

Infographic

Infographic wide

More from this blog

A

AI with Alex & Angus

102 posts