
AI Agents of the Week – LLM Watch (March 1, 2026)

Main Thesis

This weekly roundup from LLM Watch surveys the most important recent research papers on AI agents, covering memory, planning, multi-agent coordination, safety, and evaluation frameworks.


Key Findings by Category

🧠 Memory & Continual Learning

  • ParamMem encodes cross-sample reflection patterns directly into model parameters, enabling agents to improve themselves via temperature-controlled sampling.
  • Shows gains in code generation, math reasoning, and multi-hop QA.
  • Enables weak-to-strong transfer across model scales – agents can self-improve without relying on stronger external models.
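The paper's mechanism operates at the parameter level; the sketch below is only a hypothetical caricature of the loop it describes – sample candidates at a controlled temperature, keep the best-scoring reflections, and fold them back into the model state. The function names, the toy `generate` stand-in, and the update rule are all assumptions, not ParamMem's actual API.

```python
import random

def generate(model_state, prompt, temperature):
    # Toy stand-in for LLM sampling: output quality tracks the model's
    # current "skill", with temperature-controlled variation around it.
    return model_state["skill"] + random.gauss(0, temperature)

def self_improve(model_state, prompts, temperature=0.8, k=8, rounds=3):
    # Hypothetical ParamMem-style loop: for each prompt, draw k candidates
    # at a fixed temperature, keep the best-scoring one as a "reflection",
    # then fold the average reflection back into the parameters.
    for _ in range(rounds):
        reflections = [
            max(generate(model_state, p, temperature) for _ in range(k))
            for p in prompts
        ]
        # Toy parameter update: nudge skill toward the reflection average.
        model_state["skill"] = (0.9 * model_state["skill"]
                                + 0.1 * sum(reflections) / len(reflections))
    return model_state
```

The point of the sketch is the self-contained feedback loop: no stronger external model is consulted – the agent's own best samples are the training signal.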

📋 Planning & Environment Interaction

  • Formula 1 RL agents learn to balance tire wear, energy, aerodynamics, and pit-stop timing using self-play and pre-trained single-agent policies.
  • "Toward Expert Investment Teams" shows fine-grained task decomposition significantly improves risk-adjusted returns in financial trading multi-agent systems.
  • Key insight: competitive, multi-stakeholder environments demand both reactive adaptation and structured decomposition.

๐Ÿค Multi-Agent Collaboration & Control

  • "Three AI-agents walk into a bar" finds that when LLM agents compete for limited resources, three behavioral archetypes emerge: Aggressive (27.3%), Conservative (24.7%), and Opportunistic (48.1%).
  • Counterintuitively, more capable agents increase systemic failure rates โ€” dubbed the "Lord of the Flies" phenomenon.
  • AgentDropoutV2 offers a remedy: a test-time pruning framework that intercepts and corrects erroneous agent outputs, achieving a +6.3 percentage point accuracy gain on math benchmarks.
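The intercept-and-correct idea can be caricatured as a filter-then-vote step. This is a hedged sketch, not AgentDropoutV2's actual algorithm: `verifier` stands in for whatever erroneous-output check the framework applies, and the majority-vote fallback is an assumption.

```python
from collections import Counter

def prune_then_vote(agent_outputs, verifier):
    # Drop outputs the verifier flags as erroneous (the "dropout" step),
    # then aggregate the survivors by majority vote.
    survivors = [out for out in agent_outputs if verifier(out)]
    if not survivors:
        survivors = agent_outputs  # everything pruned: fall back to raw pool
    return Counter(survivors).most_common(1)[0][0]
```

For example, `prune_then_vote(["42", "42", "not sure", "41"], str.isdigit)` discards the non-numeric output before voting and returns `"42"`.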

🔒 Trust, Verification & Safety

  • ESAA (Event Sourcing for Autonomous Agents) separates cognitive intention from state mutation using an append-only, cryptographically verified event log.
  • Successfully orchestrated a clinical dashboard with 50 tasks, 86 events, and 4 concurrent LLMs (Claude Sonnet 4.6, Codex GPT-5, Gemini 3 Pro, Claude Opus 4.6).
  • MALLET is a multi-agent emotional detoxification system reducing harmful stimulus scores by up to 19.3% while preserving semantic meaning.
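An append-only, cryptographically verified event log of the kind ESAA describes can be illustrated with a simple hash chain: each event records the hash of its predecessor, so any later mutation breaks verification. A minimal sketch assuming SHA-256 over JSON-serialized events; the field names (`agent`, `intent`, `payload`) are illustrative, not ESAA's schema.

```python
import hashlib
import json

class EventLog:
    # Sketch of an append-only, hash-chained event log. Each event's hash
    # covers its content plus the previous event's hash, so tampering with
    # any historical event invalidates the whole chain from that point on.
    def __init__(self):
        self.events = []

    def append(self, agent, intent, payload):
        prev_hash = self.events[-1]["hash"] if self.events else "genesis"
        body = {"agent": agent, "intent": intent,
                "payload": payload, "prev": prev_hash}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        body["hash"] = digest
        self.events.append(body)
        return digest

    def verify(self):
        prev = "genesis"
        for event in self.events:
            body = {k: v for k, v in event.items() if k != "hash"}
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if body["prev"] != prev or recomputed != event["hash"]:
                return False
            prev = event["hash"]
        return True
```

Separating "cognitive intention" from state mutation then amounts to logging an intent event before the corresponding mutation event, with both covered by the chain.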

๐Ÿ› ๏ธ Tools & Frameworks

  • General Agent Evaluation introduces a Unified Protocol and the Exgentic framework, producing the Open General Agent Leaderboard.
  • Benchmarks 5 agent implementations across 6 environments, showing general agents can match domain-specific agents without environment-specific tuning.
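Cross-environment results like these ultimately reduce to an aggregation rule. The sketch below assumes mean score across environments as the ranking key – an assumption for illustration, not the leaderboard's published protocol.

```python
def build_leaderboard(results):
    # results: {agent_name: [score per environment]}. Rank agents by their
    # mean score across all environments (assumed scoring rule).
    def mean(scores):
        return sum(scores) / len(scores)
    return sorted(
        ((name, round(mean(scores), 3)) for name, scores in results.items()),
        key=lambda row: row[1],
        reverse=True,
    )
```

Ranking by a single mean is the simplest choice; a real protocol might normalize per environment first so that one hard environment does not dominate the ordering.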

Practical Takeaways

  1. Self-improvement without stronger models is becoming viable – ParamMem points toward truly autonomous agent iteration.
  2. Scaling agent intelligence ≠ better collective outcomes – coordination and safety mechanisms are essential as agents grow more capable.
  3. Architectural rigor matters – cryptographic event logging (ESAA) is a practical step toward auditable, trustworthy agent systems in high-stakes domains like healthcare.
  4. Evaluation standards are maturing – the Open General Agent Leaderboard fills a critical gap, making cross-architecture comparisons meaningful.
  5. Task decomposition granularity directly impacts performance – especially in financial and competitive multi-agent settings.
