
AI Agents of the Week: Papers You Should Know About (LLM Watch, Mar 01 2026)


Originally published on LLM Watch by Pascal Biese — March 1, 2026.


Summary

This week's LLM Watch covers advances in agent memory, planning in competitive environments, multi-agent coordination dynamics, trust architectures, and standardized evaluation frameworks.


Memory & Continual Learning Gains

ParamMem introduces a parametric memory module that encodes cross-sample reflection patterns directly into model parameters, enabling diverse reflection generation through temperature-controlled sampling. Key results:

  • Consistent improvements across code generation, mathematical reasoning, and multi-hop QA
  • Notable sample efficiency
  • Enables weak-to-strong transfer across model scales — meaning smaller models can benefit from larger models' learned reflection patterns

For agents that must iterate and self-improve over extended interactions, this suggests a path toward genuine autonomy without dependence on stronger external models.
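The "temperature-controlled sampling" that ParamMem uses to generate diverse reflections is a standard decoding knob. As a minimal, illustrative sketch (not the paper's actual decoding code): low temperature makes draws near-greedy, high temperature spreads probability mass and yields more varied reflections.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Draw an index from softmax(logits / temperature).
    Low temperature -> near-argmax; high temperature -> more diverse samples.
    Illustrative only; ParamMem's actual decoding details are in the paper."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1
```

At temperature near zero the highest-logit index dominates; raising the temperature is what lets a single parametric memory module emit many distinct reflections for the same prompt.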


Advances in Planning & Environment Interaction

Learning-based Multi-agent Race Strategies in Formula 1 applies RL to F1 racing — agents learn to balance energy management, tire degradation, aerodynamic interaction, and pit-stop decisions in response to competitors. A pre-trained single-agent policy combined with an interaction module and self-play training generates competitive policies that dynamically adapt pit timing, tire selection, and energy allocation.

Toward Expert Investment Teams demonstrates that fine-grained task decomposition significantly improves risk-adjusted returns compared to conventional coarse-grained designs in financial trading systems.

Both papers show that effective planning in competitive, multi-stakeholder environments requires reactive adaptation and structured task decomposition.


Multi-Agent Collaboration & Control

Three AI-agents walk into a bar — the most vividly titled paper this week — reveals that when LLM agents compete for limited resources, they self-sort into distinct behavioral tribes:

  • Aggressive: 27.3%
  • Conservative: 24.7%
  • Opportunistic: 48.1%

More capable agents actually increase systemic failure rates. This "Lord of the Flies" phenomenon suggests that scaling agent intelligence doesn't automatically yield better collective outcomes.

AgentDropoutV2 takes the constructive approach: a test-time rectify-or-reject pruning framework achieves an average accuracy gain of 6.3 percentage points on math benchmarks by intercepting and correcting erroneous agent outputs before they propagate through the system.
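The "rectify-or-reject" idea can be sketched as a simple test-time gate: verify each agent output, attempt a bounded number of corrections, and drop the output entirely if it still fails, so the error never reaches downstream agents. This is a hypothetical minimal sketch in the spirit of AgentDropoutV2; the names and checks are illustrative, not the paper's API.

```python
def rectify_or_reject(output, verify, rectify, max_attempts=2):
    """Test-time gate for agent outputs (illustrative sketch).

    verify:  callable returning True if the output is acceptable.
    rectify: callable that attempts to repair a failing output.
    Returns the (possibly rectified) output, or None to reject it
    so the error does not propagate through the agent pipeline."""
    for _ in range(max_attempts):
        if verify(output):
            return output
        output = rectify(output)
    return output if verify(output) else None
```

For example, with `verify=str.isdigit` and `rectify=str.strip`, the call `rectify_or_reject(" 42 ", str.isdigit, str.strip)` repairs the output to `"42"`, while an unrecoverable output like `"abc"` is rejected as `None`.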


Trust, Verification & Safety

ESAA: Event Sourcing for Autonomous Agents separates cognitive intention from state mutation using an append-only event log with cryptographic verification. In a real-world test, the architecture successfully orchestrated a clinical dashboard system with:

  • 50 tasks, 86 events
  • 4 concurrent heterogeneous LLMs (Claude Sonnet 4.6, Codex GPT-5, Gemini 3 Pro, Claude Opus 4.6)
  • Full forensic traceability and immutability of completed tasks
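The core event-sourcing pattern behind these guarantees is a hash-chained append-only log: each event commits to the hash of its predecessor, so any retroactive edit breaks verification of the whole chain. A minimal sketch of the idea (assuming SHA-256 chaining; this is not ESAA's actual implementation):

```python
import hashlib
import json

class EventLog:
    """Append-only, hash-chained event log (illustrative sketch).
    Each entry stores its payload, the previous entry's hash, and its own
    hash over (prev_hash + payload), giving tamper-evident traceability."""

    def __init__(self):
        self.events = []

    def append(self, event):
        prev_hash = self.events[-1]["hash"] if self.events else "0" * 64
        payload = json.dumps(event, sort_keys=True)  # canonical serialization
        digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        self.events.append({"payload": payload, "prev": prev_hash, "hash": digest})
        return digest

    def verify(self):
        """Recompute the chain from genesis; any mutation breaks it."""
        prev_hash = "0" * 64
        for e in self.events:
            expected = hashlib.sha256((prev_hash + e["payload"]).encode()).hexdigest()
            if e["prev"] != prev_hash or e["hash"] != expected:
                return False
            prev_hash = e["hash"]
        return True
```

Because every hash depends on all prior events, a completed task's record cannot be silently rewritten — the forensic property ESAA reports for its 86-event clinical run.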

MALLET introduces a multi-agent emotional detoxification system that reduces harmful stimulus scores by up to 19.3% while preserving the original semantic content.


Tools & Frameworks in Practice

General Agent Evaluation proposes a Unified Protocol and the Exgentic framework for benchmarking general-purpose agents. The resulting Open General Agent Leaderboard benchmarks five agent implementations across six environments, showing that general agents can achieve performance comparable to domain-specific agents without environment-specific tuning.

Without fair evaluation, comparing agent architectures remains guesswork — this work establishes the foundation for systematic research.


Key Takeaways

  1. Self-improvement via parametric memory (ParamMem) enables agents to learn from their own reflections — a step toward true autonomy
  2. Scaling agent intelligence can make collective outcomes worse (Three AI-agents) — emergence of tribalism is a real systemic risk
  3. Intercepting erroneous agent outputs before propagation (AgentDropoutV2) is more effective than post-hoc correction
  4. Event sourcing with cryptographic verification (ESAA) provides the forensic traceability that enterprise and clinical deployments require
  5. Standardized evaluation (General Agent Evaluation) is not optional — without it, architecture comparisons are meaningless

