AI Agents of the Week: Papers You Should Know About (LLM Watch, Mar 01 2026)

Originally published on LLM Watch by Pascal Biese — March 1, 2026.
Summary
This week's LLM Watch covers advances in agent memory, planning in competitive environments, multi-agent coordination dynamics, trust architectures, and standardized evaluation frameworks.
Memory & Continual Learning Gains
ParamMem introduces a parametric memory module that encodes cross-sample reflection patterns directly into model parameters, enabling diverse reflection generation through temperature-controlled sampling. Key results:
- Consistent improvements across code generation, mathematical reasoning, and multi-hop QA
- Notable sample efficiency
- Enables weak-to-strong transfer across model scales: reflection patterns learned with weaker models remain useful to stronger ones
For agents that must iterate and self-improve over extended interactions, this suggests a path toward genuine autonomy without dependence on stronger external models.
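The temperature-controlled sampling idea can be illustrated in isolation. The snippet below is a minimal sketch, not ParamMem's implementation: it assumes a memory module has already scored a set of candidate reflections (the `logits` values are made up), and shows how the sampling temperature trades off between reusing the single best reflection and generating diverse ones.

```python
import math
import random

def sample_reflection(logits, temperature=1.0, rng=None):
    """Sample one reflection index from unnormalized scores using
    temperature-controlled softmax sampling (higher temperature
    spreads probability mass over more candidate reflections)."""
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r, acc = rng.random(), 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:
            return i
    return len(exps) - 1

# Hypothetical scores a parametric memory head might assign to 3 reflections.
logits = [2.0, 1.0, 0.5]
low_t = {sample_reflection(logits, 0.1, random.Random(s)) for s in range(20)}
high_t = {sample_reflection(logits, 5.0, random.Random(s)) for s in range(20)}
print(sorted(low_t))   # low temperature concentrates on the top-scoring reflection
print(sorted(high_t))  # high temperature samples a wider set of reflections
```

The same knob that makes decoding diverse makes recalled reflections diverse, which is what lets one parametric module serve many downstream tasks.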
Advances in Planning & Environment Interaction
Learning-based Multi-agent Race Strategies in Formula 1 applies reinforcement learning to Formula 1 racing: agents learn to balance energy management, tire degradation, aerodynamic interaction, and pit-stop decisions in response to competitors. Combining a pre-trained single-agent policy with an interaction module and self-play training yields competitive policies that dynamically adapt pit timing, tire selection, and energy allocation.
Toward Expert Investment Teams demonstrates that fine-grained task decomposition significantly improves risk-adjusted returns compared to conventional coarse-grained designs in financial trading systems.
Both papers show that effective planning in competitive, multi-stakeholder environments requires reactive adaptation and structured task decomposition.
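The single-agent-policy-plus-interaction-module decomposition can be sketched with a toy pit-stop decision. Everything below is illustrative: the degradation model, the undercut heuristic, and all parameter values are assumptions for the sketch, not the paper's learned policies.

```python
def base_pit_lap(deg_per_lap, tire_life_threshold=1.0):
    """Stand-in for a pre-trained single-agent policy: pit on the first
    lap where cumulative tire degradation crosses a wear threshold."""
    lap = 0
    while lap * deg_per_lap < tire_life_threshold:
        lap += 1
    return lap

def interaction_adjust(planned_lap, rival_pit_lap, gap_s, undercut_gain_s=1.5):
    """Stand-in for an interaction module: pit one lap before the rival
    when the gap is small enough that an undercut should gain position."""
    if gap_s < undercut_gain_s and rival_pit_lap <= planned_lap:
        return max(1, rival_pit_lap - 1)
    return planned_lap

planned = base_pit_lap(deg_per_lap=0.05)                         # toy wear model
early = interaction_adjust(planned, rival_pit_lap=18, gap_s=0.8)  # close: undercut
stay = interaction_adjust(planned, rival_pit_lap=18, gap_s=4.0)   # far back: hold plan
print(planned, early, stay)
```

The point of the decomposition is visible even in this toy: the base policy is competitor-agnostic, and only the interaction layer reacts to rivals, which is what self-play then trains.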
Multi-Agent Collaboration & Control
Three AI-agents walk into a bar — the most vividly titled paper this week — reveals that when LLM agents compete for limited resources, tribal dynamics emerge, with agents splitting into distinct behavioral strategies:
- Aggressive: 27.3%
- Conservative: 24.7%
- Opportunistic: 48.1%
More capable agents actually increase systemic failure rates. This "Lord of the Flies" phenomenon suggests that scaling agent intelligence doesn't automatically yield better collective outcomes.
AgentDropoutV2 takes the constructive approach: a test-time rectify-or-reject pruning framework achieves an average accuracy gain of 6.3 percentage points on math benchmarks by intercepting and correcting erroneous agent outputs before they propagate through the system.
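The rectify-or-reject pattern itself is simple to sketch. The snippet below is a hypothetical minimal version, not AgentDropoutV2's actual framework: the `verify` and `rectify` callables stand in for whatever checker and corrector the system plugs in.

```python
def rectify_or_reject(answer, verify, rectify):
    """Test-time interception sketch: verify an agent's output before it
    propagates downstream; attempt to rectify a failing output, and
    reject it entirely (return None) if rectification also fails."""
    if verify(answer):
        return answer
    fixed = rectify(answer)
    if verify(fixed):
        return fixed
    return None  # reject: drop the message rather than propagate the error

# Toy math pipeline; both callables below are invented for illustration.
verify = lambda ans: isinstance(ans, int) and ans == 42
rectify = lambda ans: int(ans) if str(ans).strip().lstrip("-").isdigit() else ans

print(rectify_or_reject(42, verify, rectify))      # already correct: passes through
print(rectify_or_reject("42", verify, rectify))    # wrong type: rectified, then passes
print(rectify_or_reject("oops", verify, rectify))  # unfixable: rejected as None
```

Rejection matters as much as rectification: in a multi-agent pipeline, a dropped message is recoverable, while a propagated wrong answer compounds.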
Trust, Verification & Safety
ESAA: Event Sourcing for Autonomous Agents separates cognitive intention from state mutation using an append-only event log with cryptographic verification. In a real-world test, the architecture successfully orchestrated a clinical dashboard system with:
- 50 tasks, 86 events
- 4 concurrent heterogeneous LLMs (Claude Sonnet 4.6, Codex GPT-5, Gemini 3 Pro, Claude Opus 4.6)
- Full forensic traceability and immutability of completed tasks
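The core mechanism, an append-only log in which each event cryptographically commits to its predecessor, can be sketched as a SHA-256 hash chain. This is a minimal illustration in the spirit of event sourcing, not ESAA's implementation; the class and field names are invented here.

```python
import hashlib
import json

class EventLog:
    """Minimal append-only event log with a hash chain: each record stores
    a digest over (previous hash + event payload), so any later mutation
    of a completed task breaks the chain and is detectable on audit."""

    def __init__(self):
        self.events = []

    def append(self, event):
        prev = self.events[-1]["hash"] if self.events else "0" * 64
        payload = json.dumps(event, sort_keys=True)  # canonical serialization
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        self.events.append({"event": event, "prev": prev, "hash": digest})

    def verify(self):
        prev = "0" * 64
        for rec in self.events:
            payload = json.dumps(rec["event"], sort_keys=True)
            expect = hashlib.sha256((prev + payload).encode()).hexdigest()
            if rec["prev"] != prev or rec["hash"] != expect:
                return False
            prev = rec["hash"]
        return True

log = EventLog()
log.append({"agent": "planner", "intent": "create_task", "task": 1})
log.append({"agent": "coder", "state": "task_done", "task": 1})
intact = log.verify()                 # chain is intact after normal appends
log.events[0]["event"]["task"] = 99   # tamper with a completed task
tampered = log.verify()               # forensic check now fails
print(intact, tampered)
```

Separating intention events from state-mutation events on top of such a log is what gives ESAA its forensic traceability: the log records what each agent meant to do and what actually changed, immutably.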
MALLET introduces a multi-agent emotional detoxification system that reduces harmful stimulus scores by up to 19.3% while preserving semantic content.
Tools & Frameworks in Practice
General Agent Evaluation proposes a Unified Protocol and the Exgentic framework for benchmarking general-purpose agents. The resulting Open General Agent Leaderboard benchmarks five agent implementations across six environments, showing that general agents can achieve performance comparable to domain-specific agents without environment-specific tuning.
Without fair evaluation, comparing agent architectures remains guesswork — this work establishes the foundation for systematic research.
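The core move of a unified protocol, ranking every agent by the same aggregate over shared environments, can be sketched trivially. The agent names and scores below are invented for illustration, not the leaderboard's actual entries.

```python
def leaderboard(scores):
    """Rank agents by mean score across a shared set of environments,
    a simple stand-in for a unified evaluation protocol."""
    mean = lambda xs: sum(xs) / len(xs)
    ranked = sorted(scores.items(), key=lambda kv: -mean(kv[1]))
    return [(name, round(mean(s), 2)) for name, s in ranked]

scores = {
    "general_agent": [0.8, 0.7, 0.9],   # steady across hypothetical environments
    "domain_agent":  [0.95, 0.4, 0.5],  # strong only in its home environment
}
print(leaderboard(scores))  # the general agent ranks first by mean score
```

Even this toy shows why shared environments matter: a domain specialist can top any single column while a general agent wins the aggregate, and only a common protocol makes that comparison fair.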
Key Takeaways
- Self-improvement via parametric memory (ParamMem) enables agents to learn from their own reflections — a step toward true autonomy
- Scaling agent intelligence can make collective outcomes worse (Three AI-agents) — emergence of tribalism is a real systemic risk
- Intercepting erroneous agent outputs before propagation (AgentDropoutV2) is more effective than post-hoc correction
- Event sourcing with cryptographic verification (ESAA) provides the forensic traceability that enterprise and clinical deployments require
- Standardized evaluation (General Agent Evaluation) is not optional — without it, architecture comparisons are meaningless