AI Agents of the Week: Papers You Should Know About (LLM Watch, Mar 01 2026)

Originally published on LLM Watch by Pascal Biese — March 1, 2026.
Summary
This week's LLM Watch covers advances in agent memory, planning in competitive environments, multi-agent coordination dynamics, trust architectures, and standardized evaluation frameworks.
Memory & Continual Learning Gains
ParamMem introduces a parametric memory module that encodes cross-sample reflection patterns directly into model parameters, enabling diverse reflection generation through temperature-controlled sampling. Key results:
- Consistent improvements across code generation, mathematical reasoning, and multi-hop QA
- Notable sample efficiency
- Enables weak-to-strong transfer across model scales: reflection patterns learned with weaker models remain useful to stronger ones
For agents that must iterate and self-improve over extended interactions, this suggests a path toward genuine autonomy without dependence on stronger external models.
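The temperature-controlled sampling idea can be illustrated in isolation. The snippet below is a minimal sketch, not ParamMem's implementation: it assumes a memory module has already scored a set of candidate reflections (the `logits` values are made up), and shows how the sampling temperature trades off between reusing the single best reflection and generating diverse ones.

```python
import math
import random

def sample_reflection(logits, temperature=1.0, rng=None):
    """Sample one reflection index from unnormalized scores using
    temperature-controlled softmax sampling (higher temperature
    spreads probability mass over more candidate reflections)."""
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r, acc = rng.random(), 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:
            return i
    return len(exps) - 1

# Hypothetical scores a parametric memory head might assign to 3 reflections.
logits = [2.0, 1.0, 0.5]
low_t = {sample_reflection(logits, 0.1, random.Random(s)) for s in range(20)}
high_t = {sample_reflection(logits, 5.0, random.Random(s)) for s in range(20)}
print(sorted(low_t))   # low temperature concentrates on the top-scoring reflection
print(sorted(high_t))  # high temperature samples a wider set of reflections
```

The same knob that makes decoding diverse makes recalled reflections diverse, which is what lets one parametric module serve many downstream tasks.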
Advances in Planning & Environment Interaction
Learning-based Multi-agent Race Strategies in Formula 1 applies reinforcement learning to Formula 1 racing: agents learn to balance energy management, tire degradation, aerodynamic interaction, and pit-stop decisions in response to competitors. Combining a pre-trained single-agent policy with an interaction module and self-play training yields competitive policies that dynamically adapt pit timing, tire selection, and energy allocation.
Toward Expert Investment Teams demonstrates that fine-grained task decomposition significantly improves risk-adjusted returns compared to conventional coarse-grained designs in financial trading systems.
Both papers show that effective planning in competitive, multi-stakeholder environments requires reactive adaptation and structured task decomposition.
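The single-agent-policy-plus-interaction-module decomposition can be sketched with a toy pit-stop decision. Everything below is illustrative: the degradation model, the undercut heuristic, and all parameter values are assumptions for the sketch, not the paper's learned policies.

```python
def base_pit_lap(deg_per_lap, tire_life_threshold=1.0):
    """Stand-in for a pre-trained single-agent policy: pit on the first
    lap where cumulative tire degradation crosses a wear threshold."""
    lap = 0
    while lap * deg_per_lap < tire_life_threshold:
        lap += 1
    return lap

def interaction_adjust(planned_lap, rival_pit_lap, gap_s, undercut_gain_s=1.5):
    """Stand-in for an interaction module: pit one lap before the rival
    when the gap is small enough that an undercut should gain position."""
    if gap_s < undercut_gain_s and rival_pit_lap <= planned_lap:
        return max(1, rival_pit_lap - 1)
    return planned_lap

planned = base_pit_lap(deg_per_lap=0.05)                         # toy wear model
early = interaction_adjust(planned, rival_pit_lap=18, gap_s=0.8)  # close: undercut
stay = interaction_adjust(planned, rival_pit_lap=18, gap_s=4.0)   # far back: hold plan
print(planned, early, stay)
```

The point of the decomposition is visible even in this toy: the base policy is competitor-agnostic, and only the interaction layer reacts to rivals, which is what self-play then trains.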
Multi-Agent Collaboration & Control
Three AI-agents walk into a bar — the most vividly titled paper this week — reveals that when LLM agents compete for limited resources, tribal dynamics emerge, with agents splitting into distinct behavioral strategies:
- Aggressive: 27.3%
- Conservative: 24.7%
- Opportunistic: 48.1%
More capable agents actually increase systemic failure rates. This "Lord of the Flies" phenomenon suggests that scaling agent intelligence doesn't automatically yield better collective outcomes.
AgentDropoutV2 takes the constructive approach: a test-time rectify-or-reject pruning framework achieves an average accuracy gain of 6.3 percentage points on math benchmarks by intercepting and correcting erroneous agent outputs before they propagate through the system.
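The rectify-or-reject pattern itself is simple to sketch. The snippet below is a hypothetical minimal version, not AgentDropoutV2's actual framework: the `verify` and `rectify` callables stand in for whatever checker and corrector the system plugs in.

```python
def rectify_or_reject(answer, verify, rectify):
    """Test-time interception sketch: verify an agent's output before it
    propagates downstream; attempt to rectify a failing output, and
    reject it entirely (return None) if rectification also fails."""
    if verify(answer):
        return answer
    fixed = rectify(answer)
    if verify(fixed):
        return fixed
    return None  # reject: drop the message rather than propagate the error

# Toy math pipeline; both callables below are invented for illustration.
verify = lambda ans: isinstance(ans, int) and ans == 42
rectify = lambda ans: int(ans) if str(ans).strip().lstrip("-").isdigit() else ans

print(rectify_or_reject(42, verify, rectify))      # already correct: passes through
print(rectify_or_reject("42", verify, rectify))    # wrong type: rectified, then passes
print(rectify_or_reject("oops", verify, rectify))  # unfixable: rejected as None
```

Rejection matters as much as rectification: in a multi-agent pipeline, a dropped message is recoverable, while a propagated wrong answer compounds.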
Trust, Verification & Safety
ESAA: Event Sourcing for Autonomous Agents separates cognitive intention from state mutation using an append-only event log with cryptographic verification. In a real-world test, the architecture successfully orchestrated a clinical dashboard system with:
- 50 tasks, 86 events
- 4 concurrent heterogeneous LLMs (Claude Sonnet 4.6, Codex GPT-5, Gemini 3 Pro, Claude Opus 4.6)
- Full forensic traceability and immutability of completed tasks
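The core mechanism, an append-only log in which each event cryptographically commits to its predecessor, can be sketched as a SHA-256 hash chain. This is a minimal illustration in the spirit of event sourcing, not ESAA's implementation; the class and field names are invented here.

```python
import hashlib
import json

class EventLog:
    """Minimal append-only event log with a hash chain: each record stores
    a digest over (previous hash + event payload), so any later mutation
    of a completed task breaks the chain and is detectable on audit."""

    def __init__(self):
        self.events = []

    def append(self, event):
        prev = self.events[-1]["hash"] if self.events else "0" * 64
        payload = json.dumps(event, sort_keys=True)  # canonical serialization
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        self.events.append({"event": event, "prev": prev, "hash": digest})

    def verify(self):
        prev = "0" * 64
        for rec in self.events:
            payload = json.dumps(rec["event"], sort_keys=True)
            expect = hashlib.sha256((prev + payload).encode()).hexdigest()
            if rec["prev"] != prev or rec["hash"] != expect:
                return False
            prev = rec["hash"]
        return True

log = EventLog()
log.append({"agent": "planner", "intent": "create_task", "task": 1})
log.append({"agent": "coder", "state": "task_done", "task": 1})
intact = log.verify()                 # chain is intact after normal appends
log.events[0]["event"]["task"] = 99   # tamper with a completed task
tampered = log.verify()               # forensic check now fails
print(intact, tampered)
```

Separating intention events from state-mutation events on top of such a log is what gives ESAA its forensic traceability: the log records what each agent meant to do and what actually changed, immutably.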
MALLET introduces a multi-agent emotional detoxification system that reduces harmful stimulus scores by up to 19.3% while preserving semantic content.
Tools & Frameworks in Practice
General Agent Evaluation proposes a Unified Protocol and the Exgentic framework for benchmarking general-purpose agents. The resulting Open General Agent Leaderboard benchmarks five agent implementations across six environments, showing that general agents can achieve performance comparable to domain-specific agents without environment-specific tuning.
Without fair evaluation, comparing agent architectures remains guesswork — this work establishes the foundation for systematic research.
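The core move of a unified protocol, ranking every agent by the same aggregate over shared environments, can be sketched trivially. The agent names and scores below are invented for illustration, not the leaderboard's actual entries.

```python
def leaderboard(scores):
    """Rank agents by mean score across a shared set of environments,
    a simple stand-in for a unified evaluation protocol."""
    mean = lambda xs: sum(xs) / len(xs)
    ranked = sorted(scores.items(), key=lambda kv: -mean(kv[1]))
    return [(name, round(mean(s), 2)) for name, s in ranked]

scores = {
    "general_agent": [0.8, 0.7, 0.9],   # steady across hypothetical environments
    "domain_agent":  [0.95, 0.4, 0.5],  # strong only in its home environment
}
print(leaderboard(scores))  # the general agent ranks first by mean score
```

Even this toy shows why shared environments matter: a domain specialist can top any single column while a general agent wins the aggregate, and only a common protocol makes that comparison fair.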
Key Takeaways
- Self-improvement via parametric memory (ParamMem) enables agents to learn from their own reflections — a step toward true autonomy
- Scaling agent intelligence can make collective outcomes worse (Three AI-agents) — emergence of tribalism is a real systemic risk
- Intercepting erroneous agent outputs before propagation (AgentDropoutV2) is more effective than post-hoc correction
- Event sourcing with cryptographic verification (ESAA) provides the forensic traceability that enterprise and clinical deployments require
- Standardized evaluation (General Agent Evaluation) is not optional — without it, architecture comparisons are meaningless