Top AI Papers of the Week

Top AI Papers of the Week (February 23 – March 1, 2026)

Elvis Saravia's weekly AI newsletter rounds up ten notable research papers spanning reasoning efficiency, agent infrastructure, algorithm discovery, personalization, context compression, and code generation. Here's a detailed breakdown:

1. Deep-Thinking Tokens

Google researchers argue that longer outputs don't equal better reasoning. They introduce deep-thinking tokens — tokens where internal model predictions shift significantly across layers before stabilising — measured via Jensen-Shannon divergence between intermediate and final layer distributions. A token qualifies as "deep-thinking" if its prediction only stabilises in the final 15% of layers.

Raw token count negatively correlates with accuracy (r = -0.59), while the deep-thinking ratio shows a strong positive correlation (r = 0.683)
Think@n: a test-time scaling strategy that prioritises high deep-thinking ratio samples, matching self-consistency performance while cutting inference costs ~50%
Validated on AIME 24/25, HMMT 25, GPQA-diamond with GPT-OSS, DeepSeek-R1, Qwen3
Takeaway: Focus on computational depth, not token volume

Paper

2. Codified Context

Single-file AGENTS.md manifests don't scale to large codebases. This paper presents a three-component infrastructure built during development of a 108,000-line C# distributed system, evaluated across 283 development sessions:

Hot-memory constitution: A living document encoding conventions and orchestration protocols consulted at session start
Domain-expert agents: 19 specialised agents, each owning a bounded codebase domain
Cold-memory knowledge base: 34 on-demand specification documents retrieved only when needed
Result: Prevents agents from forgetting conventions, repeating mistakes, and losing coherence across long-running projects

Paper

3. Discovering Multi-Agent Learning Algorithms with LLMs

Google DeepMind uses AlphaEvolve — an evolutionary coding agent powered by LLMs — to automatically discover new multi-agent learning algorithms for imperfect-information games.

VAD-CFR: A novel iterative regret minimisation variant with volatility-sensitive discounting and consistency-enforced optimism — outperforms Discounted Predictive CFR+
SHOR-PSRO: A population-based training variant blending Optimistic Regret Matching with temperature-controlled strategy distributions
AlphaEvolve generates, evaluates, and iteratively refines algorithm candidates
Takeaway: LLMs can act as algorithmic designers, not just code generators; approach could extend to optimisation, scheduling, and resource allocation

Paper

4. Evaluating AGENTS.md

A direct empirical evaluation of whether AGENTS.md files actually improve AI coding agent performance. Four agents tested: Claude Code (Sonnet-4.5), Codex (GPT-5.2 & GPT-5.1 mini), Qwen Code (Qwen3-30b-coder).

Human-written AGENTS.md: modest +4% improvement in some cases
LLM-generated AGENTS.md: -2% performance drop
Both consistently increase inference cost by 20%+
Context files cause broader exploration but worse outcomes — additional context introduces noise
Takeaway: Keep AGENTS.md minimal and focused on critical constraints only; information density beats comprehensiveness

Paper

5. PAHF (Personalized Agents from Human Feedback)

Meta introduces PAHF, a continual agent personalisation framework coupling explicit per-user memory with proactive and reactive feedback mechanisms.

Three-step loop: (1) pre-action clarification, (2) preference-grounded action, (3) post-action memory update
Enables agents to accumulate and revise user preference profiles without retraining
New benchmarks in embodied manipulation and online shopping measuring preference learning and adaptation to preference shifts
Substantially faster learning and outperforms no-memory and single-channel baselines

Paper

6. Doc-to-LoRA

Sakana AI introduces Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to compress long documents into LoRA adapters in a single forward pass.

Eliminates repeated expensive attention over long contexts; subsequent queries use only adapter weights
Achieves near-perfect zero-shot accuracy on needle-in-a-haystack tasks at 4x beyond the target LLM's native context window
Outperforms standard long-context approaches on QA datasets with less memory
Best for: Applications requiring repeated queries over the same document (customer support, legal analysis, codebase understanding)

Paper

7. AgentConductor

AgentConductor is a RL-enhanced multi-agent code generation system that dynamically generates interaction topologies based on task complexity rather than using fixed communication patterns.

LLM-based orchestrator constructs density-aware layered DAG topologies adapted to problem difficulty
Simple problems → sparse topologies; complex problems → denser collaboration
Results: Up to 14.6% improvement in pass@1 accuracy, 13% density reduction, 68% token cost reduction vs. strongest baseline
Execution feedback refines topologies when initial solutions fail

Paper

8. ActionEngine

Georgia Tech and Microsoft Research introduce ActionEngine, a training-free framework transforming GUI agents from step-by-step executors into programmatic planners.

Builds a state-machine memory through offline exploration
Synthesises executable Python programs for task completion
Achieves 95% success on Reddit tasks from WebArena
Average of a single LLM call per task; 11.8x cost reduction, 2x latency reduction vs. vision-only baselines

Paper

9. CoT Faithfulness via REMUL

REMUL is a training approach making chain-of-thought reasoning more faithful and monitorable using a speaker-listener RL framework.

A speaker model generates reasoning traces; multiple listener models attempt to follow and complete them
RL rewards reasoning understandable to other models
Tested on BIG-Bench Extra Hard, MuSR, ZebraLogicBench, FOLIO
Improves three faithfulness metrics while boosting accuracy; produces shorter, more direct reasoning chains

Paper

10. Learning to Rewrite Tool Descriptions

Intuit AI Research introduces Trace-Free+, a curriculum learning framework that optimises tool descriptions for LLM agents (not humans) without relying on execution traces.

Delivers consistent gains on unseen tools
Strong cross-domain generalisation
Robust as candidate tools scale beyond 100
Takeaway: Improving tool interfaces is a practical complement to agent fine-tuning

Paper

Key Themes This Week

Efficiency over verbosity: Deep-thinking tokens and AgentConductor both show that targeted computation beats brute-force scaling
Context is a double-edged sword: AGENTS.md evaluation and Codified Context both highlight that more context isn't always better — structure and density matter
LLMs as meta-designers: AlphaEvolve demonstrates LLMs discovering algorithms that humans hadn't considered
Personalisation at scale: PAHF and Doc-to-LoRA both address how to make AI systems adapt to individual users and documents without prohibitive retraining costs

Infographic