Top AI Papers of the Week (February 23 – March 1, 2026)

Elvis Saravia's weekly AI newsletter rounds up ten notable research papers spanning reasoning efficiency, agent infrastructure, algorithm discovery, personalization, context compression, and code generation. Here's a detailed breakdown:


1. Deep-Thinking Tokens

Google researchers argue that longer outputs don't equal better reasoning. They introduce deep-thinking tokens — tokens where internal model predictions shift significantly across layers before stabilising — measured via Jensen-Shannon divergence between intermediate and final layer distributions. A token qualifies as "deep-thinking" if its prediction only stabilises in the final 15% of layers.

  • Raw token count negatively correlates with accuracy (r = -0.59), while the deep-thinking ratio shows a strong positive correlation (r = 0.683)
  • Think@n: a test-time scaling strategy that prioritises high deep-thinking ratio samples, matching self-consistency performance while cutting inference costs ~50%
  • Validated on AIME 24/25, HMMT 25, GPQA-diamond with GPT-OSS, DeepSeek-R1, Qwen3
  • Takeaway: Focus on computational depth, not token volume
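The detection criterion above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the JS-divergence threshold, the helper names, and the per-layer distribution format are all assumptions.

```python
import math

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability distributions."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def is_deep_thinking(layer_dists, threshold=0.1, tail_frac=0.15):
    """A token counts as 'deep-thinking' if its prediction only stabilises
    (JS divergence to the final layer drops below `threshold`) within the
    last `tail_frac` of layers. `layer_dists` is one distribution per layer."""
    final = layer_dists[-1]
    divs = [js_divergence(d, final) for d in layer_dists[:-1]]
    stable_from = len(divs)
    # Walk backwards to find where divergence last exceeded the threshold.
    for i in range(len(divs) - 1, -1, -1):
        if divs[i] >= threshold:
            break
        stable_from = i
    return stable_from >= math.ceil((1 - tail_frac) * len(divs))
```

A token whose distribution matches the final layer throughout would not qualify; one that flips only in the last couple of layers would.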

Paper


2. Codified Context

Single-file AGENTS.md manifests don't scale to large codebases. This paper presents a three-component infrastructure built during development of a 108,000-line C# distributed system, evaluated across 283 development sessions:

  • Hot-memory constitution: A living document encoding conventions and orchestration protocols consulted at session start
  • Domain-expert agents: 19 specialised agents, each owning a bounded codebase domain
  • Cold-memory knowledge base: 34 on-demand specification documents retrieved only when needed
  • Result: Prevents agents from forgetting conventions, repeating mistakes, and losing coherence across long-running projects
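The three tiers can be pictured as a small context-assembly layer. This sketch is illustrative only; the class and field names are invented, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class SessionContext:
    """Toy model of the three-tier memory described above."""
    constitution: str                                    # hot memory: always loaded
    domain_agents: dict = field(default_factory=dict)    # domain -> expert agent prompt
    cold_specs: dict = field(default_factory=dict)       # topic -> on-demand spec doc

    def start_session(self, domain: str) -> str:
        """Assemble the session-start prompt: the living constitution
        plus the instructions of the domain expert that owns this area."""
        return self.constitution + "\n\n" + self.domain_agents.get(domain, "")

    def fetch_spec(self, topic: str) -> str:
        """Cold memory is retrieved only when a task actually needs it,
        keeping the default context small."""
        return self.cold_specs.get(topic, "")
```

The design point is the split: conventions live in one always-loaded document, while the 34 specification documents stay out of context until requested.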

Paper


3. Discovering Multi-Agent Learning Algorithms with LLMs

Google DeepMind uses AlphaEvolve — an evolutionary coding agent powered by LLMs — to automatically discover new multi-agent learning algorithms for imperfect-information games.

  • VAD-CFR: A novel iterative regret minimisation variant with volatility-sensitive discounting and consistency-enforced optimism — outperforms Discounted Predictive CFR+
  • SHOR-PSRO: A population-based training variant blending Optimistic Regret Matching with temperature-controlled strategy distributions
  • AlphaEvolve generates, evaluates, and iteratively refines algorithm candidates
  • Takeaway: LLMs can act as algorithmic designers, not just code generators; approach could extend to optimisation, scheduling, and resource allocation
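The generate-evaluate-refine loop can be reduced to a skeleton. In the real system, `propose` would be an LLM mutating candidate algorithm code and `evaluate` would measure game-theoretic quality (e.g. exploitability); both are caller-supplied stand-ins here, and the loop itself is a simplification of AlphaEvolve.

```python
def evolve(seed_candidate, propose, evaluate, generations=20, pop_size=8):
    """Skeletal evolutionary loop: keep the fittest candidate, ask the
    proposer (an LLM in AlphaEvolve's case) for variations, and retain
    the best-scoring population for the next round."""
    population = [seed_candidate]
    for _ in range(generations):
        parent = max(population, key=evaluate)
        children = [propose(parent) for _ in range(pop_size)]
        # Survival of the fittest: truncate to the top pop_size candidates.
        population = sorted(population + children, key=evaluate, reverse=True)[:pop_size]
    return max(population, key=evaluate)
```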

Paper


4. Evaluating AGENTS.md

A direct empirical evaluation of whether AGENTS.md files actually improve AI coding agent performance. Four agent configurations were tested: Claude Code (Sonnet-4.5), Codex (with GPT-5.2 and with GPT-5.1 mini), and Qwen Code (Qwen3-30b-coder).

  • Human-written AGENTS.md: modest +4% improvement in some cases
  • LLM-generated AGENTS.md: -2% performance drop
  • Both consistently increase inference cost by 20%+
  • Context files cause broader exploration but worse outcomes — additional context introduces noise
  • Takeaway: Keep AGENTS.md minimal and focused on critical constraints only; information density beats comprehensiveness
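In that spirit, a minimal AGENTS.md might look like the following. This is an illustrative example consistent with the takeaway above, not a file from the study.

```markdown
# AGENTS.md

## Hard constraints
- Never commit directly to `main`; open a PR.
- All new code requires a passing unit test.
- Do not modify files under `vendor/`.

## Build & test
- Build: `make build`
- Test: `make test`
```

Everything else (style preferences, architecture tours, tool inventories) is the kind of "comprehensive" context the study found adds cost and noise without improving outcomes.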

Paper


5. PAHF (Personalized Agents from Human Feedback)

Meta introduces PAHF, a continual agent personalisation framework coupling explicit per-user memory with proactive and reactive feedback mechanisms.

  • Three-step loop: (1) pre-action clarification, (2) preference-grounded action, (3) post-action memory update
  • Enables agents to accumulate and revise user preference profiles without retraining
  • New benchmarks in embodied manipulation and online shopping measuring preference learning and adaptation to preference shifts
  • Substantially faster learning and outperforms no-memory and single-channel baselines
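The three-step loop can be sketched as a single function. The `agent` and `user` interfaces below are assumptions for illustration; the paper's actual mechanisms (and memory representation) may differ.

```python
def personalised_step(task, memory, agent, user):
    """One pass of the loop: (1) pre-action clarification,
    (2) preference-grounded action, (3) post-action memory update."""
    # 1. Proactively ask about a preference the memory doesn't cover yet.
    question = agent.clarify(task, memory)
    if question is not None:
        memory[question] = user.answer(question)
    # 2. Act, conditioning on the accumulated preference profile.
    result = agent.act(task, memory)
    # 3. Fold reactive feedback back into explicit per-user memory,
    #    so no retraining is needed for the profile to evolve.
    feedback = user.feedback(result)
    if feedback is not None:
        memory.update(feedback)
    return result, memory
```

The explicit, inspectable `memory` dict is the key contrast with implicit personalisation baked into model weights.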

Paper


6. Doc-to-LoRA

Sakana AI introduces Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to compress long documents into LoRA adapters in a single forward pass.

  • Eliminates repeated expensive attention over long contexts; subsequent queries use only adapter weights
  • Achieves near-perfect zero-shot accuracy on needle-in-a-haystack tasks at 4x beyond the target LLM's native context window
  • Outperforms standard long-context approaches on QA datasets with less memory
  • Best for: Applications requiring repeated queries over the same document (customer support, legal analysis, codebase understanding)
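The intended usage pattern can be made concrete with a schematic class. The API names here are invented for illustration; only the cost structure (encode once, query many times via adapter weights) reflects the paper.

```python
class DocToLoRA:
    """Schematic of the D2L usage pattern: compress a document into
    adapter weights once, then answer repeated queries without ever
    re-attending over the full document text."""

    def __init__(self, hypernetwork, base_model):
        self.hypernet = hypernetwork   # doc -> adapter weights, one forward pass
        self.base = base_model

    def compress(self, document: str):
        """Pay the long-document cost exactly once."""
        return self.hypernet(document)

    def query(self, adapter, question: str) -> str:
        """Subsequent prompts contain only the question; document
        knowledge lives in the adapter applied to the base model."""
        return self.base.generate(question, adapter=adapter)
```

For a support or legal workload with hundreds of queries per document, the amortised saving over re-sending a long context each time is the selling point.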

Paper


7. AgentConductor

AgentConductor is an RL-enhanced multi-agent code generation system that dynamically generates interaction topologies based on task complexity rather than using fixed communication patterns.

  • LLM-based orchestrator constructs density-aware layered DAG topologies adapted to problem difficulty
  • Simple problems → sparse topologies; complex problems → denser collaboration
  • Results: Up to 14.6% improvement in pass@1 accuracy, 13% density reduction, 68% token cost reduction vs. strongest baseline
  • Execution feedback refines topologies when initial solutions fail
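The density-aware idea can be illustrated with a toy topology builder: harder tasks get more inter-agent edges. The heuristic below is invented for illustration; the paper uses a learned, LLM-based orchestrator, not a fixed formula.

```python
def build_topology(num_agents: int, difficulty: float):
    """Toy density-aware layered DAG. Agents are ordered 0..n-1 and
    edges only point forward, so the result is always acyclic. As
    `difficulty` rises from 0 to 1, each agent reaches further ahead,
    densifying collaboration from a sparse chain to a full forward DAG."""
    assert 0.0 <= difficulty <= 1.0
    max_skip = 1 + round(difficulty * (num_agents - 2))
    return [(i, j)
            for i in range(num_agents)
            for j in range(i + 1, min(i + 1 + max_skip, num_agents))]
```

At difficulty 0 this yields a simple pipeline; at difficulty 1, every agent hears from every predecessor.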

Paper


8. ActionEngine

Georgia Tech and Microsoft Research introduce ActionEngine, a training-free framework transforming GUI agents from step-by-step executors into programmatic planners.

  • Builds a state-machine memory through offline exploration
  • Synthesises executable Python programs for task completion
  • Achieves 95% success on Reddit tasks from WebArena
  • Averages a single LLM call per task; 11.8x cost reduction and 2x latency reduction vs. vision-only baselines
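Once a state-machine memory exists, task completion becomes planning over it rather than per-step LLM calls. The sketch below uses BFS as a stand-in for the planner; the representation (state → {action: next_state}) and the function name are assumptions, not the paper's.

```python
from collections import deque

def synthesise_plan(state_machine, start, goal):
    """Search the offline-built state machine for an action sequence
    from `start` to `goal`. The resulting list can be emitted as an
    executable program, replacing step-by-step LLM decisions."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, actions = frontier.popleft()
        if state == goal:
            return actions
        for action, nxt in state_machine.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [action]))
    return None  # goal unreachable from the explored state machine
```

This is why the LLM budget drops to roughly one call per task: the model only needs to pick the goal, not navigate every screen.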

Paper


9. CoT Faithfulness via REMUL

REMUL is a training approach that makes chain-of-thought reasoning more faithful and monitorable, using a speaker-listener RL framework.

  • A speaker model generates reasoning traces; multiple listener models attempt to follow and complete them
  • RL rewards reasoning understandable to other models
  • Tested on BIG-Bench Extra Hard, MuSR, ZebraLogicBench, FOLIO
  • Improves three faithfulness metrics while boosting accuracy; produces shorter, more direct reasoning chains
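The core reward signal is easy to state: a trace is good to the extent that other models can solve the task from it. The function below is a minimal sketch; the listener interface and the exact reward shaping are assumptions.

```python
def listener_reward(trace, listeners, task):
    """Reward a speaker's reasoning trace by the fraction of independent
    listener models that reach the correct answer when given the trace.
    Opaque or unfaithful traces earn low reward even if the speaker
    itself would have answered correctly."""
    if not listeners:
        return 0.0
    correct = sum(1 for lis in listeners
                  if lis.solve(task, trace) == task.answer)
    return correct / len(listeners)
```

Using multiple listeners guards against the speaker overfitting to one model's quirks, which is what pushes traces toward genuinely legible reasoning.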

Paper


10. Learning to Rewrite Tool Descriptions

Intuit AI Research introduces Trace-Free+, a curriculum learning framework that optimises tool descriptions for LLM agents (not humans) without relying on execution traces.

  • Delivers consistent gains on unseen tools
  • Strong cross-domain generalisation
  • Robust as candidate tools scale beyond 100
  • Takeaway: Improving tool interfaces is a practical complement to agent fine-tuning

Paper


Key Themes This Week

  • Efficiency over verbosity: Deep-thinking tokens and AgentConductor both show that targeted computation beats brute-force scaling
  • Context is a double-edged sword: AGENTS.md evaluation and Codified Context both highlight that more context isn't always better — structure and density matter
  • LLMs as meta-designers: AlphaEvolve demonstrates LLMs discovering algorithms that humans hadn't considered
  • Personalisation at scale: PAHF and Doc-to-LoRA both address how to make AI systems adapt to individual users and documents without prohibitive retraining costs
