Top AI Papers of the Week (March 9–15, 2026)

Elvis Saravia's AI Newsletter rounds up ten significant AI research papers spanning coding agents, LLM efficiency, reinforcement learning, memory systems, and hardware optimisation. Here's a detailed breakdown:

1. OpenDev — Terminal-Native Coding Agents

OpenDev is an open-source, command-line coding agent with an 81-page technical report covering scaffolding, harness design, and context engineering.

Key findings:

Dual-agent architecture separates planning from execution using workload-specialised model routing
Adaptive context compaction uses lazy tool discovery to prevent context bloat and reasoning degradation
Automated project memory retains critical project context across sessions without manual intervention
Four-layer architecture spans reasoning, context engineering, tooling, and persistence layers

Takeaway: Terminal-first AI agents can be made production-ready through modular design and intelligent context management rather than brute-force compute.

Paper

2. AutoHarness — Automatic Code Harness Synthesis for LLM Agents

Google DeepMind researchers tackle a striking problem: in the Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were due to illegal moves, not poor strategy.

Key findings:

AutoHarness uses Gemini-2.5-Flash to automatically synthesise a code harness through iterative refinement with environment feedback
The harness acts as a programmatic constraint layer, completely preventing illegal moves across 145 TextArena games
Smaller models beat larger ones: Gemini-2.5-Flash with the harness outperforms Gemini-2.5-Pro and GPT-5.2-High on 16 single-player games
The approach is more cost-effective than scaling to a larger model

Takeaway: Structured code constraints can substitute for raw model capability, reframing agent improvement as harness engineering rather than model scaling.

Paper

3. SkillNet — Open Infrastructure for Reusable AI Skills

SkillNet addresses the problem of AI agents repeatedly rediscovering solutions instead of systematically reusing learned knowledge.

Key findings:

Unified skill ontology structures skills from heterogeneous sources (code libraries, prompt templates, tool compositions) with rich relational connections
Multi-dimensional evaluation across five dimensions: Safety, Completeness, Executability, Maintainability, and Cost-awareness
200,000+ skill repository with interactive browsing, management platform, and Python toolkit
Experiments on ALFWorld, WebShop, and ScienceWorld show 40% improvement in average rewards and 30% reduction in execution steps

Takeaway: Building a structured, quality-gated skill library at scale dramatically outperforms agents that start from scratch on each task.

Paper

4. The Spike, the Sparse and the Sink — Transformer Phenomena Dissected

Yann LeCun and NYU collaborators analyse two recurring Transformer phenomena: massive activations (extreme outliers in specific channels) and attention sinks (tokens attracting disproportionate attention regardless of semantic relevance).

Key findings:

Massive activations operate globally, inducing near-constant hidden representations acting as implicit model parameters
Attention sinks operate locally, biasing individual attention heads toward short-range dependencies
Pre-norm configuration is identified as the key architectural element enabling their co-occurrence
Removing pre-norm causes the two phenomena to decouple entirely
Their co-occurrence is a design-dependent artifact, not a fundamental requirement for performance

Takeaway: Understanding these phenomena is critical for model compression, quantisation, and KV-cache optimisation — many efficiency techniques fail silently by disrupting them.

Paper

5. KARL — Reinforcement Learning for Enterprise Search Agents

Databricks presents KARL, achieving state-of-the-art performance on hard-to-verify agentic search tasks, alongside the new KARLBench evaluation framework spanning six search domains.

Key findings:

OAPL (Off-policy Agentic Policy Learning): iterative large-batch off-policy RL robust to discrepancies between trainer and inference engine without clipped importance weighting heuristics
Multi-task heterogeneous training across constraint-driven entity search, cross-document synthesis, tabular reasoning, entity retrieval, procedural reasoning, and fact aggregation
Pareto-optimal on KARLBench vs. Claude 4.6 and GPT 5.2 across cost-quality and latency-quality tradeoffs
KARL-BCP achieves 59.6 on BrowseComp-Plus, improving to 70.4 with value-guided search; KARL-TREC reaches 85.0 on TREC-Biogen

Takeaway: Multi-task RL training with off-policy robustness produces search agents that outperform closed frontier models given sufficient test-time compute.

Paper

6. Memex(RL) — Indexed Experience Memory for Long-Horizon Tasks

Memex(RL) solves the problem of LLM agents losing track of learned information as tasks grow longer and more complex.

Key findings:

Indexed experience memory: compact working context with structured summaries and stable indices, with full-fidelity interactions stored in an external database
The agent learns what to summarise, what to archive, how to index, and when to retrieve through RL optimisation
RL-optimised memory operations with reward shaping tailored to indexed memory usage under a context budget
Theoretical guarantee: bounded retrieval complexity maintains decision quality as task history grows
Empirically achieves better task success rates with significantly smaller working context than baselines

Takeaway: Intelligent indexed memory management outperforms brute-force context expansion — less context used more strategically wins.

Paper

7. FlashAttention-4 — Co-Designed for Blackwell GPUs

FlashAttention-4 co-designs algorithms and kernel pipelines specifically for NVIDIA B200 and GB200 GPUs, where tensor core throughput doubles while other units scale more slowly.

Key findings:

Achieves up to 1.3x speedup over cuDNN 9.13 and 2.7x over Triton on B200 GPUs in BF16, reaching 1613 TFLOPs/s at 71% hardware utilisation
Techniques: fully asynchronous matrix multiply pipelines, larger tile sizes, software-emulated exponential and conditional softmax rescaling, tensor memory to reduce shared memory traffic
Python-native CuTe-DSL implementation achieves 20-30x faster compile times than C++ template approaches
Demonstrates that Blackwell GPUs require fundamentally new kernel designs, not incremental optimisation of Hopper-era techniques

Takeaway: Hardware-algorithm co-design is essential for next-generation GPUs — existing attention kernels leave significant performance on the table.

Paper

8. STRUCTUREDAGENT — Hierarchical Planning for Web Tasks

STRUCTUREDAGENT introduces hierarchical planning for long-horizon web tasks using dynamic AND/OR trees, separating tree construction from LLM invocation (used only for local node expansion or repair).

Key findings:

Dynamic AND/OR tree structure enables interpretable hierarchical plans for easier debugging and human intervention
A structured memory module tracks candidate solutions to improve constraint satisfaction
Shows improved performance over standard LLM-based web agents on WebVoyager, WebArena, and custom shopping benchmarks

Takeaway: Separating planning structure from LLM calls reduces hallucination risk and improves interpretability in complex web navigation tasks.

Paper

9. AgentIR — Reasoning-Aware Retrieval for Deep Research Agents

AgentIR addresses the fact that deep research agents generate rich reasoning traces before search calls, but existing retrievers completely ignore these signals.

Key findings:

Reasoning-aware retrieval jointly embeds the agent's reasoning trace alongside its query
DR-Synth synthesises training data from standard QA datasets
AgentIR-4B achieves 68% accuracy on BrowseComp-Plus with Tongyi-DeepResearch vs. 50% with conventional embedding models twice its size and 37% with BM25

Takeaway: Leveraging reasoning traces as retrieval signals provides dramatic accuracy gains over query-only retrieval, even with much smaller models.

Paper

10. Think Harder or Know More — Adaptive Looping vs. Memory Banks

This paper investigates transformers with both adaptive per-layer looping (learned halting mechanism) and gated memory banks (additional learned storage).

Key findings:

Looping primarily benefits mathematical reasoning; memory banks help recover commonsense task performance
Combining both outperforms an iso-FLOP baseline with 3x the number of layers on math benchmarks
Layer specialisation analysis: early layers loop minimally and access memory sparingly; later layers do both more heavily

Takeaway: Compute reuse (looping) and knowledge storage (memory banks) address complementary cognitive needs — combining them is more efficient than simply adding more layers.

Paper

Overall Themes This Week

Theme	Papers
Agentic coding & task execution	OpenDev, AutoHarness, STRUCTUREDAGENT
Memory & context management	Memex(RL), SkillNet, AgentIR
Training & RL for agents	KARL, Think Harder or Know More
Transformer architecture analysis	The Spike the Sparse and the Sink
Hardware optimisation	FlashAttention-4

The week's dominant narrative: AI agents are maturing beyond raw LLM capability — structured constraints, intelligent memory, skill reuse, and hardware-aware design are emerging as the key levers for production-grade AI systems.

Infographic