

Top AI Papers of the Week (March 9–15, 2026)

From Elvis Saravia's AI Newsletter — 10 papers spanning coding agents, attention mechanisms, reinforcement learning, and GPU kernel design.


1. OpenDev — Terminal-Native Coding Agents

OpenDev is an open-source, command-line coding agent built for where developers already live: the terminal. It comes with an 81-page technical report covering scaffolding, harness design, and context engineering.

Key features:

  • Dual-agent architecture — separates planning from execution using workload-specialised model routing across concurrent sessions
  • Adaptive context compaction — lazy tool discovery and adaptive reduction of older observations keep working memory lean
  • Automated project memory — event-driven reminders prevent instruction fade-out across sessions
  • Four-layer architecture — agent reasoning, context engineering, tooling, and persistence layers form a modular, extensible foundation
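The adaptive compaction idea can be sketched in a few lines: once the rendered context exceeds a budget, older tool observations are swapped for short summaries while recent turns stay verbatim. This is an illustrative sketch, not OpenDev's implementation; the `Turn` structure, character budget, and `keep_recent` window are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "tool" observations are the compaction candidates
    content: str   # full-fidelity text
    summary: str   # short stand-in used once the turn ages out

def compact_context(history: list[Turn], budget: int, keep_recent: int = 4) -> list[str]:
    """Replace older tool observations with their summaries, oldest first,
    until the rendered context fits the character budget (or no candidates
    remain); the most recent turns are always kept verbatim."""
    rendered = [t.content for t in history]
    candidates = history[:-keep_recent] if keep_recent else history
    for i, turn in enumerate(candidates):
        if sum(len(r) for r in rendered) <= budget:
            break
        if turn.role == "tool":
            rendered[i] = turn.summary
    return rendered
```

A real agent would summarise with a model call; here the summaries are precomputed to keep the sketch self-contained.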

Takeaway: A production-grade blueprint for building autonomous coding agents with disciplined context management.

Paper


2. AutoHarness — Programmatic Constraints Beat Bigger Models

Google DeepMind researchers found that 78% of Gemini-2.5-Flash losses in the Kaggle GameArena chess competition came from illegal moves, not poor strategy. AutoHarness automatically synthesises code harnesses to prevent illegal actions.

Key findings:

  • Automatic harness synthesis — Gemini-2.5-Flash generates its own constraint layer through iterative refinement with environment feedback
  • Smaller beats larger — the harnessed Gemini-2.5-Flash outperforms Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena single-player games
  • 100% illegal move prevention — across 145 TextArena games (single and two-player)
  • Cost-effective — harness engineering is cheaper and more effective than deploying larger models
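The core harness pattern is simple: validate each proposed action against the legal set, retry with corrective feedback, and fall back deterministically so illegal actions can never reach the environment. A minimal sketch, assuming `propose` stands in for a model call and `legal_moves` is supplied by the environment; AutoHarness synthesises this layer automatically rather than hand-writing it.

```python
def harness(propose, legal_moves, max_retries: int = 3):
    """Constrain an action-proposing function to a legal action set.
    On an illegal proposal, retry with corrective feedback; after
    max_retries, fall back to a deterministic legal move, so illegal
    actions are prevented 100% of the time."""
    feedback = ""
    for _ in range(max_retries):
        move = propose(feedback)
        if move in legal_moves:
            return move
        feedback = f"Illegal move {move!r}; choose one of {sorted(legal_moves)}"
    return sorted(legal_moves)[0]  # guaranteed-legal fallback
```

The fallback is what turns "mostly legal" into "always legal" — the property the paper credits for most of the recovered losses.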

Takeaway: Structured code constraints are a powerful, cost-efficient alternative to raw model scaling for agent reliability.

Paper


3. SkillNet — Durable AI Skill Repositories at Scale

AI agents constantly rediscover solutions instead of reusing prior work. SkillNet provides open infrastructure for creating, evaluating, and organising AI skills at scale.

Key features:

  • Unified skill ontology — skills from code libraries, prompt templates, and tool compositions are linked relationally for discovery and composition
  • Multi-dimensional evaluation — every skill is scored on Safety, Completeness, Executability, Maintainability, and Cost-awareness
  • 200,000+ skill repository — with a browsable platform and Python toolkit for programmatic access
  • Consistent gains — on ALFWorld, WebShop, and ScienceWorld: +40% average reward, −30% execution steps
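A skill entry in such a repository might look like the sketch below: a record scored on the five stated axes, with a composite used for ranking at retrieval time. The unweighted mean is an assumption — the paper's actual aggregation is not described here.

```python
from dataclasses import dataclass

AXES = ("safety", "completeness", "executability", "maintainability", "cost_awareness")

@dataclass
class Skill:
    name: str
    scores: dict  # axis name -> score in [0, 1]

    def composite(self) -> float:
        # Unweighted mean over the five evaluation axes (illustrative).
        return sum(self.scores[a] for a in AXES) / len(AXES)

def top_skills(skills: list, k: int = 1) -> list:
    """Rank candidate skills by composite score for reuse."""
    return sorted(skills, key=lambda s: s.composite(), reverse=True)[:k]
```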

Takeaway: A shared skill commons dramatically improves agent efficiency and generalisation across task domains.

Paper


4. The Spike, the Sparse and the Sink — Transformer Attention Artifacts

Yann LeCun and NYU collaborators dissect two recurring Transformer phenomena: massive activations (extreme channel outliers in specific tokens) and attention sinks (tokens attracting disproportionate attention regardless of relevance).

Key findings:

  • Distinct scopes — massive activations operate globally (implicit model parameters); attention sinks operate locally (head-level attention bias)
  • Pre-norm is the culprit — the pre-norm configuration common in modern Transformers enables both phenomena to co-occur; removing it decouples them
  • Efficiency implications — quantisation, model compression, and KV-cache optimisation can fail silently when these phenomena are disrupted
  • Not fundamental — these are design-dependent artifacts, opening the door to architectural modifications that eliminate them without sacrificing capability
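A quick way to see an attention sink in practice is to look at how much attention each key position receives on average: a sink token's column mass far exceeds the uniform baseline. The heuristic below is an illustrative diagnostic, not the paper's method; the threshold is arbitrary.

```python
import numpy as np

def find_attention_sinks(attn: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Flag key positions attracting disproportionate attention.
    `attn` is one head's (n_queries x n_keys) attention matrix with rows
    summing to 1. A key whose mean incoming attention exceeds `threshold`
    times the uniform baseline 1/n_keys is flagged as a sink."""
    n_keys = attn.shape[1]
    incoming = attn.mean(axis=0)   # average attention each key receives
    baseline = 1.0 / n_keys
    return np.where(incoming > threshold * baseline)[0]
```

In real models the flagged position is very often the first token, regardless of its content — the head-level bias the paper localises.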

Takeaway: Practitioners optimising Transformers for efficiency must account for these phenomena; they are architectural choices, not mathematical necessities.

Paper


5. KARL — Reinforcement Learning for Enterprise Search Agents

Databricks presents KARL, trained via RL across heterogeneous search tasks, achieving state-of-the-art on the newly introduced KARLBench spanning six search domains.

Key features:

  • OAPL post-training paradigm — iterative large-batch off-policy RL robust to trainer/inference discrepancies without clipped importance weighting
  • Multi-task training — covers constraint-driven entity search, cross-document synthesis, tabular reasoning, entity retrieval, procedural reasoning, and fact aggregation
  • Pareto-optimal — built on GLM 4.5 Air as the base model, it outperforms Claude 4.6 and GPT 5.2 on cost-quality and latency-quality tradeoffs
  • Strong scores — KARL-BCP: 59.6 → 70.4 on BrowseComp-Plus with value-guided search; KARL-TREC: 85.0 on TREC-Biogen

Takeaway: Multi-task RL with a purpose-built off-policy training paradigm can surpass closed frontier models on agentic search with sufficient test-time compute.

Paper


6. Memex(RL) — Indexed Experience Memory for Long-Horizon Agents

Long-horizon tasks cause LLM agents to lose track of prior attempts and remaining goals. Memex(RL) introduces an indexed experience memory that scales without discarding evidence or exploding context.

Key features:

  • Indexed experience memory — compact working context with structured summaries and stable indices; full-fidelity interactions stored externally
  • RL-optimised memory operations — MemexRL trains agents to strategically decide what to summarise, archive, index, and retrieve under a context budget
  • Bounded retrieval complexity — theoretical guarantees that decision quality is maintained with bounded retrieval operations as task history grows
  • Better results, smaller context — improved task success rates on long-horizon benchmarks using significantly less working context than baselines
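The memory layout can be sketched as two maps sharing stable indices: compact summaries stay in the working context, full records live in an external store and are fetched only on demand. This is a structural sketch under those assumptions; the RL policy that decides what to summarise, archive, and retrieve is the paper's actual contribution and is not modelled here.

```python
class IndexedMemory:
    """Indexed experience memory: the working context holds only short
    summaries keyed by stable indices, while full-fidelity interaction
    records live in an external store, retrieved on demand."""

    def __init__(self):
        self._store = {}      # index -> full interaction record
        self._summaries = {}  # index -> compact summary kept in context
        self._next = 0

    def archive(self, record: str, summary: str) -> int:
        """Move a full record out of context, keeping a summary behind."""
        idx = self._next
        self._store[idx] = record
        self._summaries[idx] = summary
        self._next += 1
        return idx

    def working_context(self) -> str:
        """What the agent actually sees: bounded, index-addressable."""
        return "\n".join(f"[{i}] {s}" for i, s in self._summaries.items())

    def retrieve(self, idx: int) -> str:
        """Fetch the full-fidelity record when a decision needs it."""
        return self._store[idx]
```

Because indices are stable, retrieval cost depends on how many lookups the agent issues, not on how long the history has grown — the property behind the bounded-retrieval guarantee.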

Takeaway: Strategic memory management, not brute-force context expansion, is the key to scaling agents on complex, long-horizon tasks.

Paper


7. FlashAttention-4 — Co-Designed for Blackwell GPUs

FlashAttention-4 co-designs attention algorithms and kernel pipelines for NVIDIA B200/GB200 GPUs, which have asymmetric hardware scaling (tensor core throughput doubled; other units scaled more slowly).

Key features:

  • Major speedups — up to 1.3× over cuDNN 9.13 and 2.7× over Triton on B200 with BF16; up to 1613 TFLOPS at 71% hardware utilisation
  • Asymmetric scaling solutions — fully asynchronous matrix multiply pipelines, larger tile sizes, software-emulated exponential/conditional softmax rescaling, tensor memory to reduce shared memory traffic
  • Python-native — implemented in CuTe-DSL embedded in Python; 20–30× faster compile times vs. C++ template approaches
  • Architecture-first thinking — Hopper-era optimisations leave significant performance on the table on Blackwell; new hardware demands new algorithms
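The softmax rescaling the kernel pipelines around is the classic online-softmax trick: process scores tile by tile, keeping only a running max and normaliser, rescaling prior state when the max grows. The NumPy sketch below shows the arithmetic for a single query row; it illustrates the rescaling step, not the CUDA kernel itself, and the tile size is arbitrary.

```python
import numpy as np

def tiled_attention(q, K, V, tile: int = 2):
    """One query row of attention computed tile-by-tile with online
    softmax rescaling, so the full score vector never materialises."""
    m = -np.inf                  # running max of scores seen so far
    l = 0.0                      # running softmax normaliser
    acc = np.zeros(V.shape[1])   # unnormalised output accumulator
    for start in range(0, K.shape[0], tile):
        s = K[start:start + tile] @ q     # this tile's raw scores
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)         # rescale prior state to new max
        p = np.exp(s - m_new)             # tile probabilities (unnorm.)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + tile]
        m = m_new
    return acc / l
```

On Blackwell, FlashAttention-4 emulates the exponential and this conditional rescaling in software precisely because those units did not scale with tensor-core throughput.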

Takeaway: Next-generation GPU architectures require ground-up attention kernel redesigns, and Python-native kernel development is now a viable path.

Paper


8. STRUCTUREDAGENT — Hierarchical Planning for Web Tasks

STRUCTUREDAGENT introduces a hierarchical planning framework using dynamic AND/OR trees for long-horizon web tasks. The LLM is invoked only for local operations (node expansion or repair), while the system maintains the full planning tree.

Key features:

  • Structured memory module tracks candidate solutions to improve constraint satisfaction
  • Interpretable hierarchical plans enable easier debugging and human intervention
  • Improved performance on WebVoyager, WebArena, and custom shopping benchmarks vs. standard LLM web agents
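The AND/OR tree at the heart of the framework can be sketched compactly: an AND node succeeds when every child does, an OR node when any child does, and leaves carry completion state that local LLM calls (expansion or repair) would update. This is a structural sketch; the node fields and flags are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One node of an AND/OR planning tree for a long-horizon web task."""
    kind: str                                 # "AND", "OR", or "LEAF"
    children: list = field(default_factory=list)
    done: bool = False                        # set by a local LLM step on leaves

    def solved(self) -> bool:
        if self.kind == "LEAF":
            return self.done
        results = [c.solved() for c in self.children]
        return all(results) if self.kind == "AND" else any(results)
```

Because the system, not the model, owns this tree, the LLM only ever sees one node's local expansion or repair problem — which is also what makes the resulting plans inspectable.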

Takeaway: Separating global plan management from local LLM reasoning improves both performance and interpretability in complex web agent tasks.

Paper


9. AgentIR — Reasoning-Aware Retrieval for Deep Research Agents

Deep research agents generate rich reasoning traces before each search call, but standard retrievers ignore this signal entirely. AgentIR jointly embeds the agent's reasoning trace with its query.

Key features:

  • Reasoning-aware retrieval — jointly embeds reasoning traces and queries for richer search intent signals
  • DR-Synth — a data synthesis method for generating training data from standard QA datasets
  • Strong results — AgentIR-4B achieves 68% accuracy on BrowseComp-Plus with Tongyi-DeepResearch vs. 50% with conventional embedding models twice its size and 37% with BM25
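At its simplest, "reasoning-aware" means the retriever's input is the trace plus the query rather than the bare query. AgentIR learns a joint embedding; the sketch below only shows the input-construction side, and the template and truncation policy are illustrative assumptions.

```python
def reasoning_aware_query(trace: str, query: str, max_trace_chars: int = 500) -> str:
    """Fold the agent's recent reasoning trace into the retrieval input.
    A real system would feed this combined string (or both parts) to the
    retriever's encoder instead of embedding the bare query."""
    context = trace[-max_trace_chars:]  # keep the most recent reasoning
    return f"Reasoning so far: {context}\nSearch query: {query}"
```

Even this naive concatenation conveys constraints the bare query drops (entities already ruled out, partial conclusions), which is the signal the learned encoder exploits.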

Takeaway: Incorporating agent reasoning into the retrieval process is a high-leverage, low-cost improvement for deep research systems.

Paper


10. Think Harder or Know More — Looping vs. Memory in Transformers

This paper studies Transformers with two additions: adaptive per-layer looping (each block iterates its hidden state via a learned halting mechanism) and gated memory banks (additional learned storage).

Key findings:

  • Looping helps maths — adaptive looping primarily benefits mathematical reasoning tasks
  • Memory helps commonsense — gated memory banks recover performance on commonsense reasoning tasks
  • Combined superiority — combining both mechanisms outperforms an iso-FLOP baseline with 3× the number of layers on math benchmarks
  • Layer specialisation — early layers loop minimally and access memory sparingly; later layers do both heavily
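Adaptive per-layer looping follows the familiar adaptive-computation-time pattern: re-apply the same block to the hidden state, accumulate a halting score, and stop once the cumulative mass crosses a threshold. A minimal sketch under those assumptions — `block` and `halt_score` stand in for learned functions, and the paper's exact halting mechanism may differ.

```python
def adaptive_loop(x, block, halt_score, max_loops: int = 8, threshold: float = 0.99):
    """Iterate one block on its own hidden state with a halting mechanism.
    Returns the final state and how many iterations were spent."""
    cumulative = 0.0
    steps = 0
    for _ in range(max_loops):
        x = block(x)                # re-apply the shared block
        cumulative += halt_score(x) # accumulate halting probability
        steps += 1
        if cumulative >= threshold:
            break
    return x, steps
```

The layer-specialisation finding falls out naturally here: a layer whose halting score is high after one pass loops minimally, while later layers spend many iterations.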

Takeaway: Different cognitive demands (computation vs. recall) require different architectural primitives; combining them yields efficiency gains over simply adding more layers.

Paper


Overall Themes This Week

  • Agentic coding & planning: OpenDev, STRUCTUREDAGENT
  • Context & memory management: OpenDev, Memex(RL), SkillNet
  • RL for agents: KARL, Memex(RL)
  • Constraint engineering: AutoHarness
  • Transformer architecture insights: The Spike/Sink, Think Harder or Know More
  • GPU efficiency: FlashAttention-4
  • Retrieval & search: KARL, AgentIR

Bottom line: The week's papers collectively argue that smarter architecture, structured constraints, and disciplined memory management consistently outperform brute-force scaling — whether in context windows, model size, or GPU compute.
