Skip to main content

Command Palette

Search for a command to run...

Top AI Papers of the Week

Updated
4 min read
Top AI Papers of the Week

Read the original article

Top AI Papers of the Week (February 9–15, 2026)

From Elvis Saravia's AI Newsletter


Main Thesis

This week's top AI papers collectively push the frontier on autonomous agents, memory design, reasoning efficiency, multi-agent systems, and domain-specific foundation models — with a strong theme of automating what was previously hand-engineered.


Paper Summaries

1. ALMA — Automated Meta-Learning of Memory Designs

Jeff Clune's group introduces a Meta Agent that automatically discovers memory architectures for agentic systems via open-ended code-space search — no hand-engineering required.

  • Discovers domain-specific structures: affordance graphs, strategy libraries, risk-interaction schemas
  • Outperforms all human-designed baselines: 53.9% vs 48.6% with GPT-5-mini
  • Scales better with experience and transfers across foundation models

Paper


2. LLaDA 2.1 — Discrete Diffusion LLMs with Token Editing

Ant Group's major upgrade to diffusion language models breaks the speed-quality trade-off with Token-to-Token (T2T) editing and the first large-scale RL framework for diffusion LLMs.

  • Speedy Mode (throughput) vs Quality Mode (accuracy) — configurable without swapping models
  • LLaDA 2.1-Flash (100B): 892 tokens/sec on HumanEval+; Mini (16B): 1,587 TPS
  • EBPO (Evidence-Based Policy Optimization) enables stable RL training across 33 benchmarks

Paper


3. SkillRL — Recursive Skill-Augmented Reinforcement Learning

Bridges raw experience and policy improvement through automatic skill discovery, storing reusable behavioral patterns instead of noisy trajectories.

  • Hierarchical SkillBank distills trajectories into high-level skills, reducing token footprint
  • Dual retrieval: general heuristics + task-specific skills
  • Skills and policy co-evolve recursively during training
  • Results: 89.9% on ALFWorld, 72.7% on WebShop, +15.3% over baselines

Paper


4. InftyThink+ — Infinite-Horizon Reasoning via RL

Addresses quadratic cost, context limits, and lost-in-the-middle degradation in long chain-of-thought with learned iterative reasoning boundaries.

  • Model autonomously decides when to summarize and what to preserve
  • Two-stage training: supervised cold-start → trajectory-level GRPO
  • +21% accuracy on AIME24 (29.5% → 50.9%) vs vanilla long-CoT RL (38.8%)
  • 50% token reduction with efficiency reward, minimal accuracy trade-off

Paper


5. Agyn — Multi-Agent Software Engineering as Organizational Process

Models software engineering as a team-based organizational workflow with four specialized agents (manager, researcher, engineer, reviewer) — not a monolithic pipeline.

  • Role-specific model routing: large models for reasoning, smaller code-specialized models for implementation
  • Dynamic coordination: manager adapts workflow based on intermediate outcomes
  • 72.2% on SWE-bench 500 without SWE-bench tuning, +7.4% over single-agent baselines

Paper


6. EchoJEPA — Foundation Model for Echocardiography

Latent predictive model trained on 18 million echocardiograms from 300,000 patients, learning to ignore ultrasound noise via JEPA-style objectives.

  • ~20% improvement on LV ejection fraction estimation; ~17% on RV systolic pressure
  • 79% view classification accuracy with 1% labeled data (vs 42% baseline with 100%)
  • Only 2% degradation under acoustic perturbations (vs 17% for competitors)
  • Zero-shot pediatric performance exceeds fully fine-tuned baselines

Paper


7. AdaptEvolve — Confidence-Driven Model Routing for Agentic Loops

Uses intrinsic generation confidence to dynamically route between small and large models during iterative refinement, cutting costs without sacrificing accuracy.

  • ~38% inference cost reduction while retaining ~97.5% of full-model accuracy
  • Model-agnostic, no task-specific tuning required
  • Practical for production agent loops where LLM calls compound rapidly

Paper


8. Gaia2 — Dynamic Agent Benchmark by Meta FAIR

Next-generation benchmark where environments change independently of agent actions, forcing agents to handle temporal pressure, uncertainty, and multi-agent coordination.

  • GPT-5 leads at 42% pass@1; Kimi-K2 leads open-source at 21%
  • Built on open-source Agents Research Environments (ARE) platform
  • Paradigm shift from static to dynamic agentic evaluation

Paper


9. AgentArk — Distilling Multi-Agent Debate into a Single LLM

Transfers reasoning and self-correction abilities of multi-agent debate systems into one model at training time via three hierarchical distillation strategies.

  • Average +4.8% improvement over single-agent baselines on math/reasoning benchmarks
  • Cross-family distillation (e.g., Qwen3-32B → LLaMA-3-8B) yields largest gains
  • Approaches full multi-agent performance at a fraction of inference cost

Paper


10. AgentSkiller — Scaling Tool-Use Agents via Data Quality

Generates 11K high-quality synthetic trajectories across diverse tool-use scenarios, demonstrating data quality beats parameter count.

  • 14B model beats GPT-o3 on tau2-bench: 79.1% vs 68.4%
  • 4B variant outperforms 70B and 235B models
  • Semantic integration across domains is the key differentiator

Paper


Key Takeaways

ThemeInsight
Meta-learningALMA shows memory design itself can be automated, not just learned
Speed-quality trade-offsLLaDA 2.1 and AdaptEvolve both offer configurable knobs between cost and accuracy
Reasoning efficiencyInftyThink+ proves bounded-context iterative reasoning beats long-CoT at lower cost
Organizational AIAgyn demonstrates team structure and role design may matter as much as model size
Data > ParametersAgentSkiller's 4B model beats 235B models — quality data synthesis is a force multiplier
Medical AI at scaleEchoJEPA sets a new bar for label-efficient, robust clinical foundation models

Note: Exact arXiv links were not provided in the source article. Links above are placeholders — check arxiv.org for the specific papers.

Infographic

Infographic wide

More from this blog

A

AI with Alex & Angus

102 posts