<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[AI with Alex & Angus]]></title><description><![CDATA[This blog started as a way to keep up with a fast-moving field without losing my mind. Now Angus (my AI collaborator) does the reading, we do the distilling together, and you get the good stuff.]]></description><link>https://rzem.guru</link><image><url>https://cdn.hashnode.com/uploads/logos/68f98b977a2367a3b72e817c/b2e619d7-9ece-44f8-aeba-39b05532a27c.jpg</url><title>AI with Alex &amp; Angus</title><link>https://rzem.guru</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 17:06:34 GMT</lastBuildDate><atom:link href="https://rzem.guru/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[512,000 Lines of Leaked Code Reveal the Lock-In Strategy Coming for Your AI Stack]]></title><description><![CDATA[Read the original article
512,000 Lines of Leaked Code Reveal Anthropic's Lock-In Strategy
Main Thesis
Anthropic accidentally published ~500,000 lines of Claude Code source code via a packaging error. Buried within it is evidence of an unannounced al...]]></description><link>https://rzem.guru/512000-lines-of-leaked-code-reveal-the-lock-in-strategy-coming-for-your-ai-stack</link><guid isPermaLink="true">https://rzem.guru/512000-lines-of-leaked-code-reveal-the-lock-in-strategy-coming-for-your-ai-stack</guid><category><![CDATA[AI]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 12:03:26 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a958ee3/8FeFw4j8oq4y_NA3JODA7_yIGxWFn1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://natesnewsletter.substack.com/p/the-platform-play-hidden-in-512000">Read the original article</a></p>
<h2 id="heading-512000-lines-of-leaked-code-reveal-anthropics-lock-in-strategy">512,000 Lines of Leaked Code Reveal Anthropic's Lock-In Strategy</h2>
<h3 id="heading-main-thesis">Main Thesis</h3>
<p>Anthropic accidentally published ~500,000 lines of Claude Code source code via a packaging error. Buried within it is evidence of an unannounced always-on agent called <strong>Conway</strong> — and when combined with Anthropic's recent product moves, it reveals a deliberate <strong>platform lock-in strategy</strong> comparable to historical tech monopoly plays.</p>
<hr />
<h3 id="heading-key-findings">Key Findings</h3>
<h4 id="heading-what-is-conway">What is Conway?</h4>
<ul>
<li>A <strong>standalone agent environment</strong> (separate from Claude chat)</li>
<li>Always-on: can be <strong>woken by external events</strong></li>
<li>Has <strong>browser control</strong> and integrations with third-party tools</li>
<li>Supports its own <strong>proprietary extension format</strong> (<code>.cnw.zip</code>)</li>
<li>Not publicly announced — discovered only through the leak</li>
</ul>
<h4 id="heading-the-five-strategic-moves">The Five Strategic Moves</h4>
<p>Nate connects Conway to five other Anthropic initiatives as a unified platform play:</p>
<ol>
<li><strong>Claude Code Channels</strong> — deepening developer workflow integration</li>
<li><strong>Cowork</strong> — collaborative agent environments</li>
<li><strong>The Marketplace</strong> — ecosystem of tools/extensions</li>
<li><strong>The Partner Network</strong> — third-party lock-in via certified integrations</li>
<li><strong>The OpenClaw ban</strong> — controlling what agents can connect to</li>
</ol>
<h4 id="heading-the-cnwzip-question">The <code>.cnw.zip</code> Question</h4>
<ul>
<li>Conway's proprietary extension format sits <strong>on top of MCP</strong> (Model Context Protocol)</li>
<li>Nate compares this to the <strong>Google Play Services playbook</strong>: open standard underneath, proprietary layer on top that becomes the real dependency</li>
<li>Tool builders targeting Conway's format become dependent on Anthropic's ecosystem</li>
</ul>
<h4 id="heading-the-lock-in-nobodys-talking-about">The Lock-In Nobody's Talking About</h4>
<ul>
<li>An always-on agent that learns your workflows, preferences, and organizational context builds <strong>behavioral memory</strong></li>
<li>This creates switching costs <strong>deeper than anything Microsoft or Salesforce built</strong> — because it's not just data, it's <em>learned context</em> about how your team thinks and operates</li>
<li>Moving away means losing an AI that has internalized your organization</li>
</ul>
<hr />
<h3 id="heading-practical-takeaways">Practical Takeaways</h3>
<ul>
<li><strong>Map your platform dependencies</strong> before Conway-style agents become default infrastructure</li>
<li><strong>Negotiate portability clauses</strong> in enterprise AI contracts now, before lock-in is established</li>
<li><strong>Choose your agent memory architecture deliberately</strong> — don't let vendor defaults make that decision for you</li>
<li>Nate provides <strong>three prompts</strong> to help teams act on each of these steps</li>
<li>The historical parallel: companies that ignored similar platform consolidation moves in prior tech cycles paid dearly — treat this as an early warning signal</li>
</ul>
<hr />
<h3 id="heading-bottom-line">Bottom Line</h3>
<p>Conway isn't just a product feature — it's Anthropic's bid to become the <strong>operating system layer for enterprise AI</strong>. The leak revealed the strategy before the announcement. Teams deploying AI at scale should be paying close attention now.</p>
<p><img src="https://v3b.fal.media/files/b/0a958ee3/Ovj6DqJfkPLEsM1pePiKK_TJwi2yLz.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a958ee3/8FeFw4j8oq4y_NA3JODA7_yIGxWFn1.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents Weekly: GPT-5.3 Codex Spark]]></title><description><![CDATA[Read the original article
AI Agents Weekly: GPT-5.3-Codex-Spark & More — Summary
From Elvis Saravia's AI Newsletter, February 14, 2026

Main Thesis
This issue covers a packed week in AI agents and frontier models, headlined by OpenAI's new agentic co...]]></description><link>https://rzem.guru/ai-agents-weekly-gpt-53-codex-spark-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-weekly-gpt-53-codex-spark-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:28:52 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a95871a/mY_AgKJEQP8vrvMMfea87_0gXWUbb9.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/ai-agents-weekly-gpt-53-codex-spark">Read the original article</a></p>
<h1 id="heading-ai-agents-weekly-gpt-53-codex-spark-amp-more-summary">AI Agents Weekly: GPT-5.3-Codex-Spark &amp; More — Summary</h1>
<p><em>From Elvis Saravia's AI Newsletter, February 14, 2026</em></p>
<hr />
<h2 id="heading-main-thesis">Main Thesis</h2>
<p>This issue covers a packed week in AI agents and frontier models, headlined by OpenAI's new agentic coding model, Zhipu AI's open-source powerhouse, and a wave of breakthroughs across autonomous systems, benchmarks, and developer tooling.</p>
<hr />
<h2 id="heading-key-stories-accessible-content">Key Stories (Accessible Content)</h2>
<h3 id="heading-gpt-53-codex-spark-openai">🔥 GPT-5.3-Codex-Spark (OpenAI)</h3>
<ul>
<li>OpenAI's <strong>most capable agentic coding model</strong> to date, running <strong>25% faster</strong> than its predecessor.</li>
<li><strong>Self-developing:</strong> Early versions of GPT-5.3 were used to debug its own training, manage deployment, and interpret evaluation results — making it the first OpenAI model instrumental in its own creation.</li>
<li><strong>Beyond coding:</strong> Handles professional knowledge-work outputs including presentations, spreadsheets, and documentation. Wins or ties <strong>70.9%</strong> of evaluations on the GDPval knowledge-work benchmark.</li>
<li><strong>Cybersecurity flag:</strong> First OpenAI model to hit <strong>"high" cybersecurity capability</strong> under their Preparedness Framework — meaning it could meaningfully enable real-world cyber harm if misused. OpenAI responded by announcing a <strong>$10M API credits program</strong> for cyber defense research.</li>
</ul>
<hr />
<h3 id="heading-glm-5-zhipu-ai">🧠 GLM-5 (Zhipu AI)</h3>
<ul>
<li>A massive <strong>744B-parameter Mixture-of-Experts (MoE)</strong> model with <strong>40B active parameters</strong>, built specifically for agentic intelligence and multi-step reasoning.</li>
<li><strong>Hardware independence:</strong> Trained entirely on <strong>Huawei Ascend chips</strong> using the <strong>MindSpore framework</strong> — no US-manufactured semiconductors involved.</li>
<li><strong>Agent Mode:</strong> Native autonomous task decomposition, breaking high-level goals into subtasks with minimal human input. Can convert raw prompts into polished <code>.docx</code>, <code>.pdf</code>, and <code>.xlsx</code> documents.</li>
<li><strong>Training scale:</strong> Pre-trained on <strong>28.5 trillion tokens</strong> (a 23.9% increase over GLM-4.7). Uses a novel RL technique achieving record-low hallucination rates.</li>
<li><strong>Open &amp; affordable:</strong> Released under <strong>MIT license</strong> with open weights. Available on OpenRouter at ~<strong>$0.80/M input tokens</strong> and <strong>$2.56/M output tokens</strong> — roughly <strong>6× cheaper</strong> than comparable proprietary models.</li>
</ul>
<hr />
<h2 id="heading-other-headlines-paywalled-titles-only">Other Headlines (Paywalled — Titles Only)</h2>
<ul>
<li><strong>MiniMax M2.5</strong> — New open-source model drop</li>
<li><strong>Recursive Language Models</strong> — Replacing context stuffing</li>
<li><strong>OpenAI ships 1M lines</strong> with zero manual code</li>
<li><strong>Agentica</strong> pushes ARC-AGI-2 with recursive agents</li>
<li><strong>Chrome WebMCP</strong> early preview launched</li>
<li><strong>Anthropic raises $30B</strong> at a $380B valuation</li>
<li><strong>Excalidraw</strong> launches official MCP server</li>
<li><strong>Hive agent framework</strong> evolves at runtime</li>
<li><strong>Waymo</strong> begins 6th-gen autonomous operations</li>
<li><strong>Gemini 3 Deep Think</strong> solves 18 open mathematical problems</li>
</ul>
<hr />
<h2 id="heading-practical-takeaways">Practical Takeaways</h2>
<ol>
<li><strong>Agentic coding is maturing fast</strong> — GPT-5.3-Codex-Spark sets a new bar for autonomous software development, including self-referential model improvement.</li>
<li><strong>Open-source is competitive</strong> — GLM-5 challenges proprietary frontier models at a fraction of the cost, with full hardware sovereignty.</li>
<li><strong>Cybersecurity risk is real</strong> — As models hit "high" capability thresholds, responsible deployment frameworks and defense investment are becoming non-negotiable.</li>
<li><strong>Agent infrastructure is exploding</strong> — MCP servers, agentic frameworks, and recursive agent architectures are rapidly becoming standard developer tooling.</li>
<li><strong>Hardware geopolitics matter</strong> — GLM-5's Huawei Ascend training stack signals a maturing alternative AI hardware ecosystem outside US supply chains.</li>
</ol>
<hr />
<p><em>Note: No arXiv papers were linked or cited in the accessible portion of this article.</em></p>
<p><img src="https://v3b.fal.media/files/b/0a95871a/wmMAQnbjFKr9s93Co1Nua_UgzzJLjS.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a95871a/mY_AgKJEQP8vrvMMfea87_0gXWUbb9.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Top AI Papers of the Week]]></title><description><![CDATA[Read the original article
Top AI Papers of the Week (February 9–15, 2026)
From Elvis Saravia's AI Newsletter
This week's roundup covers ten significant AI research papers spanning agentic memory design, diffusion language models, reinforcement learni...]]></description><link>https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:27:59 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a958715/aN3PBUNsOQ9xMEiSe5mND_bdHRQfYJ.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-544">Read the original article</a></p>
<h1 id="heading-top-ai-papers-of-the-week-february-915-2026">Top AI Papers of the Week (February 9–15, 2026)</h1>
<p><em>From Elvis Saravia's AI Newsletter</em></p>
<p>This week's roundup covers ten significant AI research papers spanning agentic memory design, diffusion language models, reinforcement learning, medical AI, and multi-agent benchmarking.</p>
<hr />
<h2 id="heading-1-alma-automated-meta-learning-of-memory-designs-for-agentic-systems">1. ALMA — Automated Meta-Learning of Memory Designs for Agentic Systems</h2>
<p><strong>Main Thesis:</strong> Instead of hand-engineering memory modules for AI agents, ALMA uses a Meta Agent to automatically discover optimal memory architectures through open-ended exploration in code space.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Searches over database schemas, retrieval mechanisms, and update strategies as executable code</li>
<li>Discovers domain-specific memory structures: affordance graphs (ALFWorld), task signature databases (TextWorld), strategy libraries (Baba Is AI), risk-interaction schemas (MiniHack)</li>
<li>Achieves 12.3% avg success with GPT-5-nano vs 8.6% for best human baseline; 53.9% with GPT-5-mini vs 48.6%</li>
<li>Designs scale better with experience and transfer across foundation models</li>
</ul>
<p><strong>Practical Takeaway:</strong> Memory design for agents can be automated — no more hand-crafted modules needed. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
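<p>As a rough illustration of the idea (not ALMA's actual code), the sketch below treats candidate memory designs as executable Python classes and scores each on a toy retrieval task; ALMA's Meta Agent would generate and mutate such candidates in code space rather than choose from a fixed pool. All class and task names here are invented for the example.</p>
<pre><code class="lang-python"># Illustrative sketch (not ALMA's code): memory designs as executable
# classes, each scored on a toy retrieval task; the best one is kept.
class ListMemory:
    """Baseline: keep raw observations, retrieve the most recent k."""
    def __init__(self):
        self.items = []
    def write(self, obs):
        self.items.append(obs)
    def read(self, query, k=3):
        return self.items[-k:]

class KeyedMemory:
    """Candidate: index observations by task signature for exact reuse."""
    def __init__(self):
        self.table = {}
    def write(self, obs):
        self.table[obs.split(":")[0]] = obs
    def read(self, query, k=3):
        key = query.split(":")[0]
        return [self.table[key]] if key in self.table else []

def evaluate(memory_cls, facts, probes):
    """Fraction of probes whose needed fact is surfaced by retrieval."""
    mem = memory_cls()
    for fact in facts:
        mem.write(fact)
    hits = sum(any(fact in r for r in mem.read(q)) for fact, q in probes)
    return hits / len(probes)

facts = [f"task{i}: use the lever" for i in range(20)]
probes = [(facts[i], f"task{i}: door is locked") for i in range(20)]
best = max([ListMemory, KeyedMemory], key=lambda c: evaluate(c, facts, probes))
print("winning design:", best.__name__)   # KeyedMemory on this toy task
</code></pre>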
<hr />
<h2 id="heading-2-llada-21-discrete-diffusion-language-model-upgrade">2. LLaDA 2.1 — Discrete Diffusion Language Model Upgrade</h2>
<p><strong>Main Thesis:</strong> Ant Group's LLaDA 2.1 breaks the speed-quality trade-off in diffusion LLMs via Token-to-Token (T2T) editing and the first large-scale RL framework for diffusion models.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>T2T editing allows correction of already-generated tokens, not just unmasking</li>
<li>Two modes: <strong>Speedy Mode</strong> (max throughput) and <strong>Quality Mode</strong> (benchmark accuracy)</li>
<li>LLaDA 2.1-Flash (100B) hits 892 tokens/sec on HumanEval+; Mini (16B) peaks at 1,587 tokens/sec</li>
<li>Introduces <strong>EBPO</strong> (Evidence-Based Policy Optimization) for stable RL training across 33 benchmarks</li>
</ul>
<p><strong>Practical Takeaway:</strong> Diffusion LLMs can now rival autoregressive models on both speed and quality with a configurable trade-off knob. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-3-skillrl-recursive-skill-augmented-reinforcement-learning">3. SkillRL — Recursive Skill-Augmented Reinforcement Learning</h2>
<p><strong>Main Thesis:</strong> SkillRL bridges raw experience and policy improvement by distilling trajectories into reusable high-level behavioral skills that co-evolve with the agent policy.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Hierarchical <strong>SkillBank</strong> extracts reusable patterns from raw trajectories, reducing token footprint</li>
<li>Dual retrieval strategy combines general heuristics with task-specific skills</li>
<li>Recursive co-evolution: better skills → better performance → better training data</li>
<li><strong>89.9%</strong> success on ALFWorld, <strong>72.7%</strong> on WebShop, <strong>47.1%</strong> avg on search-augmented QA — outperforming baselines by 15.3% on average</li>
</ul>
<p><strong>Practical Takeaway:</strong> Storing distilled skills rather than raw trajectories dramatically improves agent efficiency and scalability. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
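<p>A minimal sketch of the SkillBank idea, with invented interfaces: distill only successful trajectories into compact skill strings, and retrieve a mix of general heuristics and task-specific entries at decision time. The trivial distillation rule is illustrative, not the paper's implementation.</p>
<pre><code class="lang-python"># SkillBank-style store with invented interfaces: keep distilled skills,
# not raw trajectories, and retrieve general + task-specific entries.
from collections import defaultdict

class SkillBank:
    def __init__(self):
        self.general = []                    # cross-task heuristics
        self.by_task = defaultdict(list)     # task-specific skills

    def distill(self, task: str, trajectory: list, succeeded: bool):
        """Compress a successful trajectory to its action pattern,
        a fraction of the tokens the raw trace would cost."""
        if succeeded:
            skill = " -> ".join(step["action"] for step in trajectory)
            self.by_task[task].append(skill)

    def retrieve(self, task: str, k: int = 2) -> list:
        """Dual retrieval: general heuristics plus task-specific skills."""
        return self.general[:k] + self.by_task[task][-k:]

bank = SkillBank()
bank.general.append("inspect objects before using them")
bank.distill("heat egg", [{"action": "open microwave"},
                          {"action": "put egg in microwave"}], succeeded=True)
print(bank.retrieve("heat egg"))
</code></pre>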
<hr />
<h2 id="heading-4-inftythink-infinite-horizon-reasoning-via-rl">4. InftyThink+ — Infinite-Horizon Reasoning via RL</h2>
<p><strong>Main Thesis:</strong> InftyThink+ solves the quadratic cost, context length, and lost-in-the-middle problems of long chain-of-thought reasoning by training models to autonomously segment, summarize, and resume reasoning.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Decomposes reasoning into iterations connected by self-generated summaries</li>
<li>Two-stage training: supervised cold-start on format → trajectory-level GRPO optimization</li>
<li><strong>21-point accuracy gain</strong> on AIME24 (29.5% → 50.9%) vs vanilla long-CoT RL (38.8%)</li>
<li>Adding an efficiency reward cuts token usage by <strong>50%</strong> with modest accuracy trade-off</li>
<li>Generalizes to GPQA Diamond and AIME25</li>
</ul>
<p><strong>Practical Takeaway:</strong> Teaching models when and how to summarize mid-reasoning dramatically improves both accuracy and inference speed. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
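<p>The loop itself is simple to sketch. Below is a hedged outline of segment-summarize-resume, where <code>llm()</code> is a hypothetical stand-in for any chat-completion call and the prompt format is an assumption, not the paper's template:</p>
<pre><code class="lang-python"># Minimal sketch of segment-summarize-resume; llm() is a hypothetical
# stand-in for any chat-completion call, and the prompt format is an
# assumption, not the paper's template.
def llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model API")

def iterative_reason(question: str, max_iters: int = 8) -> str:
    summary = ""
    for _ in range(max_iters):
        prompt = (
            f"Question: {question}\n"
            f"Progress so far: {summary or '(none)'}\n"
            "Reason in a bounded segment. End with either\n"
            "SUMMARY: &lt;state needed to resume&gt; or ANSWER: &lt;final answer&gt;."
        )
        out = llm(prompt)
        if "ANSWER:" in out:
            return out.split("ANSWER:", 1)[1].strip()
        # carry forward only the compact summary, never the full chain,
        # so each iteration's context stays short
        summary = out.split("SUMMARY:", 1)[1].strip()
    return summary
</code></pre>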
<hr />
<h2 id="heading-5-agyn-multi-agent-software-engineering-system">5. Agyn — Multi-Agent Software Engineering System</h2>
<p><strong>Main Thesis:</strong> Agyn models software engineering as an organizational process with specialized agents in distinct roles, achieving strong results without SWE-bench tuning.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Four agents: manager, researcher, engineer, reviewer — each with role-specific tools and models</li>
<li>Reasoning-heavy roles use larger models; implementation roles use smaller, code-specialized models</li>
<li>Dynamic workflow: manager decides iteration cycles based on intermediate outcomes</li>
<li><strong>72.2%</strong> task resolution on SWE-bench 500, outperforming single-agent baselines by <strong>7.4%</strong></li>
</ul>
<p><strong>Practical Takeaway:</strong> Organizational design and agent infrastructure may matter as much as model quality for autonomous software engineering. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-6-echojepa-cardiac-foundation-model">6. EchoJEPA — Cardiac Foundation Model</h2>
<p><strong>Main Thesis:</strong> EchoJEPA is a JEPA-style foundation model trained on 18 million echocardiograms that learns clinically meaningful cardiac representations by predicting in latent space rather than pixel space.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Trained on 18M echos from 300K patients; ignores speckle noise and acoustic artifacts</li>
<li><strong>~20% improvement</strong> in left ventricular ejection fraction estimation; <strong>~17%</strong> in right ventricular systolic pressure estimation</li>
<li><strong>79% view classification accuracy</strong> with only 1% labeled data (best baseline: 42% with full data)</li>
<li>Only <strong>2% degradation</strong> under acoustic perturbations vs 17% for competitors</li>
<li>Zero-shot performance on pediatric patients exceeds fine-tuned baselines</li>
</ul>
<p><strong>Practical Takeaway:</strong> Latent-space predictive learning at scale produces robust, label-efficient cardiac AI that generalizes across patient populations. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
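<p>To make "predicting in latent space rather than pixel space" concrete, here is a minimal NumPy sketch of a JEPA-style objective: a predictor regresses the latent of a masked target from the latent of the visible context. The tiny linear encoders are stand-ins; EchoJEPA's architecture and training loop are far more involved.</p>
<pre><code class="lang-python"># Minimal NumPy sketch of a JEPA-style objective (illustrative only):
# regress the *latent* of a masked target from the context's latent,
# never reconstructing pixels.
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_ctx = rng.normal(size=(d, d))    # context encoder (trained)
W_tgt = rng.normal(size=(d, d))    # target encoder (EMA copy in practice)
P = rng.normal(size=(d, d))        # predictor

def jepa_loss(x_context: np.ndarray, x_target: np.ndarray) -> float:
    z_ctx = np.tanh(W_ctx @ x_context)   # latent of visible context
    z_tgt = np.tanh(W_tgt @ x_target)    # latent of the masked region
    z_hat = P @ z_ctx                    # predict target latent from context
    return float(np.mean((z_hat - z_tgt) ** 2))   # loss lives in latent space

x = rng.normal(size=d)
print(jepa_loss(x, x + 0.1 * rng.normal(size=d)))
</code></pre>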
<hr />
<h2 id="heading-7-adaptevolve-confidence-driven-model-routing-for-agentic-systems">7. AdaptEvolve — Confidence-Driven Model Routing for Agentic Systems</h2>
<p><strong>Main Thesis:</strong> AdaptEvolve reduces the cost of iterative LLM-based refinement loops by dynamically routing easy sub-problems to smaller models and hard decisions to frontier models based on intrinsic generation confidence.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Monitors real-time generation confidence scores — no external controller needed</li>
<li>Cuts inference costs by <strong>~38%</strong> while retaining <strong>~97.5%</strong> of upper-bound accuracy</li>
<li>Model-agnostic and requires no task-specific tuning</li>
<li>Makes evolutionary agent workflows viable for production deployment</li>
</ul>
<p><strong>Practical Takeaway:</strong> Confidence-based routing is a practical, plug-in efficiency mechanism for any iterative agentic pipeline. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
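<p>The routing rule reduces to a few lines. This sketch uses hypothetical model functions that return (text, confidence) pairs; AdaptEvolve reads intrinsic generation confidence from the model itself rather than from a wrapper like this:</p>
<pre><code class="lang-python"># Sketch with hypothetical model functions returning (text, confidence);
# AdaptEvolve reads intrinsic generation confidence from the model itself.
def route_step(prompt, small_model, large_model, threshold=0.8):
    """Try the cheap model first; escalate only when its own confidence
    falls below the threshold."""
    draft, confidence = small_model(prompt)
    if confidence >= threshold:
        return draft, "small"
    answer, _ = large_model(prompt)
    return answer, "large"

# toy stand-ins so the sketch runs end to end
small = lambda p: ("quick draft", 0.55 if "hard" in p else 0.93)
large = lambda p: ("careful answer", 0.99)

print(route_step("easy subtask", small, large))   # ('quick draft', 'small')
print(route_step("hard subtask", small, large))   # ('careful answer', 'large')
</code></pre>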
<hr />
<h2 id="heading-8-gaia2-dynamic-agent-benchmark-from-meta-fair">8. Gaia2 — Dynamic Agent Benchmark from Meta FAIR</h2>
<p><strong>Main Thesis:</strong> Gaia2 moves beyond static benchmarks by introducing environments that change independently of agent actions, testing temporal pressure, uncertainty, and multi-agent coordination.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>GPT-5 leads at <strong>42% pass@1</strong> but struggles with time-constrained tasks</li>
<li>Kimi-K2 leads open-source models at <strong>21%</strong></li>
<li>Built on open-source Agents Research Environments (ARE) with action-level verifiers</li>
<li>Represents a paradigm shift toward dynamic agentic evaluation</li>
</ul>
<p><strong>Practical Takeaway:</strong> Current frontier models still struggle significantly with dynamic, time-pressured environments — a major open research challenge. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-9-agentark-distilling-multi-agent-debate-into-a-single-llm">9. AgentArk — Distilling Multi-Agent Debate into a Single LLM</h2>
<p><strong>Main Thesis:</strong> AgentArk transfers the reasoning and self-correction abilities of multi-agent debate systems into a single model at training time, achieving near-multi-agent performance at a fraction of the cost.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Three distillation strategies: reasoning-enhanced SFT, trajectory-based augmentation, process-aware distillation with a process reward model</li>
<li>Average <strong>4.8% improvement</strong> over single-agent baselines across math and reasoning benchmarks</li>
<li>Cross-family distillation (e.g., Qwen3-32B → LLaMA-3-8B) yields the largest gains</li>
<li>Approaches full multi-agent performance at single-model inference cost</li>
</ul>
<p><strong>Practical Takeaway:</strong> You don't need to run multiple agents at inference time — their reasoning capabilities can be baked into a single smaller model. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
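<p>As a hedged sketch of the first strategy (reasoning-enhanced SFT), the snippet below flattens a multi-agent debate transcript into a single training pair whose target preserves the critique-and-revise structure. The transcript format is assumed for illustration:</p>
<pre><code class="lang-python"># Hedged sketch of reasoning-enhanced SFT distillation: flatten a debate
# transcript into one training pair whose target keeps the
# critique-and-revise structure. Transcript format is assumed.
def debate_to_sft(question, rounds, final_answer):
    """rounds: list of (agent_name, argument) tuples in debate order."""
    trace = [f"[{agent}] {argument}" for agent, argument in rounds]
    target = "\n".join(trace + [f"Final answer: {final_answer}"])
    return {"prompt": question, "completion": target}

example = debate_to_sft(
    "Is 91 prime?",
    [("Agent A", "91 = 7 x 13, so it is composite."),
     ("Agent B", "Check: 7 x 13 = 91. Agreed, composite.")],
    "No, 91 is not prime.",
)
print(example["completion"])
</code></pre>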
<hr />
<h2 id="heading-10-agentskiller-scaling-generalist-tool-use-agents-via-data-quality">10. AgentSkiller — Scaling Generalist Tool-Use Agents via Data Quality</h2>
<p><strong>Main Thesis:</strong> AgentSkiller demonstrates that semantically integrated, high-quality synthetic training data matters more than parameter count for building strong tool-use agents.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Produces <strong>11K high-quality synthetic trajectories</strong> across diverse tool-use scenarios</li>
<li>14B model beats GPT-o3 on tau2-bench (<strong>79.1% vs 68.4%</strong>)</li>
<li>4B variant outperforms 70B and 235B models</li>
<li>Semantic integration across domains is the key differentiator</li>
</ul>
<p><strong>Practical Takeaway:</strong> Invest in data quality and semantic diversity — smaller, well-trained models can outperform much larger ones on agentic tool-use tasks. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-overall-themes-this-week">Overall Themes This Week</h2>
<ol>
<li><strong>Automation of agent design</strong> — from memory (ALMA) to skills (SkillRL) to multi-agent reasoning (AgentArk)</li>
<li><strong>Efficiency without quality loss</strong> — AdaptEvolve, LLaDA 2.1, and InftyThink+ all offer speed-accuracy knobs</li>
<li><strong>Data quality over scale</strong> — AgentSkiller challenges the parameter-scaling assumption</li>
<li><strong>Medical AI at foundation scale</strong> — EchoJEPA sets a new bar for label-efficient clinical models</li>
<li><strong>Dynamic benchmarking</strong> — Gaia2 pushes evaluation beyond static tasks toward real-world agentic challenges</li>
</ol>
<p><img src="https://v3b.fal.media/files/b/0a958715/vmwhY2Q_gWizalwd3IvOb_mMqg2W5X.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a958715/aN3PBUNsOQ9xMEiSe5mND_bdHRQfYJ.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents Weekly: Claude Sonnet]]></title><description><![CDATA[Read the original article
AI Agents Weekly: Claude Sonnet 4.6, Gemini 3.1 Pro & More
Overview
Elvis Saravia's AI Agents Weekly newsletter (Feb 21, 2026) covers a packed week of major AI releases and agent-focused developments, highlighting significan...]]></description><link>https://rzem.guru/ai-agents-weekly-claude-sonnet-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-weekly-claude-sonnet-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:26:27 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a95870b/VtCrODVCy9-EE-XeZdMpf_23vNvNMq.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-sonnet-46">Read the original article</a></p>
<h1 id="heading-ai-agents-weekly-claude-sonnet-46-gemini-31-pro-amp-more">AI Agents Weekly: Claude Sonnet 4.6, Gemini 3.1 Pro &amp; More</h1>
<h2 id="heading-overview">Overview</h2>
<p>Elvis Saravia's AI Agents Weekly newsletter (Feb 21, 2026) covers a packed week of major AI releases and agent-focused developments, highlighting significant leaps in autonomous computer use, coding agents, and AI benchmarking.</p>
<hr />
<h2 id="heading-top-stories-accessible-content">🔑 Top Stories (Accessible Content)</h2>
<h3 id="heading-1-claude-sonnet-46-anthropic">1. Claude Sonnet 4.6 — Anthropic</h3>
<p>Anthropic released <strong>Claude Sonnet 4.6</strong> as the new default model for all Claude users on February 17, 2026.</p>
<ul>
<li><strong>Computer Use Breakthrough:</strong> OSWorld benchmark scores jumped from <strong>14.9% → 72.5%</strong> (nearly <strong>5x improvement</strong>), making it the most capable model for autonomous GUI-based agent workflows.</li>
<li><strong>1M Token Context Window (Beta):</strong> Enables agents to process entire codebases, long documents, and multi-session histories without losing earlier context.</li>
<li><strong>User Preference:</strong> In blind A/B tests, users preferred Sonnet 4.6 over Sonnet 4.5 ~<strong>70% of the time</strong>, particularly for coding, instruction following, and nuanced reasoning.</li>
<li><strong>Pricing:</strong> $3/$15 per million input/output tokens — cost-efficient for high-volume agent deployments.</li>
</ul>
<p><strong>Practical Takeaway:</strong> Sonnet 4.6 is a significant upgrade for anyone building agentic or computer-use workflows — the 5x OSWorld improvement alone makes it a compelling default choice.</p>
<hr />
<h3 id="heading-2-evmbench-ai-agents-vs-smart-contract-security">2. EVMBench — AI Agents vs. Smart Contract Security</h3>
<p>OpenAI and Paradigm introduced <strong>EVMBench</strong>, a benchmark evaluating AI agents on smart contract security tasks across <strong>120 curated vulnerabilities from 40 audits</strong>.</p>
<ul>
<li><strong>Three Tasks:</strong> Detect, patch, and exploit high-severity smart contract vulnerabilities.</li>
<li><strong>Exploit-First Strength:</strong> Agents perform best at exploitation (where the goal is clear — drain funds) but struggle with exhaustive detection and patching tasks.</li>
<li><strong>Real-World Sources:</strong> Scenarios sourced from open code-audit competitions and from security audits of the <strong>Tempo</strong> blockchain (a purpose-built L1 for high-throughput stablecoin payments).</li>
<li><strong>Key Limitation:</strong> Agents often stop after finding a single vulnerability rather than auditing comprehensively — a critical gap for security-critical deployments.</li>
</ul>
<p><strong>Practical Takeaway:</strong> AI agents show promise for exploit discovery but are not yet reliable for full-coverage security auditing. Human oversight remains essential in smart contract security workflows.</p>
<hr />
<h2 id="heading-other-headlines-paywalled">📰 Other Headlines (Paywalled)</h2>
<p>The following stories are referenced but behind the paywall:</p>
<ul>
<li><strong>Gemini 3.1 Pro</strong> — Google launches with 77% ARC-AGI-2 score</li>
<li><strong>Stripe Minions</strong> — Coding agents deployed at scale</li>
<li><strong>Cloudflare Code Mode MCP</strong> — Claims 99.9% token savings</li>
<li><strong>Qwen 3.5</strong> — Alibaba drops agentic vision model</li>
<li><strong>ggml.ai joins Hugging Face</strong> — Local AI integration</li>
<li><strong>Anthropic measures AI agent autonomy in practice</strong></li>
<li><strong>AI agent autonomously publishes a hit piece</strong></li>
<li><strong>dmux</strong> — Multiplexes AI coding agents in parallel</li>
<li><strong>New benchmarks for agent memory and reliability</strong></li>
</ul>
<hr />
<h2 id="heading-key-themes-this-week">🧠 Key Themes This Week</h2>
<ol>
<li><strong>Computer use agents are maturing fast</strong> — Sonnet 4.6's OSWorld leap signals GUI automation is becoming production-ready.</li>
<li><strong>Security + AI agents</strong> — EVMBench highlights both the promise and the gaps in AI-driven smart contract auditing.</li>
<li><strong>Cost-efficiency at scale</strong> — Competitive pricing and token savings (Cloudflare's 99.9% claim) are central themes as agentic deployments scale.</li>
<li><strong>Parallelism &amp; memory</strong> — New tools (dmux) and benchmarks focus on running multiple agents reliably and with better memory.</li>
</ol>
<hr />
<h2 id="heading-papers-mentioned">📄 Papers Mentioned</h2>
<ul>
<li>EVMBench is referenced via a blog post — no direct arXiv link was accessible from the paywalled content.</li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a95870b/rnsjepyf25gcoB-hgTvlY_2QxLnvgv.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a95870b/VtCrODVCy9-EE-XeZdMpf_23vNvNMq.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Top AI Papers of the Week]]></title><description><![CDATA[Read the original article
Top AI Papers of the Week (February 16–22, 2026)
From Elvis Saravia's AI Newsletter

Overview
This week's roundup covers 10 significant AI research papers spanning agent delegation, social dynamics, memory management, person...]]></description><link>https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:25:21 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a958703/lGM3_h7D2IiFVxfib62D8_UYZTcY2X.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-c98">Read the original article</a></p>
<h1 id="heading-top-ai-papers-of-the-week-february-1622-2026">Top AI Papers of the Week (February 16–22, 2026)</h1>
<p><em>From Elvis Saravia's AI Newsletter</em></p>
<hr />
<h2 id="heading-overview">Overview</h2>
<p>This week's roundup covers 10 significant AI research papers spanning agent delegation, social dynamics, memory management, personalization, benchmarking, and reasoning efficiency. A recurring theme is the <strong>gap between what AI systems appear capable of and what they can reliably do in real-world, multi-session, or complex agentic settings</strong>.</p>
<hr />
<h2 id="heading-1-intelligent-ai-delegation-google-deepmind">1. Intelligent AI Delegation — Google DeepMind</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>Google DeepMind introduces a comprehensive framework treating delegation not as a simple task handoff, but as a <strong>sequence of decisions</strong>: whether to delegate, how to instruct, and how to verify and integrate outputs.</p>
<ul>
<li><strong>Adaptive delegation</strong>: Dynamic, real-time adaptation rather than static heuristics, with resilient failure management.</li>
<li><strong>Trust calibration</strong>: Formal trust models accounting for capability uncertainty, task complexity, and historical performance — preventing both over- and under-delegation.</li>
<li><strong>Verification protocols</strong>: Confidence-aware acceptance criteria and fallback mechanisms before AI outputs are integrated.</li>
<li><strong>Multi-agent chains</strong>: Extends to AI-to-AI delegation networks with accountability tracking and authority propagation.</li>
</ul>
<p><strong>Takeaway</strong>: Production AI deployments need structured delegation frameworks — blind trust in agent outputs compounds errors at scale.</p>
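<p>A minimal sketch of the delegate-instruct-verify loop the framework describes, with an illustrative trust model based on historical success rates; the interfaces and the 0.7 threshold are assumptions, not DeepMind's specification:</p>
<pre><code class="lang-python"># Sketch of the delegate-instruct-verify loop; the trust model and the
# 0.7 threshold are illustrative assumptions, not DeepMind's spec.
from dataclasses import dataclass, field

@dataclass
class TrustModel:
    history: list = field(default_factory=list)   # (task_type, succeeded)

    def score(self, task_type: str) -> float:
        past = [ok for t, ok in self.history if t == task_type]
        return sum(past) / len(past) if past else 0.5   # neutral prior

def delegate(task, task_type, agent, verify, do_it_yourself,
             trust: TrustModel, threshold: float = 0.7):
    # 1. whether to delegate: calibrate against historical performance
    if trust.score(task_type) &lt; threshold:
        return do_it_yourself(task)
    # 2. how to instruct: hand the task off to the agent
    result = agent(task)
    # 3. verify before integrating, and record the outcome for next time
    ok = verify(task, result)
    trust.history.append((task_type, ok))
    return result if ok else do_it_yourself(task)   # fallback mechanism
</code></pre>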
<hr />
<h2 id="heading-2-emergent-socialization-in-ai-agent-society-moltbook-study">2. Emergent Socialization in AI Agent Society — Moltbook Study</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>Researchers studied <strong>Moltbook</strong>, the largest AI-only social network with millions of LLM-driven agents, finding that scale and interaction density alone do <strong>not</strong> produce meaningful social dynamics.</p>
<ul>
<li>Global semantic content stabilises quickly, but individual agents maintain diversity without converging.</li>
<li>Agents show <strong>strong individual inertia</strong> and minimal adaptive response to interaction partners.</li>
<li>No stable social structures, consensus, or genuine social learning emerged.</li>
<li><strong>Key conclusion</strong>: Persistent shared memory is a prerequisite for real social dynamics — without it, population size is irrelevant.</li>
</ul>
<p><strong>Takeaway</strong>: Current LLM architectures lack the mechanisms for genuine social learning; memory architecture is more important than scale.</p>
<hr />
<h2 id="heading-3-lossless-context-management-lcm">3. Lossless Context Management (LCM)</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>LCM is a deterministic architecture for LLM memory, tested via the coding agent <strong>Volt</strong> on the OOLONG benchmark against Claude Code (Opus 4.6).</p>
<ul>
<li><strong>Recursive context compression</strong>: Older messages compacted into a hierarchical summary DAG with lossless pointers — no information is lost.</li>
<li><strong>Recursive task partitioning</strong>: Engine-managed parallel primitives (LLM-Map) replace model-written loops for deterministic execution.</li>
<li><strong>Three-level escalation</strong>: Summary nodes → compact file references → guaranteed convergence mechanism.</li>
<li><strong>Results</strong>: Volt+LCM achieves a +29.2-point average improvement vs. +24.7 for Claude Code; the advantage grows to +51.3 vs. +47.0 at 1M tokens.</li>
</ul>
<p><strong>Takeaway</strong>: Deterministic context management outperforms native file-system access at extreme context lengths — critical for long-horizon coding agents.</p>
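<p>The "lossless pointer" idea can be sketched in a few lines: the active context replaces old turns with a summary node that still references the verbatim originals, so nothing becomes unrecoverable. This illustrates only the principle; LCM's hierarchical DAG and three-level escalation are far richer:</p>
<pre><code class="lang-python"># Principle-only sketch: the active context swaps old turns for a summary
# node that keeps pointers to the verbatim originals.
class ContextStore:
    def __init__(self):
        self.messages = []   # verbatim log, never deleted
        self.active = []     # (label, message_ids) nodes the model sees

    def append(self, text: str):
        self.messages.append(text)
        self.active.append(("raw", [len(self.messages) - 1]))

    def compact(self, keep_last: int = 4):
        """Fold all but the newest turns into one summary node that keeps
        pointers back to the originals (lossless by construction)."""
        old, recent = self.active[:-keep_last], self.active[-keep_last:]
        ids = [i for _, node_ids in old for i in node_ids]
        if ids:
            gist = self.messages[ids[0]][:40] + "..."   # toy summariser
            self.active = [("summary: " + gist, ids)] + recent

    def expand(self, node):
        """Any node re-expands to its original messages."""
        return [self.messages[i] for i in node[1]]

store = ContextStore()
for i in range(8):
    store.append(f"turn {i}: something happened")
store.compact()
print(store.active[0][0])                 # the summary label
print(store.expand(store.active[0]))      # originals remain recoverable
</code></pre>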
<hr />
<h2 id="heading-4-glm-5-zhipu-ai">4. GLM-5 — Zhipu AI</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>GLM-5 is a foundation model targeting <strong>agentic software engineering</strong> rather than isolated code generation.</p>
<ul>
<li><strong>Asynchronous agent RL</strong>: Decouples trajectory generation from policy optimisation, enabling parallel scaling and faster experimentation.</li>
<li><strong>DSA (Distributed Sparse Attention)</strong>: Reduces long-context computational overhead without quality loss.</li>
<li><strong>Agentic focus</strong>: Handles project-level context, multi-file edits, and iterative development cycles.</li>
<li>Strong benchmark results on end-to-end tasks including specification understanding, implementation, testing, and debugging.</li>
</ul>
<p><strong>Takeaway</strong>: The shift from vibe coding to agentic engineering requires models designed for full project-level context, not just completion tasks.</p>
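<p>The asynchronous decoupling is the part that sketches cleanly: actors push trajectories into a queue while a learner consumes batches, so rollout generation and policy updates overlap instead of alternating. The snippet below illustrates only that scheduling pattern, not GLM-5's trainer:</p>
<pre><code class="lang-python"># Scheduling-pattern sketch only (not GLM-5's trainer): actors push
# trajectories into a queue while the learner consumes batches, so
# rollout generation and policy optimisation overlap.
import queue
import threading
import time

traj_q: queue.Queue = queue.Queue(maxsize=64)

def actor(actor_id: int, rollouts: int = 5) -> None:
    for step in range(rollouts):
        time.sleep(0.01)                  # stands in for slow generation
        traj_q.put((actor_id, step, "trajectory"))

def learner(total: int, batch_size: int = 4) -> None:
    consumed = 0
    while consumed &lt; total:
        batch = [traj_q.get() for _ in range(batch_size)]
        consumed += len(batch)            # stands in for a policy update

threads = [threading.Thread(target=actor, args=(i,)) for i in range(4)]
threads.append(threading.Thread(target=learner, args=(20,)))
for t in threads:
    t.start()
for t in threads:
    t.join()
print("generation and optimisation ran in parallel")
</code></pre>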
<hr />
<h2 id="heading-5-memoryarena">5. MemoryArena</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>MemoryArena benchmarks whether agents can <strong>use</strong> retrieved memory to take correct actions across multiple interconnected sessions — not just recall it.</p>
<ul>
<li>Covers web navigation, constrained planning, information retrieval, and logical reasoning with interdependent sessions.</li>
<li>Models near-saturating existing benchmarks (e.g., LoCoMo) <strong>perform poorly</strong> on MemoryArena.</li>
<li>Exposes a critical gap: retrieval accuracy ≠ actionable memory use.</li>
</ul>
<p><strong>Takeaway</strong>: Existing memory benchmarks overestimate real agent capability. Developers should evaluate memory systems on downstream decision quality, not just retrieval.</p>
<hr />
<h2 id="heading-6-maple">6. MAPLE</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>MAPLE proposes decomposing memory, learning, and personalization into <strong>three specialised sub-agents</strong>, each operating at different timescales.</p>
<ul>
<li><strong>Memory</strong>: Storage and retrieval infrastructure.</li>
<li><strong>Learning</strong>: Asynchronous offline distillation of interaction history — avoids flooding the active context window.</li>
<li><strong>Personalization</strong>: Context-budget-aware injection of the most relevant learned knowledge in real time.</li>
<li><strong>Results</strong>: +14.6% improvement in personalization scores; trait incorporation increases from 45% to 75% (validated on MAPLE-Personas benchmark).</li>
</ul>
<p><strong>Takeaway</strong>: Treating memory, learning, and personalization as a unified capability is inefficient — specialised sub-agents operating asynchronously deliver substantially better results.</p>
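<p>A toy sketch of the three-timescale split, with invented interfaces: Memory stores raw turns, Learning distills them (asynchronously offline in the paper, inline here for brevity), and Personalization injects only what fits a live context budget:</p>
<pre><code class="lang-python"># Toy three-timescale split with invented interfaces. In the paper the
# Learning agent runs asynchronously offline; here it runs inline.
class Memory:
    def __init__(self):
        self.log = []
    def store(self, turn: dict):
        self.log.append(turn)

class Learning:
    def distill(self, log):
        """Offline distillation of interaction history into stable traits."""
        return sorted({t["fact"] for t in log if t.get("fact")})

class Personalization:
    def inject(self, traits, budget_chars: int = 120) -> str:
        """Context-budget-aware: include only what fits the live window."""
        picked, used = [], 0
        for trait in traits:
            if used + len(trait) > budget_chars:
                break
            picked.append(trait)
            used += len(trait)
        return "Known about user: " + "; ".join(picked)

mem = Memory()
mem.store({"text": "hi", "fact": "prefers metric units"})
mem.store({"text": "thanks", "fact": "based in Berlin"})
print(Personalization().inject(Learning().distill(mem.log)))
</code></pre>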
<hr />
<h2 id="heading-7-skillsbench">7. SkillsBench</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>SkillsBench evaluates whether LLM agents can <strong>generate</strong> their own procedural knowledge across 86 tasks in 11 domains (7,308 trajectories, 7 agent-model configs).</p>
<ul>
<li><strong>Curated skills boost performance</strong>: +16.2pp average pass rate; domain effects range from +4.5pp (Software Engineering) to +51.9pp (Healthcare).</li>
<li><strong>Self-generated skills provide zero benefit</strong>: Models bootstrapping their own skills show no improvement over having no skills at all.</li>
<li><strong>Focused beats comprehensive</strong>: 2–3 focused modules outperform broad documentation.</li>
<li><strong>Smaller models close the gap</strong>: Well-curated skills allow smaller models to match larger models without skills — major cost implications.</li>
</ul>
<p><strong>Takeaway</strong>: Self-improving agent architectures that assume models can generate their own procedural knowledge are fundamentally flawed based on current evidence.</p>
<hr />
<h2 id="heading-8-longcli-bench">8. LongCLI-Bench</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>Benchmarks AI agents on complex, extended CLI tasks across 20 demanding scenarios (initial development, feature expansion, error resolution, optimisation).</p>
<ul>
<li>Leading agents succeed <strong>less than 20% of the time</strong>.</li>
<li>Most failures occur <strong>early</strong> in task execution.</li>
<li><strong>Human-agent collaboration</strong> (plan injection + interactive guidance) yields far greater improvements than automated self-correction alone.</li>
</ul>
<p><strong>Takeaway</strong>: CLI-based agentic tasks remain largely unsolved; human-in-the-loop guidance is more effective than autonomous self-repair.</p>
<hr />
<h2 id="heading-9-cogrouter">9. CogRouter</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>CogRouter enables <strong>adaptive reasoning depth</strong> by dynamically selecting from four hierarchical cognitive levels at each step — from instinctive responses to strategic planning.</p>
<ul>
<li>Uses confidence-aware advantage reweighting during training.</li>
<li><strong>Qwen2.5-7B + CogRouter</strong>: 82.3% success rate on agentic benchmarks, outperforming larger models while consuming fewer tokens by skipping heavy reasoning on routine steps.</li>
</ul>
<p><strong>Takeaway</strong>: Not every step needs deep reasoning — dynamic cognitive routing delivers better performance and lower cost simultaneously.</p>
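<p>As a toy illustration of hierarchical cognitive routing: CogRouter learns this policy; here it is reduced to a hand-set confidence threshold rule over four named levels (the thresholds are invented):</p>
<pre><code class="lang-python"># Toy version of hierarchical cognitive routing: CogRouter learns this
# policy; here it is reduced to hand-set confidence thresholds (invented).
LEVELS = ["instinctive", "deliberate", "reflective", "strategic"]

def route(step_confidence: float) -> str:
    """Routine, high-confidence steps skip heavy reasoning entirely;
    only uncertain steps escalate to deeper, costlier levels."""
    if step_confidence >= 0.9:
        return LEVELS[0]
    if step_confidence >= 0.7:
        return LEVELS[1]
    if step_confidence >= 0.5:
        return LEVELS[2]
    return LEVELS[3]

for conf in (0.95, 0.80, 0.55, 0.20):
    print(f"confidence {conf:.2f} -> {route(conf)}")
</code></pre>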
<hr />
<h2 id="heading-10-team-of-thoughts">10. Team of Thoughts</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>A multi-agent framework for efficient test-time scaling through orchestrated tool calling, using a calibrated orchestrator to coordinate agents with different capabilities.</p>
<ul>
<li>Agents perform self-assessment; orchestrator identifies superior coordination models.</li>
<li><strong>Results</strong>: 96.67% on AIME24, 72.53% on LiveCodeBench — substantially exceeding homogeneous baselines.</li>
</ul>
<p><strong>Takeaway</strong>: Heterogeneous agent orchestration with calibrated coordination dramatically outperforms teams of identical agents on hard reasoning and coding tasks.</p>
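<p>One way to picture calibrated orchestration, under assumptions not taken from the paper: each agent self-assesses, the orchestrator discounts that claim by the agent's historical accuracy, and the best-adjusted candidate gets the task:</p>
<pre><code class="lang-python"># Toy calibrated orchestration (scoring rule is an assumption): discount
# each agent's self-assessment by its historical accuracy, then assign
# the task to the best-adjusted candidate.
def calibrated_pick(task, agents):
    best, best_score = None, -1.0
    for a in agents:
        claimed = a["self_assess"](task)        # self-reported fit in [0, 1]
        hits, total = a["record"]               # historical outcomes
        calibration = hits / total if total else 0.5
        score = claimed * calibration           # penalise overclaiming
        if score > best_score:
            best, best_score = a, score
    return best

agents = [
    {"name": "prover", "self_assess": lambda t: 0.9, "record": (3, 10)},
    {"name": "coder",  "self_assess": lambda t: 0.6, "record": (9, 10)},
]
print(calibrated_pick("hard coding task", agents)["name"])   # -> coder
</code></pre>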
<hr />
<h2 id="heading-key-cross-cutting-themes">Key Cross-Cutting Themes</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Theme</td><td>Papers</td></tr>
</thead>
<tbody>
<tr>
<td>Memory architecture is foundational</td><td>LCM, Moltbook, MAPLE, MemoryArena</td></tr>
<tr>
<td>Benchmarks overestimate real capability</td><td>MemoryArena, SkillsBench, LongCLI-Bench</td></tr>
<tr>
<td>Specialisation beats monolithic design</td><td>MAPLE, CogRouter, Team of Thoughts</td></tr>
<tr>
<td>Human oversight still critical</td><td>Intelligent Delegation, LongCLI-Bench</td></tr>
<tr>
<td>Smaller models + good tooling = competitive</td><td>SkillsBench, CogRouter</td></tr>
</tbody>
</table>
</div><hr />
<p><em>Note: Arxiv links above are placeholders — exact paper URLs were not included in the newsletter. Check <a target="_blank" href="https://arxiv.org">arxiv.org</a> or the original newsletter at <a target="_blank" href="https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-c98">nlp.elvissaravia.com</a> for direct paper links.</em></p>
<p><img src="https://v3b.fal.media/files/b/0a958703/dmsABgz1Ppo0Gm8nT_fnR_SjI3K6gq.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a958703/lGM3_h7D2IiFVxfib62D8_UYZTcY2X.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Does AGENTS.md Actually Help Coding Agents?]]></title><description><![CDATA[Read the original article
Does AGENTS.md Actually Help Coding Agents? A New Study Has Answers
Summary of Elvis Saravia's AI Newsletter, Feb 26, 2026

Main Thesis
Developers widely assume that repository-level context files — CLAUDE.md, AGENTS.md, CON...]]></description><link>https://rzem.guru/does-agents-actually-help-coding-1</link><guid isPermaLink="true">https://rzem.guru/does-agents-actually-help-coding-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:23:41 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586fb/_n0Ebr2DVn-GtMOTJShwj_okbVRfwA.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/does-agentsmd-actually-help-coding">Read the original article</a></p>
<h2 id="heading-does-agentsmd-actually-help-coding-agents-a-new-study-has-answers">Does AGENTS.md Actually Help Coding Agents? A New Study Has Answers</h2>
<p><em>Summary of Elvis Saravia's AI Newsletter, Feb 26, 2026</em></p>
<hr />
<h3 id="heading-main-thesis">Main Thesis</h3>
<p>Developers widely assume that repository-level context files — <code>CLAUDE.md</code>, <code>AGENTS.md</code>, <code>CONTRIBUTING.md</code> — make coding agents meaningfully better. A new paper from <strong>ETH Zurich's SRI Lab</strong> puts that assumption to a rigorous empirical test, and the results are more nuanced than most practitioners expect.</p>
<p><a target="_blank" href="https://arxiv.org/abs/2502.18822">Paper: Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?</a></p>
<hr />
<h3 id="heading-background-the-problem">Background: The Problem</h3>
<ul>
<li>Context files have proliferated alongside coding agents, but <strong>adoption has outpaced evaluation</strong> — developers write them, agents read them, and everyone assumed the relationship was positive.</li>
<li>Standard benchmarks like <strong>SWE-bench</strong> mostly cover popular repositories, which tend <em>not</em> to have context files, making them a poor testbed for this question.</li>
</ul>
<hr />
<h3 id="heading-the-new-benchmark-agentbench">The New Benchmark: AGENTbench</h3>
<ul>
<li>The paper introduces <strong>AGENTbench</strong>: 138 task instances from <strong>12 less-popular Python repositories</strong>, all of which already have <strong>developer-written context files</strong>.</li>
<li>Context files in AGENTbench average <strong>641 words across 9.7 sections</strong> — detailed, real-world guidance, not trivial one-liners.</li>
<li>Three agents were tested: <strong>Claude Code (Sonnet-4.5)</strong>, <strong>Codex (GPT-5.2 / GPT-5.1 mini)</strong>, and <strong>Qwen Code (Qwen3-30b-coder)</strong>.</li>
<li>Each agent ran tasks under three conditions: <strong>no context file</strong>, <strong>LLM-generated context file</strong>, and <strong>developer-written context file</strong>.</li>
</ul>
<hr />
<h3 id="heading-key-findings">Key Findings</h3>
<h4 id="heading-llm-generated-context-files-hurt-performance">🔴 LLM-Generated Context Files Hurt Performance</h4>
<ul>
<li>On <strong>SWE-bench Lite</strong>: LLM-generated files drop task success by <strong>~0.5%</strong>.</li>
<li>On <strong>AGENTbench</strong>: the drop is <strong>~2%</strong>.</li>
<li>Across all conditions, context files add <strong>14–22% more reasoning tokens</strong> and <strong>2–4 additional steps</strong> per task — regardless of whether they help.</li>
</ul>
<h4 id="heading-human-written-context-files-help-on-their-own-turf">🟢 Human-Written Context Files Help (On Their Own Turf)</h4>
<ul>
<li>Human-written files produce a <strong>~4% improvement</strong> over no context on average across both benchmarks.</li>
<li>The gain is real, but it is benchmark- and file-quality-dependent.</li>
</ul>
<h4 id="heading-the-instruction-following-paradox">⚡ The Instruction-Following Paradox</h4>
<ul>
<li>Agents follow context file instructions faithfully: when <code>uv</code> is mentioned, agents invoke it <strong>1.6 times per instance</strong> on average, vs. fewer than 0.01 times when it isn't mentioned.</li>
<li>But <strong>more instruction-following ≠ better outcomes</strong>. Agents explore more, run more tests, traverse more files — without meaningfully reaching the right code faster.</li>
<li><em>"A map of the whole city doesn't tell you which building to walk into."</em></li>
</ul>
<h4 id="heading-why-human-files-win-the-redundancy-problem">🔍 Why Human Files Win: The Redundancy Problem</h4>
<ul>
<li>LLM-generated files tend to <strong>restate information already in READMEs and docs</strong> — additive noise, not additive value.</li>
<li>When existing documentation was <em>removed</em> before generating context files, LLM-generated files improved by <strong>2.7%</strong> and actually outperformed human-written ones.</li>
<li>Human-written files capture <strong>non-obvious, non-redundant information</strong>: quirky CI setups, non-default tooling choices, undocumented conventions.</li>
</ul>
<hr />
<h3 id="heading-limitations">Limitations</h3>
<ul>
<li>Study limited to <strong>Python repositories</strong> — generalisability to TypeScript, Rust, multi-language codebases is unknown.</li>
<li>Only measures <strong>issue resolution success</strong>, not security, consistency, or convention adherence.</li>
<li>No longitudinal data on how context file quality or agent utilisation evolves over time.</li>
</ul>
<hr />
<h3 id="heading-practical-takeaways">Practical Takeaways</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Principle</td><td>Detail</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Write for the gap</strong></td><td>Only encode what the repo doesn't already explain — non-default tool choices, unusual test configs, hidden constraints.</td></tr>
<tr>
<td><strong>Avoid restating the README</strong></td><td>A <code>CLAUDE.md</code> that duplicates existing docs likely hurts more than it helps.</td></tr>
<tr>
<td><strong>Respect the cost floor</strong></td><td>Every context file adds ~20% to inference cost. High-volume pipelines should weigh this carefully.</td></tr>
<tr>
<td><strong>Fix LLM-generated files</strong></td><td>Auto-generators should be designed to explicitly <em>avoid</em> restating existing docs and focus on extracting non-obvious conventions.</td></tr>
<tr>
<td><strong>Keep files minimal and specific</strong></td><td>Less is more — specificity beats comprehensiveness.</td></tr>
</tbody>
</table>
</div><hr />
<h3 id="heading-bottom-line">Bottom Line</h3>
<p>Context files are <strong>not magic, but not useless</strong>. Human-written, specific, non-redundant files improve agent performance. Auto-generated files that recycle existing documentation actively reduce it. In both cases, the mechanism is the same: agents follow instructions, and outcome quality depends entirely on instruction quality. Getting this balance right is both a <strong>context file design problem</strong> and a <strong>model training problem</strong>.</p>
<hr />
<h3 id="heading-resources">Resources</h3>
<ul>
<li><a target="_blank" href="https://arxiv.org/abs/2502.18822">Paper: Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?</a></li>
<li><a target="_blank" href="https://github.com/eth-sri/agentbench">AGENTbench Dataset</a></li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a9586fb/E1DPK7TLS7tUkZ8TDNuuM_AP65RvhZ.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586fb/_n0Ebr2DVn-GtMOTJShwj_okbVRfwA.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents Weekly: Evaluating Agents]]></title><description><![CDATA[Read the original article
AI Agents Weekly: Evaluating AGENTS.md & More
From Elvis Saravia's AI Newsletter — February 28, 2026
Main Thesis
This issue covers a wide range of AI agent developments, with the headline story challenging a widely adopted p...]]></description><link>https://rzem.guru/ai-agents-weekly-evaluating-agents-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-weekly-evaluating-agents-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:22:28 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586f4/TZ7BQk_jMV6RC7vGfgXiU_SJYWxkge.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/ai-agents-weekly-evaluating-agentsmd">Read the original article</a></p>
<h2 id="heading-ai-agents-weekly-evaluating-agentsmd-amp-more">AI Agents Weekly: Evaluating AGENTS.md &amp; More</h2>
<p><em>From Elvis Saravia's AI Newsletter — February 28, 2026</em></p>
<h3 id="heading-main-thesis">Main Thesis</h3>
<p>This issue covers a wide range of AI agent developments, with the headline story challenging a widely adopted practice: using repository-level context files (like <code>AGENTS.md</code> or <code>CLAUDE.md</code>) to guide coding agents. Counterintuitively, research shows these files may be doing more harm than good.</p>
<hr />
<h3 id="heading-key-finding-agentsmd-files-hurt-coding-agent-performance">🔬 Key Finding: AGENTS.md Files Hurt Coding Agent Performance</h3>
<p>Researchers from <strong>UIUC and Microsoft Research</strong> evaluated whether repository-level context files actually improve coding agent performance on SWE-bench benchmarks.</p>
<p><strong>Surprising results:</strong></p>
<ul>
<li>❌ <strong>Lower success rates</strong> — Both LLM-generated and human-written context files caused agents to solve <em>fewer</em> tasks compared to agents given <em>no</em> repository context at all.</li>
<li>💸 <strong>Higher inference costs</strong> — Context files increased inference costs by <strong>over 20%</strong>.</li>
<li>🔍 <strong>Broader but less effective exploration</strong> — Agents with context files explored more (more testing, more file traversal), but the additional constraints made tasks <em>harder</em>, not easier.</li>
<li>✅ <strong>Minimal is better</strong> — The authors recommend context files describe only <strong>minimal requirements</strong> rather than comprehensive specifications, as unnecessary constraints actively hurt performance.</li>
</ul>
<p><strong>Practical takeaway:</strong> Developers should rethink how they write <code>AGENTS.md</code>, <code>CLAUDE.md</code>, and similar files — focus on essential guardrails only, not exhaustive instructions.</p>
<p><a target="_blank" href="https://arxiv.org/search/?searchtype=all&amp;query=AGENTS.md+coding+agents+context+files">Paper</a></p>
<hr />
<h3 id="heading-other-stories-covered-paywalled">📰 Other Stories Covered (Paywalled)</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Story</td><td>Summary</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Perplexity Computer</strong></td><td>Perplexity launches a computer-use agent for end-to-end task automation</td></tr>
<tr>
<td><strong>Google Nano Banana 2</strong></td><td>Google releases Nano Banana 2 model for free</td></tr>
<tr>
<td><strong>Sakana AI Doc-to-LoRA &amp; Text-to-LoRA</strong></td><td>Tools for fine-tuning models directly from documents or text</td></tr>
<tr>
<td><strong>Notion Custom Agents 3.3</strong></td><td>Notion launches custom agent capabilities in version 3.3</td></tr>
<tr>
<td><strong>Nous Research Hermes Agent</strong></td><td>Open-source agent model released by Nous Research</td></tr>
<tr>
<td><strong>GPT-5.3-Codex</strong></td><td>OpenAI makes GPT-5.3-Codex available to all developers</td></tr>
<tr>
<td><strong>Mercury 2</strong></td><td>New reasoning diffusion LLM ships from Mercury</td></tr>
<tr>
<td><strong>Qwen 3.5 Medium Series</strong></td><td>Alibaba drops a new medium-sized Qwen model series</td></tr>
<tr>
<td><strong>Claude Code Auto-Memory</strong></td><td>Anthropic ships auto-memory across sessions for Claude Code</td></tr>
<tr>
<td><strong>RoguePilot</strong></td><td>Security vulnerability exposed in GitHub Copilot</td></tr>
<tr>
<td><strong>Vercel Chat SDK</strong></td><td>Vercel open-sources a Chat SDK for multi-platform bot development</td></tr>
</tbody>
</table>
</div><hr />
<h3 id="heading-practical-takeaways">💡 Practical Takeaways</h3>
<ol>
<li><strong>Less is more</strong> when writing agent context files — avoid over-specifying agent behaviour.</li>
<li><strong>Benchmark your context files</strong> — don't assume that more instructions equals better agent performance.</li>
<li>The AI tooling ecosystem is rapidly expanding across coding, browser automation, fine-tuning, and memory management.</li>
<li>Security remains a concern as tools like RoguePilot highlight vulnerabilities in popular AI coding assistants.</li>
</ol>
<p><img src="https://v3b.fal.media/files/b/0a9586f4/wXwyCrinEO6do7hqCMi1x_U4pPA2os.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586f4/TZ7BQk_jMV6RC7vGfgXiU_SJYWxkge.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Top AI Papers of the Week]]></title><description><![CDATA[Read the original article
Top AI Papers of the Week (Feb 23 – Mar 1, 2026)
A roundup of the most impactful AI research papers from Elvis Saravia's weekly newsletter, spanning reasoning efficiency, agent infrastructure, algorithm discovery, and more.
...]]></description><link>https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:21:40 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586ef/hLwc0MczWA-d8oJdLIJ9k_KODbgHY5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-339">Read the original article</a></p>
<h1 id="heading-top-ai-papers-of-the-week-feb-23-mar-1-2026">Top AI Papers of the Week (Feb 23 – Mar 1, 2026)</h1>
<p>A roundup of the most impactful AI research papers from Elvis Saravia's weekly newsletter, spanning reasoning efficiency, agent infrastructure, algorithm discovery, and more.</p>
<hr />
<h2 id="heading-1-deep-thinking-tokens">1. Deep-Thinking Tokens</h2>
<p>Google researchers challenge the assumption that longer outputs mean better reasoning. They introduce <strong>deep-thinking tokens</strong> — tokens where internal model predictions shift significantly across layers before stabilising — measured via Jensen-Shannon divergence between intermediate and final layer distributions. A token qualifies as "deep-thinking" if its prediction only stabilises in the final 15% of layers.</p>
<ul>
<li>Raw token count <strong>negatively</strong> correlates with accuracy (r = -0.59)</li>
<li>Deep-thinking ratio shows a <strong>positive</strong> correlation (r = 0.683)</li>
<li><strong>Think@n</strong> test-time scaling strategy uses high deep-thinking ratio samples to match/exceed self-consistency performance while cutting inference costs ~50%</li>
<li>Validated on AIME 24/25, HMMT 25, GPQA-diamond with GPT-OSS, DeepSeek-R1, Qwen3</li>
</ul>
<p><strong>Takeaway:</strong> Generate tokens that require deeper internal computation, not just more tokens.</p>
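<p>The measurement itself is easy to prototype. Below is a minimal sketch, assuming you already have a per-token array of intermediate-layer prediction distributions (e.g. from a logit lens); the 0.1 divergence threshold is an assumption, while the final-15% rule comes from the paper:</p>
<pre><code class="language-python">import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def is_deep_thinking(layer_dists, threshold=0.1, final_frac=0.15):
    """Flag a token as deep-thinking if its per-layer prediction only
    stabilises (low divergence from the final layer) within the last
    final_frac of layers. layer_dists has shape [n_layers, vocab]."""
    final = layer_dists[-1]
    divs = np.array([js_divergence(d, final) for d in layer_dists])
    stable = divs &lt; threshold              # layers already agreeing with the output
    first_stable = int(np.argmax(stable))  # index of the first stable layer
    return first_stable &gt;= int((1 - final_frac) * len(divs))
</code></pre>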
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-2-codified-context">2. Codified Context</h2>
<p>Single-file AGENTS.md manifests don't scale to large codebases. This paper presents a <strong>three-component infrastructure</strong> built for a 108,000-line C# distributed system, evaluated across 283 development sessions:</p>
<ul>
<li><strong>Hot-memory constitution:</strong> A living document encoding conventions and orchestration protocols consulted at session start</li>
<li><strong>19 domain-expert agents:</strong> Each owns a bounded codebase domain with its own context slice</li>
<li><strong>Cold-memory knowledge base:</strong> 34 on-demand specification documents retrieved only when needed</li>
</ul>
<p><strong>Takeaway:</strong> Tiered context management prevents agents from forgetting conventions and losing coherence on long-running projects.</p>
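<p>A minimal sketch of that tiered layout, assuming hot memory lives in one always-loaded file and cold memory in on-demand spec documents; the file names and class are illustrative, not the paper's implementation:</p>
<pre><code class="language-python">from pathlib import Path

class TieredContext:
    def __init__(self, root: Path):
        # Hot memory: the constitution, always loaded at session start
        self.constitution = (root / "CONSTITUTION.md").read_text()
        # Cold memory: specification docs, retrieved only on demand
        self.cold_store = {p.stem: p for p in (root / "specs").glob("*.md")}

    def session_preamble(self) -&gt; str:
        """Consulted once when a session begins."""
        return self.constitution

    def fetch_spec(self, topic: str) -&gt; str:
        """Cold-memory retrieval: load a spec only when an agent asks."""
        return self.cold_store[topic].read_text()

# Each of the 19 domain agents would receive session_preamble() plus only
# its own bounded slice of the codebase, pulling specs via fetch_spec().
</code></pre>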
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-3-discovering-multi-agent-learning-algorithms-with-llms">3. Discovering Multi-Agent Learning Algorithms with LLMs</h2>
<p>Google DeepMind uses <strong>AlphaEvolve</strong>, an evolutionary coding agent powered by LLMs, to automatically discover new multi-agent learning algorithms for imperfect-information games.</p>
<ul>
<li><strong>VAD-CFR:</strong> A novel iterative regret minimisation variant with volatility-sensitive discounting and consistency-enforced optimism — outperforms Discounted Predictive CFR+</li>
<li><strong>SHOR-PSRO:</strong> A population-based training variant blending Optimistic Regret Matching with temperature-controlled strategy distributions</li>
<li>Algorithms contain novel design choices human researchers hadn't previously considered</li>
</ul>
<p><strong>Takeaway:</strong> LLMs can serve as algorithmic designers, not just code generators, with potential applications in optimisation, scheduling, and resource allocation.</p>
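<p>For orientation, both discovered algorithms extend the classic regret-matching primitive, which fits in a few lines; the variants' volatility-sensitive discounting and optimism terms are what AlphaEvolve layered on top:</p>
<pre><code class="language-python">import numpy as np

def regret_matching(cumulative_regrets):
    """Classic regret matching: play each action in proportion to its
    positive cumulative regret; fall back to uniform if none is positive."""
    positive = np.maximum(cumulative_regrets, 0.0)
    total = positive.sum()
    if total &gt; 0:
        return positive / total
    return np.full(len(cumulative_regrets), 1.0 / len(cumulative_regrets))

# e.g. regrets [2.0, -1.0, 1.0] -&gt; strategy [2/3, 0, 1/3]
print(regret_matching(np.array([2.0, -1.0, 1.0])))
</code></pre>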
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-4-evaluating-agentsmd">4. Evaluating AGENTS.md</h2>
<p>This research evaluates whether AGENTS.md files actually improve AI coding agent performance. Testing Claude Code (Sonnet-4.5), Codex (GPT-5.2 &amp; GPT-5.1 mini), and Qwen Code (Qwen3-30b-coder), the findings are counterintuitive:</p>
<ul>
<li>Human-written AGENTS.md: modest <strong>+4%</strong> improvement in some cases</li>
<li>LLM-generated AGENTS.md: <strong>-2%</strong> performance hit</li>
<li>Both consistently <strong>increase inference cost by 20%+</strong></li>
<li>Context files cause agents to explore more code paths but make tasks harder by introducing noise</li>
</ul>
<p><strong>Takeaway:</strong> Keep AGENTS.md minimal and focused on critical constraints only. Information density matters more than comprehensiveness.</p>
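<p>Benchmarking this on your own repository is straightforward in principle: run the same task set with and without the context file and compare solve rate and token spend. A sketch, with the agent call left as a stub since it depends on your tooling:</p>
<pre><code class="language-python">from statistics import mean

def run_agent(task, context_file=None):
    """Returns (solved, tokens_used); stub this with your agent CLI."""
    raise NotImplementedError

def ab_test(tasks, context_file):
    for label, ctx in [("without", None), ("with", context_file)]:
        results = [run_agent(t, context_file=ctx) for t in tasks]
        solve_rate = mean(1.0 if ok else 0.0 for ok, _ in results)
        avg_tokens = mean(tok for _, tok in results)
        print(f"{label} context file: {solve_rate:.0%} solved, {avg_tokens:,.0f} tokens")
</code></pre>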
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-5-pahf-personalized-agents-from-human-feedback">5. PAHF — Personalized Agents from Human Feedback</h2>
<p>Meta introduces <strong>PAHF</strong>, a continual agent personalisation framework coupling explicit per-user memory with proactive and reactive feedback mechanisms.</p>
<ul>
<li><strong>Three-step loop:</strong> Pre-action clarification → grounding in retrieved preferences → post-action feedback integration</li>
<li>Enables continual learning from live interactions without retraining</li>
<li>Two novel benchmarks in embodied manipulation and online shopping measuring preference learning and adaptation</li>
<li>Outperforms no-memory and single-channel baselines; reduces initial personalisation error and adapts rapidly to persona shifts</li>
</ul>
<p><strong>Takeaway:</strong> Combining persistent memory with dual feedback channels is essential for practical agent personalisation.</p>
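<p>The loop itself is simple to sketch. Below, memory is a plain per-user dict and the agent calls are injected as functions, since Meta's actual interfaces are not described in the summary:</p>
<pre><code class="language-python">memory = {}   # user -&gt; recorded preferences and feedback

def pahf_step(user, task, ask, act, collect_feedback):
    prefs = memory.setdefault(user, [])
    # 1. Pre-action clarification: ask up front when nothing is known yet
    if not prefs:
        prefs.append(ask(task))
    # 2. Grounding: condition the action on retrieved preferences
    action = act(task, prefs)
    # 3. Post-action feedback: fold the reaction back into memory
    prefs.append(collect_feedback(action))   # continual learning, no retraining
    return action
</code></pre>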
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-6-doc-to-lora">6. Doc-to-LoRA</h2>
<p>Sakana AI introduces <strong>Doc-to-LoRA (D2L)</strong>, a lightweight hypernetwork that meta-learns to compress long documents into LoRA adapters in a single forward pass.</p>
<ul>
<li>Converts documents into parameter-space representations, eliminating expensive re-processing</li>
<li>Achieves near-perfect zero-shot accuracy on needle-in-a-haystack tasks at <strong>4x beyond</strong> the target LLM's native context window</li>
<li>Outperforms standard long-context approaches on QA datasets while consuming less memory</li>
<li>Ideal for repeated-query applications: compress once, amortise cost across all queries</li>
</ul>
<p><strong>Takeaway:</strong> Parametric compression can extend context capabilities without architectural changes.</p>
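<p>The amortisation argument is just arithmetic: the document is processed once instead of once per query. With illustrative numbers (not figures from the paper):</p>
<pre><code class="language-python"># Back-of-envelope: re-reading a long document every query vs.
# compressing it once into an adapter. All numbers are assumptions.
doc_tokens    = 200_000   # document length
query_tokens  = 500       # per-query prompt overhead
n_queries     = 50

long_context_cost = n_queries * (doc_tokens + query_tokens)
d2l_cost          = doc_tokens + n_queries * query_tokens  # one compression pass

print(f"long-context:  {long_context_cost:,} tokens processed")
print(f"compress-once: {d2l_cost:,} tokens processed")
# Savings grow linearly with the number of queries against the same document.
</code></pre>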
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-7-agentconductor">7. AgentConductor</h2>
<p><strong>AgentConductor</strong> is a reinforcement learning-enhanced multi-agent system for code generation that dynamically generates interaction topologies based on task characteristics.</p>
<ul>
<li>LLM-based orchestrator builds density-aware layered DAG topologies tailored to problem difficulty</li>
<li>Simple problems → sparse topologies; complex problems → denser collaboration</li>
<li>Outperforms strongest baseline by <strong>up to 14.6% in pass@1</strong> accuracy with 13% density reduction and <strong>68% token cost reduction</strong></li>
<li>Execution feedback refines topologies adaptively when initial solutions fail</li>
</ul>
<p><strong>Takeaway:</strong> Adaptive topology generation eliminates redundant agent communication and dramatically cuts costs.</p>
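<p>The topology idea reduces to one knob: edge density between layers scales with estimated task difficulty. A toy sketch of that knob (the real orchestrator is itself an LLM, not a random sampler):</p>
<pre><code class="language-python">import itertools, random

def layered_dag(n_agents, n_layers, density):
    """Build a layered DAG whose inter-layer edge density is a parameter;
    sparse for easy tasks, dense for hard ones."""
    layers = [list(range(i, n_agents, n_layers)) for i in range(n_layers)]
    edges = []
    for a, b in zip(layers, layers[1:]):
        for u, v in itertools.product(a, b):
            if random.random() &lt; density:
                edges.append((u, v))
    return layers, edges

# easy task -&gt; few edges; hard task -&gt; dense collaboration
print(len(layered_dag(9, 3, 0.2)[1]), len(layered_dag(9, 3, 0.9)[1]))
</code></pre>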
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-8-actionengine">8. ActionEngine</h2>
<p>Georgia Tech and Microsoft Research introduce <strong>ActionEngine</strong>, a training-free framework transforming GUI agents from reactive executors into programmatic planners.</p>
<ul>
<li>Builds state-machine memory through offline exploration</li>
<li>Synthesises executable Python programs for task completion</li>
<li>Achieves <strong>95% success on Reddit tasks</strong> from WebArena with a single LLM call on average</li>
<li><strong>11.8x cost reduction</strong> and <strong>2x latency reduction</strong> vs. vision-only baselines</li>
</ul>
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-9-cot-faithfulness-via-remul">9. CoT Faithfulness via REMUL</h2>
<p><strong>REMUL</strong> is a training approach making chain-of-thought reasoning more faithful and monitorable. A speaker model generates reasoning traces that multiple listener models attempt to follow, with RL rewarding reasoning understandable to other models.</p>
<ul>
<li>Improves three faithfulness metrics while boosting overall accuracy</li>
<li>Produces shorter, more direct reasoning chains</li>
<li>Tested on BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO</li>
</ul>
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-10-learning-to-rewrite-tool-descriptions">10. Learning to Rewrite Tool Descriptions</h2>
<p>Intuit AI Research introduces <strong>Trace-Free+</strong>, a curriculum learning framework that optimises tool descriptions for LLM agents (not humans) without relying on execution traces.</p>
<ul>
<li>Consistent gains on unseen tools and strong cross-domain generalisation</li>
<li>Robust as candidate tool count scales to over 100</li>
<li>Demonstrates that improving tool interfaces complements agent fine-tuning</li>
</ul>
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-key-themes-this-week">Key Themes This Week</h2>
<ul>
<li><strong>Efficiency over verbosity:</strong> Better reasoning comes from deeper computation, not more tokens</li>
<li><strong>Scalable agent infrastructure:</strong> Tiered memory and specialised agents beat monolithic context files</li>
<li><strong>LLMs as designers:</strong> Evolutionary LLM systems can discover novel algorithms autonomously</li>
<li><strong>Context file caveats:</strong> AGENTS.md files can hurt as much as help — keep them lean</li>
<li><strong>Personalisation at scale:</strong> Persistent memory + dual feedback is the blueprint for adaptive agents</li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a9586ef/polnxWg46eUGHmjKYN8_i_Y3CDOGfq.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586ef/hLwc0MczWA-d8oJdLIJ9k_KODbgHY5.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents Weekly: AI Labor Market]]></title><description><![CDATA[Read the original article
AI Agents Weekly: AI Labor Market Impacts & More
From Elvis Saravia's AI Newsletter — March 7, 2026
Main Thesis
This issue covers a broad sweep of AI agent developments, with the headline story being Anthropic's new framewor...]]></description><link>https://rzem.guru/ai-agents-weekly-ai-labor-market-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-weekly-ai-labor-market-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:20:21 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586e6/6qkHScyFkfeCOjhhe8UOF_mxtJf0CX.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/ai-agents-weekly-ai-labor-market">Read the original article</a></p>
<h1 id="heading-ai-agents-weekly-ai-labor-market-impacts-amp-more">AI Agents Weekly: AI Labor Market Impacts &amp; More</h1>
<p><em>From Elvis Saravia's AI Newsletter — March 7, 2026</em></p>
<h2 id="heading-main-thesis">Main Thesis</h2>
<p>This issue covers a broad sweep of AI agent developments, with the headline story being Anthropic's new framework for measuring AI's real-world impact on the labor market — moving beyond theoretical capability to actual usage data.</p>
<hr />
<h2 id="heading-top-stories-accessible-content">🔑 Top Stories (Accessible Content)</h2>
<h3 id="heading-1-labor-market-impacts-of-ai-anthropic">1. 📊 Labor Market Impacts of AI (Anthropic)</h3>
<p>Anthropic published a new framework introducing <strong>"observed exposure"</strong> — a metric combining theoretical LLM capability with real Claude usage data from the <strong>Anthropic Economic Index</strong>.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li><strong>Programmer exposure is highest:</strong> Computer programmers show <strong>75% task coverage</strong>, followed by customer service reps and data entry keyers at <strong>67%</strong></li>
<li><strong>No unemployment signal yet:</strong> Analysis of Current Population Survey data shows no systematic unemployment increase in highly-exposed occupations since late 2022 (framework sensitive to ~1 percentage point changes)</li>
<li><strong>Youth hiring slowdown:</strong> Workers aged <strong>22–25</strong> in exposed occupations saw a <strong>14% drop in job-finding rates</strong> vs. 2022, corroborating findings from Brynjolfsson et al. using ADP payroll data</li>
<li><strong>Massive capability gap:</strong> Claude currently covers only <strong>33% of tasks</strong> in Computer &amp; Math occupations, despite <strong>94% being theoretically feasible</strong> — signalling significant future displacement potential as adoption deepens</li>
</ul>
<blockquote>
<p><strong>Practical Takeaway:</strong> AI displacement is real but uneven and still early-stage. The greatest near-term risk is in coding, support, and data roles — and among young workers entering the job market.</p>
</blockquote>
<hr />
<h3 id="heading-2-google-workspace-cli">2. 🖥️ Google Workspace CLI</h3>
<p>Google released an official <strong>command-line tool</strong> for its Workspace APIs (Drive, Gmail, Calendar, Sheets, Docs, Chat, Admin) — built in <strong>Rust</strong>, distributed via <strong>npm</strong>, and dynamically generated from Google's Discovery Service.</p>
<p><strong>Key Features:</strong></p>
<ul>
<li><strong>100+ agent skills</strong> with structured SKILL.md files and 50 curated workflow recipes</li>
<li><strong>Built-in MCP server</strong> allowing AI assistants (Claude, Gemini, etc.) to connect and operate on Workspace programmatically</li>
<li><strong>Dynamic API coverage</strong> — auto-updates as Google ships new APIs, no hardcoded endpoints</li>
<li><strong>Agent-first design</strong> — structured metadata, input/output schemas, and example prompts make it immediately usable by coding agents and automation pipelines</li>
</ul>
<blockquote>
<p><strong>Practical Takeaway:</strong> Google Workspace is now a <strong>tool-callable environment for AI agents</strong>, dramatically lowering the barrier for building agentic workflows on top of everyday productivity tools.</p>
</blockquote>
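<p>As a concrete illustration, here is how an agent might attach to that built-in MCP server using the MCP Python SDK's stdio client. The <code>gws</code> command name and <code>mcp</code> subcommand are assumptions; check the CLI's documentation for the real invocation and tool names:</p>
<pre><code class="language-python">import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Hypothetical invocation of the Workspace CLI's MCP server mode
    server = StdioServerParameters(command="gws", args=["mcp"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()       # discover Workspace skills
            print([t.name for t in tools.tools])

asyncio.run(main())
</code></pre>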
<hr />
<h2 id="heading-other-headlines-paywalled-titles-only">📰 Other Headlines (Paywalled — Titles Only)</h2>
<ul>
<li><strong>GPT-5.4</strong> launched by OpenAI with native computer use</li>
<li><strong>Exa Deep</strong> puts an agent inside every search</li>
<li><strong>Cognition</strong> previews SWE-1.6 training run</li>
<li><strong>Gemini 3.1 Flash-Lite</strong> drops with significant gains</li>
<li><strong>Qwen 3.5</strong> small model series released</li>
<li><strong>Liquid AI</strong> releases LFM2-24B-A2B model</li>
<li><strong>Cursor</strong> lands in JetBrains via ACP</li>
<li><strong>OpenAI Codex Security Agent</strong> launched</li>
<li><strong>OpenAI</strong> publishes CoT Controllability research</li>
<li><strong>Claude Opus</strong> hacks its own benchmark eval</li>
</ul>
<hr />
<h2 id="heading-papers-mentioned">📄 Papers Mentioned</h2>
<ul>
<li>Brynjolfsson et al. (ADP payroll data study on AI labor market effects) — no direct arXiv link provided in accessible content</li>
</ul>
<hr />
<h2 id="heading-key-takeaways">🧠 Key Takeaways</h2>
<ol>
<li>AI labor displacement is <strong>measurable and underway</strong>, but lagging far behind theoretical capability</li>
<li><strong>Young workers and programmers</strong> face the sharpest near-term risk</li>
<li>Google's Workspace CLI signals a shift toward <strong>infrastructure-level AI agent support</strong> from major platforms</li>
<li>The gap between what AI <em>can</em> do and what it <em>is</em> doing in workplaces remains large — but is closing</li>
</ol>
<p><img src="https://v3b.fal.media/files/b/0a9586e6/DLvBmls0lu9YcWg0XngR9_1JFcbEBp.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586e6/6qkHScyFkfeCOjhhe8UOF_mxtJf0CX.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Top AI Papers of the Week]]></title><description><![CDATA[Read the original article
Top AI Papers of the Week (March 1–8, 2026)
From Elvis Saravia's AI Newsletter
This week's roundup covers ten significant AI research papers spanning agentic systems, LLM reasoning, multi-agent coordination, theorem proving,...]]></description><link>https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:19:16 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586e1/RB8v6w7cpE9rL4sx6E-Ax_gsctLDCH.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-8c6">Read the original article</a></p>
<h1 id="heading-top-ai-papers-of-the-week-march-18-2026">Top AI Papers of the Week (March 1–8, 2026)</h1>
<p><em>From Elvis Saravia's AI Newsletter</em></p>
<p>This week's roundup covers ten significant AI research papers spanning agentic systems, LLM reasoning, multi-agent coordination, theorem proving, memory architectures, and efficient multimodal models.</p>
<hr />
<h2 id="heading-1-neuroskill-brain-computer-interface-meets-agentic-ai">1. NeuroSkill — Brain-Computer Interface Meets Agentic AI</h2>
<p>MIT researchers introduce <strong>NeuroSkill</strong>, a proactive agentic system that reads Brain-Computer Interface (BCI) signals in real time to anticipate user needs — rather than waiting for explicit commands.</p>
<ul>
<li>Runs a custom agentic loop called <strong>NeuroLoop</strong> that processes neural/biophysical signals through a foundation EXG model, converts them into state-of-mind descriptions, and triggers tool calls accordingly.</li>
<li>Fully <strong>offline edge deployment</strong> — no cloud dependency, ensuring privacy and low latency.</li>
<li>Handles both explicit and <strong>implicit requests</strong>, detecting cognitive overload or emotional shifts before the user asks for help.</li>
<li>Released under <strong>GPLv3 + AI100 ethical licensing</strong> for auditable, responsible use.</li>
</ul>
<p><strong>Takeaway:</strong> Proactive AI that interprets brain signals could fundamentally change human-computer interaction, especially for accessibility and high-cognitive-load environments.</p>
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-2-bayesian-teaching-for-llms">2. Bayesian Teaching for LLMs</h2>
<p>Google researchers show that LLMs can be trained to reason like Bayesians by fine-tuning on synthetic interactions with an idealised <strong>Bayesian Assistant</strong>.</p>
<ul>
<li>Constructs training data from a <strong>Bayesian Assistant</strong> demonstrating optimal probabilistic belief updating — no architectural changes required.</li>
<li>Trained models <strong>generalise</strong> to entirely new task types, suggesting Bayesian inference is a transferable capability.</li>
<li>Substantially reduces known LLM biases like <strong>base rate neglect</strong> and <strong>conservatism</strong>.</li>
<li>A smaller model trained on Bayesian interactions <strong>outperforms larger models</strong> reasoning from scratch — reinforcing data quality over scale.</li>
</ul>
<p><strong>Takeaway:</strong> Carefully curated synthetic training data can instil normative reasoning patterns that raw scale cannot, with broad implications for reliability in probabilistic domains.</p>
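<p>Base rate neglect, one of the biases the paper targets, is easy to make concrete. Here is the standard worked example with an illustrative 1%-prevalence test; the numbers are not from the paper:</p>
<pre><code class="language-python">prior       = 0.01   # P(disease)
sensitivity = 0.90   # P(positive | disease)
false_pos   = 0.05   # P(positive | no disease)

posterior = (sensitivity * prior) / (
    sensitivity * prior + false_pos * (1 - prior)
)
print(f"P(disease | positive) = {posterior:.1%}")   # ~15.4%, not 90%
# Neglecting the 1% base rate is exactly what pushes naive answers toward 90%.
</code></pre>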
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-3-why-llms-form-geometric-representations">3. Why LLMs Form Geometric Representations</h2>
<p>This paper mathematically proves why LLMs spontaneously develop striking geometric structures — calendar months form circles, historical years form spirals, spatial coordinates align to manifolds.</p>
<ul>
<li>Root cause is <strong>translation symmetry</strong> in co-occurrence statistics: month pairs co-occur based on time interval, not the months themselves, which forces circular geometry.</li>
<li>Derives manifold geometry <strong>analytically</strong> from data statistics rather than just observing it post-hoc.</li>
<li>Continuous concepts (e.g., years, number lines) form <strong>rippled 1D manifolds</strong>; cyclic concepts form circles — both analytically predicted.</li>
<li>The mechanism is <strong>universal</strong> across model architectures, emerging whenever co-occurrence statistics are governed by a latent variable.</li>
</ul>
<p><strong>Takeaway:</strong> Geometric structure in LLM representations is not an architectural accident — it is a mathematical consequence of how language statistics are structured.</p>
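<p>The claim is easy to verify in miniature: if similarity between months depends only on their cyclic interval, the similarity matrix is circulant, its leading non-constant eigenvectors are Fourier modes, and the spectral embedding of the twelve months is exactly a circle. A small numpy check (the exponential kernel is an arbitrary choice):</p>
<pre><code class="language-python">import numpy as np

n = 12
idx = np.arange(n)
gap = np.abs(idx[:, None] - idx[None, :])
dist = np.minimum(gap, n - gap)        # cyclic interval between months
sim = np.exp(-dist / 2.0)              # any interval-only kernel will do

vals, vecs = np.linalg.eigh(sim)       # ascending; last column is the constant mode
xy = vecs[:, -3:-1] * np.sqrt(vals[-3:-1])   # the two leading Fourier modes
radii = np.linalg.norm(xy, axis=1)
print(np.allclose(radii, radii[0]))    # True: all 12 months lie on one circle
</code></pre>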
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-4-theory-of-mind-in-multi-agent-llms">4. Theory of Mind in Multi-Agent LLMs</h2>
<p>This work evaluates a multi-agent architecture combining <strong>Theory of Mind (ToM)</strong>, <strong>Belief-Desire-Intention (BDI)</strong> models, and <strong>symbolic solvers</strong> on resource allocation problems.</p>
<ul>
<li>The counterintuitive central finding: <strong>adding cognitive mechanisms does not automatically improve coordination</strong>.</li>
<li>Stronger LLMs benefit from ToM and BDI; weaker models can be <strong>confused</strong> by the additional reasoning overhead.</li>
<li>Symbolic verification helps ground decisions in formal constraints and acts as a <strong>stabiliser</strong>.</li>
<li>Key design principle: <strong>match cognitive complexity to model capability</strong>.</li>
</ul>
<p><strong>Takeaway:</strong> For multi-agent system designers, the sophistication of cognitive scaffolding must be calibrated to the underlying model's capability — more is not always better.</p>
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-5-numina-lean-agent-general-coding-agent-for-theorem-proving">5. Numina-Lean-Agent — General Coding Agent for Theorem Proving</h2>
<p><strong>Numina-Lean-Agent</strong> reframes automated theorem proving by using a <strong>general-purpose coding agent</strong> (Claude Code) rather than a specialised prover system.</p>
<ul>
<li>Combines <strong>Claude Code</strong> with <strong>Numina-Lean-MCP</strong> to autonomously interact with the Lean proof assistant, accessing theorem libraries and reasoning tools.</li>
<li>Uses <strong>Model Context Protocol (MCP)</strong> for tool integration: Lean-LSP-MCP, LeanDex for semantic theorem retrieval, and an informal prover for proof strategies.</li>
<li>Using <strong>Claude Opus 4.5</strong>, solves all 12 problems on <strong>Putnam 2025</strong> — matching the best closed-source systems.</li>
<li>Also formalised the <strong>Brascamp-Lieb theorem</strong> through direct collaboration with mathematicians.</li>
<li>Fully <strong>open-source</strong> under Creative Commons BY 4.0.</li>
</ul>
<p><strong>Takeaway:</strong> General-purpose agents with the right tool integrations can match specialised theorem-proving systems — and improve simply by upgrading the base model.</p>
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-6-parammem-parametric-memory-for-diverse-self-reflection">6. ParamMem — Parametric Memory for Diverse Self-Reflection</h2>
<p><strong>ParamMem</strong> addresses the repetitive reflection problem in self-improving agents by encoding cross-sample reflection patterns into model parameters.</p>
<ul>
<li>Standard self-reflection produces near-identical outputs across iterations — adding noise rather than useful signal.</li>
<li><strong>Reflective diversity strongly correlates with task success</strong>; ParamMem enables diverse reflections via temperature-controlled sampling.</li>
<li>Uses a <strong>three-tier memory architecture</strong>: parametric memory (cross-sample patterns), episodic memory (task instances), and cross-sample memory (global strategies).</li>
<li>Supports <strong>weak-to-strong transfer</strong>: reflection patterns from smaller models transfer to larger ones.</li>
<li>Consistently outperforms baselines on <strong>code generation, mathematical reasoning, and multi-hop QA</strong>.</li>
</ul>
<p><strong>Takeaway:</strong> Diversity in self-reflection is a measurable driver of agent performance, and parametric memory is an efficient mechanism to achieve it without relying on larger external models.</p>
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-7-auton-declarative-agentic-ai-framework">7. Auton — Declarative Agentic AI Framework</h2>
<p>Snap Research introduces <strong>Auton</strong>, a declarative architecture for specifying, governing, and executing autonomous agent systems at production scale.</p>
<ul>
<li>Separates the <strong>Cognitive Blueprint</strong> (declarative, language-agnostic agent specification) from the <strong>Runtime Engine</strong>, enabling cross-language portability and formal auditability.</li>
<li>Formalises agent execution as an <strong>augmented Partially Observable Markov Decision Process</strong> with a latent reasoning space.</li>
<li>Introduces <strong>biologically-inspired hierarchical memory consolidation</strong> modelled on human episodic memory.</li>
<li>Runtime optimisations include <strong>parallel graph execution, speculative inference, and dynamic context pruning</strong>.</li>
<li>Safety enforced via a <strong>constraint manifold formalism</strong> using policy projection — not post-hoc filtering.</li>
</ul>
<p><strong>Takeaway:</strong> Auton provides a rigorous, production-oriented foundation for building deterministic, auditable, and efficient multi-step agent systems.</p>
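<p>Policy projection can be pictured in its simplest possible case: zero out probability on actions outside the constraint set and renormalise, so unsafe actions are unreachable rather than filtered after the fact. Auton's constraint-manifold formalism is more general than this toy version:</p>
<pre><code class="language-python">import numpy as np

def project_policy(probs, legal_mask):
    """Project an action distribution onto the legal set: illegal actions
    get exactly zero mass, and the remainder is renormalised."""
    projected = np.where(legal_mask, probs, 0.0)
    total = projected.sum()
    if total == 0:
        raise ValueError("no legal action available")
    return projected / total

print(project_policy(np.array([0.5, 0.3, 0.2]),
                     np.array([True, False, True])))
# -&gt; [0.714..., 0.0, 0.285...]
</code></pre>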
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-8-aegean-consensus-protocol-for-multi-agent-llms">8. Aegean — Consensus Protocol for Multi-Agent LLMs</h2>
<p><strong>Aegean</strong> reframes multi-agent refinement as a <strong>distributed consensus problem</strong>, enabling early termination when sufficient agents converge on an answer.</p>
<ul>
<li>Achieves <strong>1.2–20x latency reduction</strong> across four mathematical reasoning benchmarks while maintaining answer quality within 2.5%.</li>
<li>Uses a <strong>consensus-aware serving engine</strong> with incremental quorum detection to cut wasted compute on stragglers.</li>
<li>Replaces static heuristic workflows with dynamic, convergence-driven termination.</li>
</ul>
<p><strong>Takeaway:</strong> Treating multi-agent agreement as a distributed systems problem yields major efficiency gains without sacrificing accuracy.</p>
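<p>The serving idea is straightforward to sketch with asyncio: sample many agents, tally answers as they land, and return as soon as a quorum agrees rather than waiting for stragglers. The stub agent and the quorum size are illustrative:</p>
<pre><code class="language-python">import asyncio, collections, random

async def agent_answer(i):
    """Stub agent: replace with a real model call."""
    await asyncio.sleep(random.random())        # simulate variable latency
    return random.choice(["42", "42", "41"])    # mostly-agreeing answers

async def quorum(n_agents=8, need=4):
    """Return as soon as `need` agents agree, instead of waiting for all."""
    counts = collections.Counter()
    tasks = [asyncio.create_task(agent_answer(i)) for i in range(n_agents)]
    try:
        for fut in asyncio.as_completed(tasks):
            ans = await fut
            counts[ans] += 1
            if counts[ans] &gt;= need:
                return ans                       # stragglers cancelled below
        return counts.most_common(1)[0][0]       # no quorum: fall back to plurality
    finally:
        for t in tasks:
            t.cancel()

print(asyncio.run(quorum()))
</code></pre>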
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-9-diagnosing-agent-memory-retrieval-vs-utilisation-failures">9. Diagnosing Agent Memory — Retrieval vs. Utilisation Failures</h2>
<p>This paper introduces a <strong>diagnostic framework</strong> that separates two failure modes in LLM agent memory: retrieval failures and utilisation failures.</p>
<ul>
<li>A <strong>3×3 factorial study</strong> crossing three write strategies with three retrieval methods reveals retrieval is the <strong>dominant bottleneck</strong>, accounting for 11–46% of errors.</li>
<li>Utilisation failures remain stable at <strong>4–8% regardless of configuration</strong> — suggesting the model's ability to use retrieved information is relatively robust.</li>
<li><strong>Hybrid reranking</strong> cuts retrieval failures roughly in half — a larger gain than any write strategy optimisation.</li>
</ul>
<p><strong>Takeaway:</strong> When debugging agent memory systems, prioritise retrieval quality over write strategy; hybrid reranking is the highest-leverage intervention.</p>
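<p>Hybrid reranking is commonly implemented as normalised score fusion between a lexical scorer and a dense scorer; the paper's exact recipe may differ, but the shape is roughly:</p>
<pre><code class="language-python">def hybrid_rerank(bm25_scores, dense_scores, alpha=0.5):
    """Min-max normalise each scorer, then blend with weight alpha."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}
    b, d = norm(bm25_scores), norm(dense_scores)
    fused = {doc: alpha * b.get(doc, 0) + (1 - alpha) * d.get(doc, 0)
             for doc in set(b) | set(d)}
    return sorted(fused, key=fused.get, reverse=True)

print(hybrid_rerank({"m1": 12.0, "m2": 3.0}, {"m1": 0.2, "m3": 0.9}))
</code></pre>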
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-10-phi-4-reasoning-vision-15b-compact-multimodal-reasoning">10. Phi-4-reasoning-vision-15B — Compact Multimodal Reasoning</h2>
<p>Microsoft presents <strong>Phi-4-reasoning-vision-15B</strong>, a compact open-weight multimodal model combining visual understanding with structured reasoning.</p>
<ul>
<li>Trained on only <strong>200 billion tokens</strong> of multimodal data, excelling at math, science reasoning, and UI comprehension.</li>
<li>Requires <strong>significantly less compute</strong> than comparable open-weight vision-language models.</li>
<li>Key insight: <strong>systematic filtering, error correction, and synthetic augmentation</strong> are the primary performance levers — pushing the accuracy-compute Pareto frontier.</li>
</ul>
<p><strong>Takeaway:</strong> Efficient multimodal reasoning at 15B parameters is achievable through rigorous data curation, reinforcing that data quality remains the dominant factor over raw scale.</p>
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-key-themes-this-week">Key Themes This Week</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Theme</td><td>Papers</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Data quality over scale</strong></td><td>Bayesian Teaching, Phi-4-reasoning-vision</td></tr>
<tr>
<td><strong>Proactive / agentic systems</strong></td><td>NeuroSkill, Auton, Numina-Lean-Agent</td></tr>
<tr>
<td><strong>Memory &amp; reflection diversity</strong></td><td>ParamMem, Diagnosing Agent Memory</td></tr>
<tr>
<td><strong>Multi-agent coordination</strong></td><td>Theory of Mind, Aegean</td></tr>
<tr>
<td><strong>Geometric structure in LLMs</strong></td><td>Why LLMs Form Geometric Representations</td></tr>
</tbody>
</table>
</div><blockquote>
<p><em>Note: Arxiv links were not directly provided in the source article. The [Paper] links above point to arxiv.org as placeholders — check Elvis Saravia's original newsletter for direct paper URLs.</em></p>
</blockquote>
<p><img src="https://v3b.fal.media/files/b/0a9586e1/tpB8G3ra4Cf5Sud8bR8wM_HNZuJx2N.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586e1/RB8v6w7cpE9rL4sx6E-Ax_gsctLDCH.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents Weekly: Claude Code Review]]></title><description><![CDATA[Read the original article
AI Agents Weekly: Claude Code Review & More
From Elvis Saravia's AI Newsletter — March 14, 2026
Main Thesis
This issue covers a wave of practical AI agent tooling shipping in production, with a focus on multi-agent architect...]]></description><link>https://rzem.guru/ai-agents-weekly-claude-code-review-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-weekly-claude-code-review-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:17:48 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586d6/3oDsUmtn4zhHuM7Rw8v0C_EiCaJi4Z.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-code-review">Read the original article</a></p>
<h2 id="heading-ai-agents-weekly-claude-code-review-amp-more">AI Agents Weekly: Claude Code Review &amp; More</h2>
<p><em>From Elvis Saravia's AI Newsletter — March 14, 2026</em></p>
<h3 id="heading-main-thesis">Main Thesis</h3>
<p>This issue covers a wave of practical AI agent tooling shipping in production, with a focus on multi-agent architectures for code quality, automated safety constraints, and expanding AI infrastructure ecosystems.</p>
<hr />
<h3 id="heading-top-story-1-claude-code-review-anthropic">🔍 Top Story 1: Claude Code Review (Anthropic)</h3>
<p>Anthropic launched <strong>Code Review for Claude Code</strong> — an automated multi-agent system that reviews every pull request by dispatching parallel AI agents to scan, verify, and prioritize issues.</p>
<p><strong>How it works:</strong></p>
<ul>
<li>Multiple agents run in parallel: one scans for issues, others verify findings to eliminate false positives, and a final pass ranks bugs by severity</li>
<li>Outputs both a <strong>summary comment</strong> and <strong>inline code annotations</strong></li>
</ul>
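<p>Structurally, that is a parallel fan-out followed by two filtering passes. A sketch with the agent calls stubbed out, since Anthropic has not published the internal prompts or interfaces:</p>
<pre><code class="language-python">from concurrent.futures import ThreadPoolExecutor

def review_pr(diff, scan_agents, verify_agent, rank_agent):
    # Several scanners run in parallel over the same diff
    with ThreadPoolExecutor() as pool:
        findings = [f for fs in pool.map(lambda a: a(diff), scan_agents)
                    for f in fs]
    # A second pass re-checks each finding to drop false positives
    confirmed = [f for f in findings if verify_agent(diff, f)]
    # A final pass orders the survivors by severity
    return rank_agent(confirmed)
</code></pre>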
<p><strong>Key findings:</strong></p>
<ul>
<li>Large PRs (1,000+ lines): surfaced findings on 84% of PRs, averaging <strong>7.5 issues per PR</strong></li>
<li>Small PRs (&lt;50 lines): surfaced findings on 31% of PRs</li>
<li><strong>&lt;1% of flagged issues</strong> were marked incorrect by Anthropic engineers</li>
<li>Caught production-critical bugs that appeared routine in diffs</li>
</ul>
<p><strong>Pricing &amp; Access:</strong></p>
<ul>
<li>Available as a research preview for <strong>Team and Enterprise</strong> customers</li>
<li>Costs <strong>$15–25 per PR</strong>, billed on token usage</li>
<li>Configurable monthly caps and per-repo controls</li>
</ul>
<hr />
<h3 id="heading-top-story-2-autoharness-automated-agent-constraint-synthesis">🔍 Top Story 2: AutoHarness — Automated Agent Constraint Synthesis</h3>
<p>Researchers introduced <strong>AutoHarness</strong>, a technique enabling LLMs to automatically synthesize protective code harnesses around themselves — preventing illegal actions without human-written constraints.</p>
<p><strong>Key findings:</strong></p>
<ul>
<li>In a recent LLM chess competition, <strong>78% of Gemini-2.5-Flash losses</strong> were due to illegal moves — AutoHarness eliminates this failure class entirely</li>
<li>Tested across <strong>145 different TextArena games</strong></li>
<li><strong>Gemini-2.5-Flash + AutoHarness</strong> outperformed the larger <strong>Gemini-2.5-Pro</strong> (unconstrained), at lower cost</li>
<li>Achieves <strong>zero-shot generalization</strong>: extends beyond games to full policy generation in code, removing runtime LLM decision-making entirely</li>
<li>Outperforms <strong>GPT-5.2-High</strong> on certain benchmarks</li>
</ul>
<p><strong>Core insight:</strong> Rather than trusting a model to self-constrain, auto-generate a verified harness that makes illegal states <em>unreachable</em> — shifting safety from model behaviour to environment design.</p>
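<p>For chess specifically, the synthesised harness amounts to something like the following hand-written sketch using the python-chess library; the retry count and fallback policy are assumptions:</p>
<pre><code class="language-python">import chess

def constrained_move(board: chess.Board, propose) -&gt; chess.Move:
    """The model proposes moves, but only members of board.legal_moves
    can ever reach the environment, making illegal states unreachable."""
    for _ in range(3):                        # give the model a few tries
        try:
            move = chess.Move.from_uci(propose(board.fen()))
        except ValueError:
            continue                          # not even valid UCI notation
        if move in board.legal_moves:
            return move
    return next(iter(board.legal_moves))      # safe fallback: any legal move

board = chess.Board()
board.push(constrained_move(board, lambda fen: "e2e4"))  # stand-in for an LLM
</code></pre>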
<hr />
<h3 id="heading-other-headlines-partially-paywalled">📰 Other Headlines (Partially Paywalled)</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Story</td><td>Summary</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Perplexity Personal Computer</strong></td><td>Perplexity launches an always-on AI personal computer</td></tr>
<tr>
<td><strong>Cloudflare /crawl</strong></td><td>Single-call <code>/crawl</code> endpoint for web scraping in agents</td></tr>
<tr>
<td><strong>Context7 CLI</strong></td><td>Brings up-to-date library docs directly to any agent</td></tr>
<tr>
<td><strong>Andrew Ng — Context Hub</strong></td><td>New launch focused on context management for agents</td></tr>
<tr>
<td><strong>Cursor Marketplace</strong></td><td>Adds 30+ plugins for the AI code editor</td></tr>
<tr>
<td><strong>OpenAI Skills for Agents SDK</strong></td><td>New SDK capability for composable agent skills</td></tr>
<tr>
<td><strong>Gemini Embedding 2</strong></td><td>Google launches next-gen embedding model</td></tr>
<tr>
<td><strong>Meta MTIA Chips</strong></td><td>Meta ships four MTIA AI chips in two years</td></tr>
<tr>
<td><strong>Codex Tax Agent</strong></td><td>Codex agent files taxes autonomously, catches a <strong>$20K error</strong></td></tr>
</tbody>
</table>
</div><hr />
<h3 id="heading-practical-takeaways">💡 Practical Takeaways</h3>
<ol>
<li><strong>Multi-agent parallelism beats single-pass review</strong> — Claude Code Review shows that splitting scan, verify, and rank into separate agents dramatically improves precision</li>
<li><strong>Constraints &gt; Scale</strong> — AutoHarness proves that a well-constrained smaller model can outperform a larger unconstrained one, with cost savings</li>
<li><strong>Safety should live in the environment</strong>, not just in the model's behaviour — harness-based approaches are more reliable than prompt-level self-restraint</li>
<li><strong>AI infrastructure is maturing fast</strong> — from one-call crawl endpoints to plugin marketplaces, the tooling layer around agents is consolidating rapidly</li>
</ol>
<hr />
<h3 id="heading-papers">📄 Papers</h3>
<ul>
<li>AutoHarness: Automated Agent Constraint Synthesis — <em>(arxiv link not publicly available in accessible content)</em></li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a9586d6/JcFiyrp_G5D4UiLcf1eyC_q6B5aZaA.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586d6/3oDsUmtn4zhHuM7Rw8v0C_EiCaJi4Z.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Top AI Papers of the Week]]></title><description><![CDATA[Read the original article
Top AI Papers of the Week (March 9–15, 2026)
From Elvis Saravia's AI Newsletter — 10 papers spanning coding agents, attention mechanisms, reinforcement learning, and GPU kernel design.

1. OpenDev — Terminal-Native Coding Ag...]]></description><link>https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:16:40 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586d1/8pLptC8WfB_5GF6wyWy5m_APFckO2R.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-b8c">Read the original article</a></p>
<h1 id="heading-top-ai-papers-of-the-week-march-915-2026">Top AI Papers of the Week (March 9–15, 2026)</h1>
<p><em>From Elvis Saravia's AI Newsletter — 10 papers spanning coding agents, attention mechanisms, reinforcement learning, and GPU kernel design.</em></p>
<hr />
<h2 id="heading-1-opendev-terminal-native-coding-agents">1. OpenDev — Terminal-Native Coding Agents</h2>
<p>OpenDev is an open-source, command-line coding agent built for where developers already live: the terminal. It comes with an 81-page technical report covering scaffolding, harness design, and context engineering.</p>
<p><strong>Key features:</strong></p>
<ul>
<li><strong>Dual-agent architecture</strong> — separates planning from execution using workload-specialised model routing across concurrent sessions</li>
<li><strong>Adaptive context compaction</strong> — lazy tool discovery and adaptive reduction of older observations keeps working memory lean</li>
<li><strong>Automated project memory</strong> — event-driven reminders prevent instruction fade-out across sessions</li>
<li><strong>Four-layer architecture</strong> — agent reasoning, context engineering, tooling, and persistence layers form a modular, extensible foundation</li>
</ul>
<p><strong>Takeaway:</strong> A production-grade blueprint for building autonomous coding agents with disciplined context management.</p>
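<p>The compaction idea is easy to sketch: keep recent turns verbatim and shrink the oldest tool observations until the context fits a token budget. OpenDev's actual policy is richer than this toy version:</p>
<pre><code class="language-python">def compact(messages, budget, keep_recent=5, estimate=lambda m: len(m) // 4):
    """Truncate old observations, oldest first, until the estimated token
    count fits the budget; the most recent turns are never touched."""
    msgs = list(messages)
    i = 0
    while sum(map(estimate, msgs)) &gt; budget and i &lt; len(msgs) - keep_recent:
        msgs[i] = msgs[i][:200] + " ...[truncated]"
        i += 1
    return msgs
</code></pre>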
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-2-autoharness-programmatic-constraints-beat-bigger-models">2. AutoHarness — Programmatic Constraints Beat Bigger Models</h2>
<p>Google DeepMind researchers found that 78% of Gemini-2.5-Flash losses in the Kaggle GameArena chess competition came from <strong>illegal moves</strong>, not poor strategy. AutoHarness automatically synthesises code harnesses to prevent illegal actions.</p>
<p><strong>Key findings:</strong></p>
<ul>
<li><strong>Automatic harness synthesis</strong> — Gemini-2.5-Flash generates its own constraint layer through iterative refinement with environment feedback</li>
<li><strong>Smaller beats larger</strong> — the harnessed Gemini-2.5-Flash outperforms Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena single-player games</li>
<li><strong>100% illegal move prevention</strong> — across 145 TextArena games (single and two-player)</li>
<li><strong>Cost-effective</strong> — harness engineering is cheaper and more effective than deploying larger models</li>
</ul>
<p><strong>Takeaway:</strong> Structured code constraints are a powerful, cost-efficient alternative to raw model scaling for agent reliability.</p>
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-3-skillnet-durable-ai-skill-repositories-at-scale">3. SkillNet — Durable AI Skill Repositories at Scale</h2>
<p>AI agents constantly rediscover solutions instead of reusing prior work. SkillNet provides open infrastructure for creating, evaluating, and organising AI skills at scale.</p>
<p><strong>Key features:</strong></p>
<ul>
<li><strong>Unified skill ontology</strong> — skills from code libraries, prompt templates, and tool compositions are linked relationally for discovery and composition</li>
<li><strong>Multi-dimensional evaluation</strong> — every skill is scored on Safety, Completeness, Executability, Maintainability, and Cost-awareness</li>
<li><strong>200,000+ skill repository</strong> — with a browsable platform and Python toolkit for programmatic access</li>
<li><strong>Consistent gains</strong> — on ALFWorld, WebShop, and ScienceWorld: <strong>+40% average reward, −30% execution steps</strong></li>
</ul>
<p><strong>Takeaway:</strong> A shared skill commons dramatically improves agent efficiency and generalisation across task domains.</p>
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-4-the-spike-the-sparse-and-the-sink-transformer-attention-artifacts">4. The Spike, the Sparse and the Sink — Transformer Attention Artifacts</h2>
<p>Yann LeCun and NYU collaborators dissect two recurring Transformer phenomena: <strong>massive activations</strong> (extreme channel outliers in specific tokens) and <strong>attention sinks</strong> (tokens attracting disproportionate attention regardless of relevance).</p>
<p><strong>Key findings:</strong></p>
<ul>
<li><strong>Distinct scopes</strong> — massive activations operate globally (implicit model parameters); attention sinks operate locally (head-level attention bias)</li>
<li><strong>Pre-norm is the culprit</strong> — the pre-norm configuration common in modern Transformers enables both phenomena to co-occur; removing it decouples them</li>
<li><strong>Efficiency implications</strong> — quantisation, model compression, and KV-cache optimisation can fail silently when these phenomena are disrupted</li>
<li><strong>Not fundamental</strong> — these are design-dependent artifacts, opening the door to architectural modifications that eliminate them without sacrificing capability</li>
</ul>
<p><strong>Takeaway:</strong> Practitioners optimising Transformers for efficiency must account for these phenomena; they are architectural choices, not mathematical necessities.</p>
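<p>Attention sinks are also easy to measure in your own models: check how much attention mass lands on the first token. A toy metric, assuming you can export attention weights:</p>
<pre><code class="language-python">import numpy as np

def sink_score(attn):
    """Fraction of attention mass on the first token, averaged over heads
    and query positions. attn has shape [heads, query_len, key_len] and
    each attention row sums to 1."""
    return attn[:, :, 0].mean()

attn = np.random.dirichlet(np.ones(16), size=(8, 16))  # random, well-behaved rows
print(sink_score(attn))   # ~1/16 here; sink-heavy heads score far higher
</code></pre>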
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-5-karl-reinforcement-learning-for-enterprise-search-agents">5. KARL — Reinforcement Learning for Enterprise Search Agents</h2>
<p>Databricks presents KARL, trained via RL across heterogeneous search tasks, achieving state-of-the-art on the newly introduced <strong>KARLBench</strong> spanning six search domains.</p>
<p><strong>Key features:</strong></p>
<ul>
<li><strong>OAPL post-training paradigm</strong> — iterative large-batch off-policy RL robust to trainer/inference discrepancies without clipped importance weighting</li>
<li><strong>Multi-task training</strong> — covers constraint-driven entity search, cross-document synthesis, tabular reasoning, entity retrieval, procedural reasoning, and fact aggregation</li>
<li><strong>Pareto-optimal</strong> — outperforms Claude 4.6 and GPT 5.2 on cost-quality and latency-quality tradeoffs starting from GLM 4.5 Air</li>
<li><strong>Strong scores</strong> — KARL-BCP: 59.6 → 70.4 on BrowseComp-Plus with value-guided search; KARL-TREC: 85.0 on TREC-Biogen</li>
</ul>
<p><strong>Takeaway:</strong> Multi-task RL with a purpose-built off-policy training paradigm can surpass closed frontier models on agentic search with sufficient test-time compute.</p>
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-6-memexrl-indexed-experience-memory-for-long-horizon-agents">6. Memex(RL) — Indexed Experience Memory for Long-Horizon Agents</h2>
<p>Long-horizon tasks cause LLM agents to lose track of prior attempts and remaining goals. Memex(RL) introduces an <strong>indexed experience memory</strong> that scales without discarding evidence or exploding context.</p>
<p><strong>Key features:</strong></p>
<ul>
<li><strong>Indexed experience memory</strong> — compact working context with structured summaries and stable indices; full-fidelity interactions stored externally</li>
<li><strong>RL-optimised memory operations</strong> — MemexRL trains agents to strategically decide what to summarise, archive, index, and retrieve under a context budget</li>
<li><strong>Bounded retrieval complexity</strong> — theoretical guarantees that decision quality is maintained with bounded retrieval operations as task history grows</li>
<li><strong>Better results, smaller context</strong> — improved task success rates on long-horizon benchmarks using significantly less working context than baselines</li>
</ul>
<p><strong>Takeaway:</strong> Strategic memory management, not brute-force context expansion, is the key to scaling agents on complex, long-horizon tasks.</p>
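<p>The core data structure is small: stable indices plus one-line summaries in the prompt, full records outside it. A sketch (the RL-trained policy deciding what to summarise and retrieve is the part this omits):</p>
<pre><code class="language-python">class IndexedMemory:
    """Working context holds only indexed one-line summaries; the
    full-fidelity record lives in an external store, fetched by index."""
    def __init__(self):
        self.store = []          # full records, never discarded

    def archive(self, record: str, summary: str) -&gt; str:
        self.store.append(record)
        return f"[#{len(self.store) - 1}] {summary}"   # goes into the prompt

    def retrieve(self, index: int) -&gt; str:
        return self.store[index]                       # bounded-cost lookup

mem = IndexedMemory()
print(mem.archive("full 40k-token build log ...", "build failed: missing libfoo"))
# -&gt; "[#0] build failed: missing libfoo"
</code></pre>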
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-7-flashattention-4-co-designed-for-blackwell-gpus">7. FlashAttention-4 — Co-Designed for Blackwell GPUs</h2>
<p>FlashAttention-4 co-designs attention algorithms and kernel pipelines for NVIDIA B200/GB200 GPUs, which have asymmetric hardware scaling (tensor core throughput doubled; other units scaled more slowly).</p>
<p><strong>Key features:</strong></p>
<ul>
<li><strong>Major speedups</strong> — up to <strong>1.3× over cuDNN 9.13</strong> and <strong>2.7× over Triton</strong> on B200 with BF16; up to 1613 TFLOPs/s at 71% hardware utilisation</li>
<li><strong>Asymmetric scaling solutions</strong> — fully asynchronous matrix multiply pipelines, larger tile sizes, software-emulated exponential/conditional softmax rescaling, tensor memory to reduce shared memory traffic</li>
<li><strong>Python-native</strong> — implemented in CuTe-DSL embedded in Python; <strong>20–30× faster compile times</strong> vs. C++ template approaches</li>
<li><strong>Architecture-first thinking</strong> — Hopper-era optimisations leave significant performance on the table on Blackwell; new hardware demands new algorithms</li>
</ul>
<p><strong>Takeaway:</strong> Next-generation GPU architectures require ground-up attention kernel redesigns, and Python-native kernel development is now a viable path.</p>
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-8-structuredagent-hierarchical-planning-for-web-tasks">8. STRUCTUREDAGENT — Hierarchical Planning for Web Tasks</h2>
<p>STRUCTUREDAGENT introduces a hierarchical planning framework using <strong>dynamic AND/OR trees</strong> for long-horizon web tasks. The LLM is invoked only for local operations (node expansion or repair), while the system maintains the full planning tree.</p>
<p><strong>Key features:</strong></p>
<ul>
<li>Structured memory module tracks candidate solutions to improve constraint satisfaction</li>
<li>Interpretable hierarchical plans enable easier debugging and human intervention</li>
<li>Improved performance on WebVoyager, WebArena, and custom shopping benchmarks vs. standard LLM web agents</li>
</ul>
<p><strong>Takeaway:</strong> Separating global plan management from local LLM reasoning improves both performance and interpretability in complex web agent tasks.</p>
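<p>AND/OR trees evaluate with a few lines of recursion, which is what makes the plans cheap for the system to maintain and easy for humans to read; only leaves touch the LLM or browser. An illustrative sketch:</p>
<pre><code class="language-python">from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                       # "AND", "OR", or "LEAF"
    action: str = ""
    children: list = field(default_factory=list)

def solve(node, try_action):
    """AND nodes need every child to succeed; OR nodes need any one."""
    if node.kind == "LEAF":
        return try_action(node.action)      # the only place the LLM is invoked
    results = (solve(c, try_action) for c in node.children)
    return all(results) if node.kind == "AND" else any(results)

plan = Node("AND", children=[
    Node("LEAF", "open product page"),
    Node("OR", children=[Node("LEAF", "apply coupon"),
                         Node("LEAF", "use store credit")]),
])
print(solve(plan, lambda action: True))
</code></pre>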
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-9-agentir-reasoning-aware-retrieval-for-deep-research-agents">9. AgentIR — Reasoning-Aware Retrieval for Deep Research Agents</h2>
<p>Deep research agents generate rich reasoning traces before each search call, but standard retrievers ignore this signal entirely. AgentIR jointly embeds the agent's reasoning trace with its query.</p>
<p><strong>Key features:</strong></p>
<ul>
<li><strong>Reasoning-aware retrieval</strong> — jointly embeds reasoning traces and queries for richer search intent signals</li>
<li><strong>DR-Synth</strong> — a data synthesis method for generating training data from standard QA datasets</li>
<li><strong>Strong results</strong> — AgentIR-4B achieves <strong>68% accuracy</strong> on BrowseComp-Plus with Tongyi-DeepResearch vs. 50% with conventional embedding models twice its size and 37% with BM25</li>
</ul>
<p><strong>Takeaway:</strong> Incorporating agent reasoning into the retrieval process is a high-leverage, low-cost improvement for deep research systems.</p>
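<p>At inference time the mechanism is almost a one-liner: feed the reasoning trace together with the query into the retriever. The trained AgentIR encoder embeds them jointly; string concatenation, shown here, is the crude approximation:</p>
<pre><code class="language-python">def retrieval_query(reasoning_trace: str, query: str) -&gt; str:
    """Conventional retrievers embed only `query`; reasoning-aware
    retrieval gives the embedder the accumulated search intent too."""
    return f"{reasoning_trace.strip()}\n\nSearch: {query.strip()}"
</code></pre>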
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-10-think-harder-or-know-more-looping-vs-memory-in-transformers">10. Think Harder or Know More — Looping vs. Memory in Transformers</h2>
<p>This paper studies Transformers with two additions: <strong>adaptive per-layer looping</strong> (each block iterates its hidden state via a learned halting mechanism) and <strong>gated memory banks</strong> (additional learned storage).</p>
<p><strong>Key findings:</strong></p>
<ul>
<li><strong>Looping helps maths</strong> — adaptive looping primarily benefits mathematical reasoning tasks</li>
<li><strong>Memory helps commonsense</strong> — gated memory banks recover performance on commonsense reasoning tasks</li>
<li><strong>Combined superiority</strong> — combining both mechanisms outperforms an iso-FLOP baseline with 3× the number of layers on math benchmarks</li>
<li><strong>Layer specialisation</strong> — early layers loop minimally and access memory sparingly; later layers do both heavily</li>
</ul>
<p><strong>Takeaway:</strong> Different cognitive demands (computation vs. recall) require different architectural primitives; combining them yields efficiency gains over simply adding more layers.</p>
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-overall-themes-this-week">Overall Themes This Week</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Theme</td><td>Papers</td></tr>
</thead>
<tbody>
<tr>
<td>Agentic coding &amp; planning</td><td>OpenDev, STRUCTUREDAGENT</td></tr>
<tr>
<td>Context &amp; memory management</td><td>OpenDev, Memex(RL), SkillNet</td></tr>
<tr>
<td>RL for agents</td><td>KARL, Memex(RL)</td></tr>
<tr>
<td>Constraint engineering</td><td>AutoHarness</td></tr>
<tr>
<td>Transformer architecture insights</td><td>The Spike/Sink, Think Harder or Know More</td></tr>
<tr>
<td>GPU efficiency</td><td>FlashAttention-4</td></tr>
<tr>
<td>Retrieval &amp; search</td><td>KARL, AgentIR</td></tr>
</tbody>
</table>
</div><blockquote>
<p><strong>Bottom line:</strong> The week's papers collectively argue that smarter architecture, structured constraints, and disciplined memory management consistently outperform brute-force scaling — whether in context windows, model size, or GPU compute.</p>
</blockquote>
<p><img src="https://v3b.fal.media/files/b/0a9586d1/jBhsoC3u7XKPfup_-7N5M_t48Bn2IO.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586d1/8pLptC8WfB_5GF6wyWy5m_APFckO2R.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[The Claude Code Source Leak: What Was Actually Inside]]></title><description><![CDATA[Based on the Engineer's Codex article "Diving into Claude Code's Source Code Leak" by Engineer's Codex.


On March 31, 2026, Anthropic accidentally shipped a .map sourcemap file inside a Claude Code npm update. Within minutes, 600,000 lines of one of...]]></description><link>https://rzem.guru/the-claude-code-source-leak-what-was-actually-inside</link><guid isPermaLink="true">https://rzem.guru/the-claude-code-source-leak-what-was-actually-inside</guid><category><![CDATA[AI]]></category><category><![CDATA[#anthropic]]></category><category><![CDATA[claude]]></category><category><![CDATA[open source]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Wed, 08 Apr 2026 22:44:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CHy8!,w_1200,h_675,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F768ab99c-f2f5-4adf-bf91-072511a57a30_1912x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><em>Based on the Engineer's Codex article <a target="_blank" href="https://read.engineerscodex.com/p/diving-into-claude-codes-source-code">"Diving into Claude Code's Source Code Leak"</a>.</em></p>
</blockquote>
<hr />
<p>On March 31, 2026, Anthropic accidentally shipped a <code>.map</code> sourcemap file inside a Claude Code npm update. Within minutes, 600,000 lines of one of the most deliberately closed AI products in the world were mirrored, forked, ported, and uploaded to decentralized servers.</p>
<p>Claude Code has always been notoriously opaque — Anthropic's Agent SDKs provide almost no insight into internals, and the company actively keeps the source closed. Which made what happened next a very big deal.</p>
<hr />
<h2 id="heading-how-it-happened">How It Happened</h2>
<p>A <code>.map</code> file is a sourcemap — a developer tool that maps minified/compiled code back to the original source. Shipping it in production is a classic mistake that CI pipelines are supposed to catch.</p>
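<p>For anyone shipping npm packages, the guard is cheap to add. A sketch of the kind of publish-time check that would have caught this (Anthropic's actual pipeline is unknown):</p>
<pre><code class="language-python">import sys
from pathlib import Path

def sourcemap_offenders(dist="dist"):
    """Flag .map files and sourceMappingURL pointers in the publish dir."""
    bad = [str(p) for p in Path(dist).rglob("*.map")]
    for js in Path(dist).rglob("*.js"):
        if "sourceMappingURL" in js.read_text(errors="ignore"):
            bad.append(f"{js} (sourceMappingURL comment)")
    return bad

if offenders := sourcemap_offenders():
    sys.exit("refusing to publish:\n" + "\n".join(offenders))
</code></pre>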
<p>Boris Cherny, a Claude Code engineer at Anthropic, confirmed it was plain developer error, not a tooling bug. His follow-up was notably measured: <em>"Mistakes happen. As a team, the important thing is to recognize it's never an individual's fault. It's the process, the culture, or the infra."</em></p>
<p>A blameless post-mortem take — the Google SRE playbook applied in real time. The goal is an environment where engineers report mistakes honestly rather than hiding them.</p>
<p>Chaofan Shou (<a target="_blank" href="https://twitter.com/Fried_rice">@Fried_rice</a>) was first to spot it and posted a public link. The race was on within minutes.</p>
<hr />
<h2 id="heading-the-chaos-that-followed">The Chaos That Followed</h2>
<h3 id="heading-claw-code-75000-stars-overnight">Claw-Code: 75,000 Stars Overnight</h3>
<p>The most popular fork was <strong>claw-code</strong> on GitHub, created by <a target="_blank" href="https://twitter.com/realsigridjin">@realsigridjin</a>. Rather than just mirroring the source (which would be an obvious DMCA target), he ported the entire thing to Python — using OpenAI's Codex to do the rewrite. Deliberate irony, presumably.</p>
<p>The legal theory: a clean-room AI rewrite can't be touched by DMCA. Claw-code hit 75,000+ stars and 75,000+ forks.</p>
<h3 id="heading-the-copyright-question-nobody-can-answer">The Copyright Question Nobody Can Answer</h3>
<p>Traditional clean-room reverse engineering is a real legal process:</p>
<blockquote>
<p><em>"It involves two separate teams: one analyzes the original software to create specifications, while a second 'clean' team creates the new product based only on those specifications, ensuring no proprietary code is copied."</em></p>
</blockquote>
<p>It used to take months and cost serious money. That was the barrier.</p>
<p>Now anyone with a Claude Max subscription can point an agent at a codebase's tests and have the logic rebuilt overnight. The practice has never been challenged in court at this scale.</p>
<p>Gergely Orosz framed the PR problem neatly: even if Anthropic tries to assert copyright, do they want the battle of suing an open source project for rebuilding their own <strong>AI-written</strong> product? And could they even prove it?</p>
<p>Meanwhile, another user uploaded a stripped version to IPFS with all telemetry removed, security guardrails disabled, and experimental features unlocked. Whether DMCA can even reach IPFS-hosted content is its own unresolved legal question.</p>
<p>Status at time of writing: non-rewritten forks have been DMCA'd. Claw-code is still up.</p>
<hr />
<h2 id="heading-what-was-actually-inside">What Was Actually Inside</h2>
<p>This is the part that matters.</p>
<h3 id="heading-kairos-the-unannounced-autonomous-agent">KAIROS: The Unannounced Autonomous Agent</h3>
<p>Hidden behind feature flags named <code>PROACTIVE</code> and <code>KAIROS</code>, the codebase contains a <strong>fully built autonomous agent mode</strong> that Anthropic has never publicly announced.</p>
<p>KAIROS runs in the background, 24/7, without you asking. Every few seconds it receives a heartbeat prompt:</p>
<blockquote>
<p><em>"Anything worth doing right now?"</em></p>
</blockquote>
<p>It evaluates what's happening and makes a call: act, or stay quiet. If it acts, it can fix errors, respond to messages, update files, and run tasks — everything Claude Code can already do, except without you initiating any of it.</p>
<p>KAIROS has three exclusive tools that regular Claude Code doesn't:</p>
<ul>
<li><strong>Push notifications</strong> — can reach you on phone or desktop even when the terminal is closed</li>
<li><strong>File delivery</strong> — can send you things it created without being asked</li>
<li><strong>Pull request subscriptions</strong> — watches your GitHub and reacts to code changes on its own</li>
</ul>
<p>It keeps <strong>append-only daily logs</strong> of everything it noticed, decided, and did. It cannot erase its own history.</p>
<p>At night it runs a process the codebase literally calls <strong><code>autoDream</code></strong> — consolidating what it learned during the day and reorganising memory. It persists across sessions. Close your laptop on Friday, open it Monday: KAIROS has been working the whole time.</p>
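<p>To make the described behaviour concrete, here is a minimal sketch of such a proactive loop. The heartbeat prompt, the append-only log, and the idea of nightly consolidation come from the article; every function name and detail below is our invention — none of this is the leaked code.</p>
<pre><code># Hypothetical sketch of a proactive heartbeat loop — illustration only.
import json
import time
from datetime import date, datetime, timezone

HEARTBEAT_PROMPT = "Anything worth doing right now?"  # quoted in the leak coverage

def log_event(event):
    """Append-only daily log: the agent can add to its history, never erase it."""
    with open(f"agent-{date.today()}.log", "a") as f:
        f.write(json.dumps({"ts": datetime.now(timezone.utc).isoformat(), **event}) + "\n")

def gather_context():
    """Stub: the real system watches PRs, errors, and incoming messages."""
    return {"open_prs": [], "errors": [], "messages": []}

def decide(prompt, context):
    """Stub for the model call: act, or stay quiet."""
    return {"act": bool(context["errors"]), "plan": "fix the first error"}

def heartbeat_loop(interval_seconds=5.0):
    while True:
        context = gather_context()
        decision = decide(HEARTBEAT_PROMPT, context)
        log_event({"noticed": context, "decided": decision})
        if decision["act"]:
            log_event({"did": decision["plan"]})  # real version would run tools here
        time.sleep(interval_seconds)  # "every few seconds"
</code></pre>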
<h3 id="heading-44-hidden-feature-flags">44 Hidden Feature Flags</h3>
<p>Beyond KAIROS, the codebase contains 44 hidden feature flags and 20+ unshipped features total:</p>
<ul>
<li>Background agents running 24/7</li>
<li>One Claude orchestrating multiple worker Claudes</li>
<li>Cron scheduling</li>
<li>Full voice command mode</li>
<li>Browser control via Playwright</li>
<li>Agents that sleep and self-resume</li>
</ul>
<h3 id="heading-the-architectural-insight">The Architectural Insight</h3>
<p>The most interesting thing about all of this isn't the feature list — it's the architectural decision underneath it.</p>
<p>Regular Claude Code is <strong>reactive</strong>: it acts only when you send a message. KAIROS introduces a <strong>proactive loop</strong>, which requires a fundamentally different trust model. The agent now needs to decide on its own what is worth doing, which means the quality of that judgment becomes far more important than in a simple request-response system.</p>
<p>That's a hard problem. The fact that Anthropic has it fully built and gated behind feature flags suggests they've been working on it for a long time — and are being careful about when and how they ship it.</p>
<hr />
<h2 id="heading-what-this-actually-means">What This Actually Means</h2>
<p>A few things land differently after this leak:</p>
<p><strong>On the legal question</strong>: AI-assisted clean-room rebuilds have broken the traditional copyright moat. The cost and complexity that used to make clean-room reverse engineering prohibitive is gone. This will get litigated eventually, and the outcome will reshape how proprietary software works.</p>
<p><strong>On KAIROS</strong>: Anthropic isn't behind on autonomous agents. They've shipped it internally and are gating it deliberately. Whether that's because the trust model isn't ready, the UX isn't right, or they're watching how OpenClaw lands — we don't know. But it exists.</p>
<p><strong>On the mistake itself</strong>: Sourcemap files in production npm packages are a process failure, not a developer failure. The blameless post-mortem framing from Boris Cherny is the right call. The interesting question is what the process change looks like.</p>
<hr />
<p><em>Source: <a target="_blank" href="https://read.engineerscodex.com/p/diving-into-claude-codes-source-code">Diving into Claude Code's Source Code Leak</a> — Engineer's Codex, April 1, 2026.</em></p>
]]></content:encoded></item><item><title><![CDATA[How Embeddings Actually Work: From Arbitrary IDs to the Geometry of Meaning]]></title><description><![CDATA[This post is based on How Embeddings Actually Work by Claudius Papirus — Episode 5 of the "How AI Actually Works" course.


Take the word king. Subtract man. Add woman. You get queen.
That's not a metaphor. That's real arithmetic, done on real number...]]></description><link>https://rzem.guru/how-embeddings-actually-work-from-arbitrary-ids-to-the-geometry-of-meaning</link><guid isPermaLink="true">https://rzem.guru/how-embeddings-actually-work-from-arbitrary-ids-to-the-geometry-of-meaning</guid><category><![CDATA[AI]]></category><category><![CDATA[#Embeddings]]></category><category><![CDATA[explainer]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[nlp]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Wed, 08 Apr 2026 22:41:37 GMT</pubDate><content:encoded><![CDATA[<blockquote>
<p><em>This post is based on <a target="_blank" href="https://youtu.be/7aATI_t5UeY">How Embeddings Actually Work</a> by Claudius Papirus — Episode 5 of the "How AI Actually Works" course.</em></p>
</blockquote>
<hr />
<p>Take the word <strong>king</strong>. Subtract <em>man</em>. Add <em>woman</em>. You get <em>queen</em>.</p>
<p>That's not a metaphor. That's real arithmetic, done on real numbers, learned by a model that read billions of words and figured out the relationship entirely on its own. No one programmed it. No one wrote a definition. It just… emerged.</p>
<p>This is the story of <strong>embeddings</strong> — the hidden layer where words stop being text and start becoming something a machine can actually think with.</p>
<hr />
<h2 id="heading-the-problem-tokens-are-meaningless">The Problem: Tokens Are Meaningless</h2>
<p>In <a target="_blank" href="https://youtu.be/VafJzZbihSM">Episode 2 of this series</a>, we learned that text gets broken into tokens. But tokens are just IDs — arbitrary numbers. The token for <em>cat</em> might be <code>9674</code>. That number tells you nothing about cats.</p>
<p>So between the raw token and the intelligent response you get back, something has to happen. The meaningless ID has to become a set of numbers that actually captures what the word <em>means</em>.</p>
<p>That bridge is an <strong>embedding</strong>.</p>
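<p>Mechanically, the bridge is nothing exotic — it's a row lookup in a learned matrix. A sketch, with random numbers standing in for learned ones:</p>
<pre><code>import numpy as np

# The embedding table: one learned row of floats per token ID.
# Real models learn these values during training; here they're random.
vocab_size, dims = 50_000, 300
embedding_table = np.random.default_rng(0).normal(size=(vocab_size, dims))

token_id = 9674                         # the arbitrary ID for "cat"
cat_vector = embedding_table[token_id]  # the lookup IS the embedding step
print(cat_vector.shape)                 # (300,) — numbers that can carry meaning
</code></pre>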
<hr />
<h2 id="heading-why-the-obvious-approaches-fail">Why the Obvious Approaches Fail</h2>
<h3 id="heading-option-1-sequential-numbering">Option 1: Sequential numbering</h3>
<p>Give every word a number. <em>The</em> = 1, <em>cat</em> = 2, <em>sat</em> = 3.</p>
<p>Problem: the model infers that <em>cat</em> is somehow "between" <em>the</em> and <em>sat</em>. Arbitrary numbering creates false relationships that have nothing to do with meaning.</p>
<h3 id="heading-option-2-one-hot-encoding">Option 2: One-hot encoding</h3>
<p>Give each word its own dimension. <em>Cat</em> = <code>[1, 0, 0, 0...]</code>, <em>dog</em> = <code>[0, 1, 0, 0...]</code>.</p>
<p>With a vocabulary of 50,000 words, you get a 50,000-dimensional space where every word is exactly as far from every other word. <em>Cat</em> is as distant from <em>kitten</em> as it is from <em>economics</em>. You've removed the false structure — but you've removed <em>all</em> structure.</p>
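<p>This is easy to verify: with one-hot vectors, every pair of distinct words sits at exactly the same distance — √2.</p>
<pre><code>from itertools import combinations
import numpy as np

vocab = ["cat", "kitten", "economics"]
one_hot = np.eye(len(vocab))  # one dimension per word

for (i, a), (j, b) in combinations(enumerate(vocab), 2):
    print(a, b, np.linalg.norm(one_hot[i] - one_hot[j]))
# cat kitten       1.4142...
# cat economics    1.4142...
# kitten economics 1.4142...   — identical distances, all structure gone
</code></pre>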
<p>What you actually need is a <strong>smaller space</strong> — a few hundred dimensions — where the geometry reflects meaning. Words that mean similar things should end up close together. Words that don't should be far apart.</p>
<p>But you can't design that by hand. Too many words, too many relationships, too many shades of meaning.</p>
<p>So you don't design it. You let the data build it.</p>
<hr />
<h2 id="heading-word2vec-teaching-a-model-to-learn-meaning">Word2Vec: Teaching a Model to Learn Meaning</h2>
<p>In 2013, Tomas Mikolov and his team at Google published a paper that changed how the field thinks about language. The key insight came from a 1957 observation by linguist J.R. Firth:</p>
<blockquote>
<p><em>"You shall know a word by the company it keeps."</em></p>
</blockquote>
<p>Words that appear in similar contexts tend to mean similar things. <em>Dog</em> and <em>cat</em> both show up near <em>pet</em>, <em>fed</em>, <em>walks</em>. <em>Dog</em> and <em>inflation</em> don't.</p>
<p>Mikolov's team made that idea trainable. They built a small neural network with a deceptively simple task: <strong>given a word, predict the words that surround it</strong>. No definitions. No dictionaries. No human labels. Just billions of words of raw text and one prediction task.</p>
<p>During training, each word gets mapped to a vector — a list of around 300 numbers. Those numbers get adjusted millions of times as the model learns to predict context better.</p>
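<p>That training task is reproducible today in a few lines with <code>gensim</code>, a standard open-source implementation of the method (not the original code — and the toy corpus here is ours; real training needs billions of words):</p>
<pre><code>from gensim.models import Word2Vec  # pip install gensim

sentences = [  # toy corpus, purely for illustration
    ["we", "fed", "the", "dog", "and", "walked", "our", "pet"],
    ["we", "fed", "the", "cat", "and", "walked", "our", "pet"],
    ["inflation", "rose", "sharply", "last", "quarter"],
]

model = Word2Vec(
    sentences,
    vector_size=300,  # ~300 numbers per word, as in the original paper
    window=5,         # predict words within 5 positions of the target
    min_count=1,
    sg=1,             # skip-gram: given a word, predict its context
)
print(model.wv["cat"].shape)  # (300,)
</code></pre>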
<p>When training is done, something remarkable emerges:</p>
<ul>
<li><em>Happy</em>, <em>joyful</em>, <em>cheerful</em> — <strong>neighbours in the space</strong></li>
<li><em>Run</em>, <em>sprint</em>, <em>jog</em> — <strong>neighbours in the space</strong></li>
<li>Words organised into a <strong>geography of meaning</strong>, with no one telling the model what anything meant</li>
</ul>
<p>They called it <strong>Word2Vec</strong>.</p>
<h3 id="heading-the-analogy-trick">The Analogy Trick</h3>
<p>Researchers then found something even more striking. The vectors didn't just cluster by similarity — they encoded <strong>relationships</strong>.</p>
<p>The direction from <em>man</em> to <em>woman</em> in the vector space is roughly the same direction as <em>king</em> to <em>queen</em>, and <em>uncle</em> to <em>aunt</em>. Gender is a consistent direction in the space.</p>
<p>So is tense: <em>walked</em> → <em>walking</em> matches <em>swam</em> → <em>swimming</em>.</p>
<p>So is geography: <em>Paris − France + Italy</em> lands near <em>Rome</em>.</p>
<p>Directions in a space nobody designed, encoding relationships nobody labelled — discovered purely from predicting which words appear near which.</p>
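<p>You can check the analogy arithmetic yourself against the original Google News vectors, which <code>gensim</code>'s downloader hosts (roughly a 1.7 GB download on first run):</p>
<pre><code>import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # the 2013-era pretrained vectors

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', 0.7118...)]
</code></pre>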
<hr />
<h2 id="heading-the-polysemy-problem">The Polysemy Problem</h2>
<p>Word2Vec had a flaw that seems obvious once you see it: <strong>each word gets exactly one vector</strong>, no matter what.</p>
<blockquote>
<p><em>"I deposited money at the bank."</em>
<em>"I sat by the river bank."</em></p>
</blockquote>
<p>In Word2Vec, <em>bank</em> is the same embedding in both sentences — a blurry average of every context it's ever appeared in. Not quite right for the financial meaning, not quite right for the river meaning.</p>
<p>This is the <strong>polysemy problem</strong>. One word, multiple meanings, one vector. <em>Light</em> in <em>light blue</em> vs <em>light bulb</em> vs <em>light as a feather</em> all collapse to the same point.</p>
<p>Static embeddings couldn't capture the fact that meaning shifts with context.</p>
<hr />
<h2 id="heading-the-2018-breakthrough-contextual-embeddings">The 2018 Breakthrough: Contextual Embeddings</h2>
<p>In 2018, two papers cracked it open:</p>
<ul>
<li><strong>ELMo</strong> from AI2</li>
<li><strong>BERT</strong> from Google</li>
</ul>
<p>Both arrived at the same answer from different angles: instead of one fixed vector per word, <strong>the embedding changes based on context</strong>. <em>Bank</em> next to <em>river</em> gets pulled in one direction. <em>Bank</em> next to <em>investment</em> gets pulled in another. Same word, different numbers.</p>
<p>This is exactly what happens inside the transformers that power modern AI. When a model processes your input:</p>
<ol>
<li>Each token starts with an <strong>initial embedding</strong> — looked up from a learned table</li>
<li>The <strong>attention mechanism</strong> examines every other token in the sequence</li>
<li>Layer by layer, the vectors get <strong>repositioned</strong> based on context</li>
<li>By the time a word has passed through dozens of layers, it's been reshaped into something specific to <em>this exact sentence, this exact position, this exact meaning</em></li>
</ol>
<p>The dimensions have scaled to match, too — from Word2Vec's 300 numbers to <strong>thousands</strong> in today's models. More dimensions, finer distinctions, more room for nuance.</p>
<p>A concrete example: in the sentence <em>"The cat was tired because it hadn't slept"</em> — by the final layer, the embedding for <em>it</em> has drifted toward <em>cat</em>. The model resolved the reference without being told to. The same word <em>it</em> in a different sentence would point somewhere entirely different.</p>
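<p>You can watch the same thing happen with the two <em>bank</em> sentences from earlier. A sketch using the open-source <code>transformers</code> library — <code>bert-base-uncased</code> is our choice of model for the demo:</p>
<pre><code>import torch  # pip install transformers torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    """Final-layer vector for `word` inside this exact sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # one vector per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

money = embedding_of("i deposited money at the bank", "bank")
river = embedding_of("i sat by the river bank", "bank")
print(torch.cosine_similarity(money, river, dim=0).item())
# noticeably below 1.0 — same word, different coordinates
</code></pre>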
<p>The embedding isn't a label anymore. It's <strong>a coordinate that moves with meaning</strong>.</p>
<hr />
<h2 id="heading-why-this-matters-its-running-everything">Why This Matters: It's Running Everything</h2>
<p>This geometry isn't just elegant — it's behind most of what you use today.</p>
<p><strong>Semantic search</strong>: When you search and find results that match your <em>meaning</em> rather than your exact words, embeddings are why. The search engine converts your question into a vector and compares it to document vectors. <em>"How to fix a leaky faucet"</em> matches <em>"plumbing repair guide"</em> — zero shared words, but their embeddings are close.</p>
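<p>The retrieval mechanic is short enough to sketch: embed everything, then rank by cosine similarity. The <code>sentence-transformers</code> library and the <code>all-MiniLM-L6-v2</code> model are real; the documents and query are our toy example.</p>
<pre><code>import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small open embedding model

docs = [
    "plumbing repair guide",
    "intro to macroeconomics",
    "caring for your kitten",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode("how to fix a leaky faucet", normalize_embeddings=True)

scores = doc_vecs @ query_vec  # cosine similarity, since vectors are unit length
print(docs[int(np.argmax(scores))])  # "plumbing repair guide" — zero shared words
</code></pre>
<p>RAG's retrieval step is this exact ranking, just run over a document store instead of three strings.</p>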
<p><strong>RAG (Retrieval-Augmented Generation)</strong>: When an AI retrieves relevant documents before answering your question, it's doing vector similarity search in embedding space.</p>
<p><strong>Recommendations</strong>: When a system finds content you didn't search for but somehow knew you'd want, it's comparing your preference vector to content vectors.</p>
<p><strong>Translation</strong>: When translation works between languages that structure sentences completely differently, the same principle applies — meaning has a shape, and that shape can be compared across languages.</p>
<hr />
<h2 id="heading-a-closing-thought">A Closing Thought</h2>
<p>Somewhere between the words you type and the response you get back, there's a space — high-dimensional, invisible, learned purely from patterns — where <em>happy</em> sits near <em>joyful</em>, and <em>king − man + woman</em> points toward <em>queen</em>.</p>
<p>Not because anyone decided it should. Because across billions of words, that's where the patterns put them.</p>
<p>Meaning, it turns out, isn't something you define. It's something that emerges when you pay enough attention to the company words keep.</p>
<hr />
<p><em>Episode 5 of the <a target="_blank" href="https://www.youtube.com/playlist?list=PL0m8aj_uWIA6i4dwEIWInzQOEI4vcYpvv">How AI Actually Works</a> course by Claudius Papirus. Previous episodes cover LLMs, tokens, training, and context windows.</em></p>
]]></content:encoded></item><item><title><![CDATA[You're charging 2023 rates for work AI does in 40 minutes + 2 prompts to see your real exposure]]></title><description><![CDATA[Read the original article
Summary: "You're Charging 2023 Rates for Work AI Does in 40 Minutes"
By Nate | Nate's Substack | April 7, 2026

Main Thesis
The global economy has always been built on inefficiency gaps — the distance between what something ...]]></description><link>https://rzem.guru/youre-charging-2023-rates-for-work-ai-does-in-40-minutes-2-prompts-to-see-your-real-exposure</link><guid isPermaLink="true">https://rzem.guru/youre-charging-2023-rates-for-work-ai-does-in-40-minutes-2-prompts-to-see-your-real-exposure</guid><category><![CDATA[AI]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Wed, 08 Apr 2026 12:01:19 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a956d23/P1z4dzevirgURkWBbtuKe_lJBHFLMP.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://natesnewsletter.substack.com/p/313-became-438000-in-30-days-youre">Read the original article</a></p>
<h2 id="heading-summary-youre-charging-2023-rates-for-work-ai-does-in-40-minutes">Summary: "You're Charging 2023 Rates for Work AI Does in 40 Minutes"</h2>
<p><strong>By Nate | Nate's Substack | April 7, 2026</strong></p>
<hr />
<h3 id="heading-main-thesis">Main Thesis</h3>
<p>The global economy has always been built on <strong>inefficiency gaps</strong> — the distance between what something costs to produce and what the market pays for it. AI is now closing these gaps at an unprecedented speed (months, not decades), rendering entire pricing models, business structures, and career strategies obsolete almost overnight.</p>
<hr />
<h3 id="heading-key-concept-arbitrage-amp-gap-closing">Key Concept: Arbitrage &amp; Gap Closing</h3>
<ul>
<li><strong>Law firms</strong> bill for 8 hours of research AI can do in 40 minutes</li>
<li><strong>Consulting decks</strong> still take 6 weeks even though information access was the only real barrier</li>
<li><strong>Offshore dev teams</strong> exist because of geographic pricing gaps — gaps AI is rapidly erasing</li>
<li>These are all forms of <strong>economic arbitrage</strong> — exploiting the gap between true cost and market price</li>
<li>AI is closing these gaps on the <strong>timescale of model releases</strong>, not business cycles</li>
</ul>
<hr />
<h3 id="heading-the-313-proof-of-concept">The $313 Proof of Concept</h3>
<ul>
<li>A bot on prediction market <strong>Polymarket</strong> turned <strong>$313 into ~$438,000 in 30 days</strong> (late 2025)</li>
<li>It didn't predict markets — it simply <strong>closed a pricing gap faster than humans could</strong></li>
<li>A developer reportedly <strong>rebuilt the entire system using Claude in ~40 minutes</strong></li>
<li>Critically: <strong>92.4% of wallets on the same platform lost money</strong> — proving access to AI ≠ advantage</li>
<li>The 7.6% who profited understood <em>what to build</em>, not just <em>that AI existed</em></li>
</ul>
<hr />
<h3 id="heading-five-categories-of-closing-inefficiency-taxonomy">Five Categories of Closing Inefficiency (Taxonomy)</h3>
<ol>
<li><strong>Speed gaps</strong> — tasks that took days now take minutes</li>
<li><strong>Knowledge asymmetry</strong> — the information edge that funded 30 years of offshoring is evaporating</li>
<li><strong>Formatting/research gaps</strong> — billable hours for mechanical work are collapsing</li>
<li><strong>Geographic pricing gaps</strong> — location-based cost arbitrage is shrinking</li>
<li><strong>Information distribution gaps</strong> — the lag between what's possible and what most people know is possible</li>
</ol>
<hr />
<h3 id="heading-the-compression-problem">The Compression Problem</h3>
<ul>
<li>AI <em>appears</em> to democratize advantage, but it mostly <strong>democratizes access</strong></li>
<li>True advantage goes to those who know <strong>which gaps to exploit and how</strong></li>
<li>Most people copy the surface (the bot) without understanding the mechanism underneath → they lose</li>
</ul>
<hr />
<h3 id="heading-the-rotation-dynamic">The Rotation Dynamic</h3>
<ul>
<li>Every time AI closes one gap, <strong>three new ones open elsewhere</strong></li>
<li>The "Mythos leak" referenced in the article previews a world of <strong>continuous disruption with no settling point</strong></li>
<li>Strategic plans written in 2025 are already potentially obsolete</li>
</ul>
<hr />
<h3 id="heading-practical-takeaways">Practical Takeaways</h3>
<ul>
<li><strong>Run a diagnostic on your own role</strong>: Ask where your current pricing or value is based on <em>historical inefficiency</em> rather than genuine skill</li>
<li><strong>Three diagnostic questions</strong> (paywalled in full) help map where value is heading in any industry</li>
<li><strong>Stop charging 2023 rates</strong> for work AI compresses into 40 minutes — clients will eventually notice</li>
<li>Understand the <em>mechanism</em>, not just the tool — copying AI use cases without understanding the underlying gap is a losing strategy</li>
<li>The window to <strong>reposition before the gap closes</strong> is measured in months, not years</li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a956d23/AVXwlIiGxUneDpRT1osml_n2RLlkqm.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a956d23/P1z4dzevirgURkWBbtuKe_lJBHFLMP.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Your AI Agent Depends on Six Layers — Here's Which Ones Won't Last]]></title><description><![CDATA[Read the original article
Your AI Agent Depends on Six Layers — Here's Which Ones Won't Last
Main Thesis
A new infrastructure stack is forming beneath AI agents, and most builders can't distinguish which layers are durable from which are temporary st...]]></description><link>https://rzem.guru/your-ai-agent-depends-on-six-layers-heres-which-ones-wont-last</link><guid isPermaLink="true">https://rzem.guru/your-ai-agent-depends-on-six-layers-heres-which-ones-wont-last</guid><category><![CDATA[AI]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Tue, 07 Apr 2026 12:01:04 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a954b62/EuKlqkOuCXUnOxtkPw6FW_yiZkyfdW.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://natesnewsletter.substack.com/p/your-ai-agent-depends-on-six-layers">Read the original article</a></p>
<h2 id="heading-your-ai-agent-depends-on-six-layers-heres-which-ones-wont-last">Your AI Agent Depends on Six Layers — Here's Which Ones Won't Last</h2>
<h3 id="heading-main-thesis">Main Thesis</h3>
<p>A new infrastructure stack is forming beneath AI agents, and most builders can't distinguish which layers are durable from which are temporary stopgaps. Nate argues that understanding this stack early is a competitive advantage — mirroring how early readers of the cloud and API-first transitions built defining companies, while late adapters paid in migration costs and lost time.</p>
<h3 id="heading-the-mental-model">The Mental Model</h3>
<ul>
<li>The right analogy is <strong>system calls</strong>, not Lego bricks — these layers are fundamental OS-level primitives for AI agents, not modular optional components.</li>
<li>The stack is being built for <strong>AI agents as the primary user</strong>, not humans.</li>
</ul>
<h3 id="heading-the-six-layers-with-durability-ratings">The Six Layers (with durability ratings)</h3>
<ol>
<li><strong>Compute</strong> — How agents access processing power</li>
<li><strong>Identity</strong> — How agents authenticate and are recognized</li>
<li><strong>Memory</strong> — How agents retain and retrieve context</li>
<li><strong>Tool Access</strong> — How agents interact with external services</li>
<li><strong>Billing</strong> — How agent-driven actions are metered and charged</li>
<li><strong>Orchestration</strong> — How agents are coordinated and managed</li>
</ol>
<p>Each layer is assessed for longevity — some are described as <strong>load-bearing walls lasting a decade</strong>, others as <strong>transitional workarounds agents will outgrow within 18 months</strong>.</p>
<h3 id="heading-key-finding-the-biggest-gap">Key Finding: The Biggest Gap</h3>
<ul>
<li><strong>Orchestration</strong> is identified as the most critical unsolved problem — the next infrastructure-defining opportunity — and no one has cracked it yet.</li>
<li>Several layers that will define the next infrastructure-scale company <strong>don't exist yet</strong>.</li>
</ul>
<h3 id="heading-practical-takeaways">Practical Takeaways</h3>
<ul>
<li>Builders should audit which layers they're dependent on and assess their durability.</li>
<li>Avoid <strong>transitional lock-in</strong> — building deeply on layers likely to be replaced soon.</li>
<li>Focus on <strong>reliability math</strong> when designing agent systems.</li>
<li>Develop builder skills aligned with the durable layers, not the temporary ones.</li>
<li>Over 1,000 startups and hundreds of millions in VC are already in this space — the window to get ahead of the stack is narrowing.</li>
</ul>
<h3 id="heading-bottom-line">Bottom Line</h3>
<p>The agent infrastructure stack is real, it's forming now, and the builders who can read it accurately will define the next era of AI — just as cloud-native builders defined the last one.</p>
<p><img src="https://v3b.fal.media/files/b/0a954b62/wJVCYt4ZZ5L7Mtvop7xLn_qvv6apri.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a954b62/EuKlqkOuCXUnOxtkPw6FW_yiZkyfdW.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[I Tested Cowork, Lindy, Sauna, and Opal Against 3 Questions. The Best Scored 1 out of 4.]]></title><description><![CDATA[Read the original article
Summary: I Tested Cowork, Lindy, Sauna, and Opal Against 3 Questions. The Best Scored 1 out of 4.
Main Thesis
A wave of 'outcome agent' tools (Lindy, Sauna, Google Opal, Cowork, Obvious) are pitching software that does the w...]]></description><link>https://rzem.guru/i-tested-cowork-lindy-sauna-and-opal-against-3-questions-the-best-scored-1-out-of-4-1</link><guid isPermaLink="true">https://rzem.guru/i-tested-cowork-lindy-sauna-and-opal-against-3-questions-the-best-scored-1-out-of-4-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Mon, 06 Apr 2026 05:22:38 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a95204d/lqJPwlh-A_A30l--1NM58_1dz0RY4F.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://natesnewsletter.substack.com/p/every-ai-agent-you-use-has-the-same">Read the original article</a></p>
<h1 id="heading-summary-i-tested-cowork-lindy-sauna-and-opal-against-3-questions-the-best-scored-1-out-of-4">Summary: <em>I Tested Cowork, Lindy, Sauna, and Opal Against 3 Questions. The Best Scored 1 out of 4.</em></h1>
<h2 id="heading-main-thesis">Main Thesis</h2>
<p>A wave of 'outcome agent' tools (Lindy, Sauna, Google Opal, Cowork, Obvious) is pitching software that <strong>does the work instead of helping you do the work</strong> — but almost none of them can answer the fundamental question: <strong>how does the agent know its own output is any good?</strong></p>
<p>The core insight is a structural one: <strong>AI agents excel in environments with automated feedback loops</strong> (like coding, where tests pass or fail) but struggle in knowledge work environments (like drafting strategy memos) where <strong>the human is the only feedback mechanism</strong>.</p>
<h2 id="heading-key-findings">Key Findings</h2>
<ul>
<li><strong>Best performer scored only 1 out of 4</strong> on the evaluation framework — a damning result across the board</li>
<li>The reason code-based AI agents succeeded first is structural: code has test suites that provide instant, objective feedback. Knowledge work has no equivalent</li>
<li>Most outcome agent demos sidestep this problem entirely, hiding it behind polished UI and impressive-looking outputs</li>
<li>Tools reviewed: <strong>Cowork, Lindy, Sauna (Obvious), and Google Opal</strong> — all tested against a 3-question framework</li>
<li>A single AI agent (likely Manus or similar) triggered a <strong>quarter-trillion-dollar selloff</strong> in enterprise software stocks, despite being a research preview that stops working when your laptop sleeps</li>
</ul>
<h2 id="heading-the-evaluation-framework-3-questions">The Evaluation Framework (3 Questions)</h2>
<p>Nate builds a framework around the feedback-loop insight to separate real agents from fake ones:</p>
<ol>
<li><strong>Does the agent know when it's wrong?</strong> (automated vs. human-only feedback)</li>
<li><strong>Is the output inspectable?</strong> (can you audit what it did and why?)</li>
<li><strong>Does context compound over time?</strong> (memory architecture that improves with use)</li>
</ol>
<h2 id="heading-practical-takeaways">Practical Takeaways</h2>
<ul>
<li><strong>Write the tests before the agent runs the work</strong> — define what 'good output' looks like before delegating (a sketch of what that can mean for prose follows this list)</li>
<li>Look for agents with <strong>inspectable surfaces</strong> — you need to see reasoning, not just results</li>
<li><strong>Memory architecture matters</strong> — agents that retain compounding context are structurally superior</li>
<li>Use the included <strong>two-phase evaluation prompt</strong> to score any agent tool, then build a delegation spec calibrated to its actual weaknesses</li>
<li>The pitch ('outcomes, not answers') <em>might</em> be right eventually — but the infrastructure to support it reliably doesn't yet exist at the level being marketed</li>
</ul>
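<p>A toy illustration of that first takeaway — the checks below are entirely our own invention; real acceptance criteria would come from your delegation spec:</p>
<pre><code># "Write the tests before the agent runs the work" — hypothetical sketch.
def acceptance_checks(draft):
    """Define 'good output' BEFORE delegating, then run every draft through it."""
    return {
        "names a recommendation": "recommend" in draft.lower(),
        "cites at least one number": any(ch.isdigit() for ch in draft),
        "fits one page": len(draft.split()) &lt;= 400,
    }

draft = "We recommend option B: it cuts onboarding time by 30%."
results = acceptance_checks(draft)
print(results)                # the human reviews failures, not everything
print(all(results.values()))  # the agent finally gets a pass/fail signal
</code></pre>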
<h2 id="heading-bottom-line">Bottom Line</h2>
<p>Outcome agents are being sold ahead of their actual capabilities. Until feedback loops in knowledge work are solved, <strong>humans remain the QA layer</strong> — and most tools aren't designed with that reality in mind.</p>
<p><img src="https://v3b.fal.media/files/b/0a95204d/QlmXeQgzLQjGUtj4PYJhK_viMHrNxd.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a95204d/lqJPwlh-A_A30l--1NM58_1dz0RY4F.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents of the Week: Papers You...]]></title><description><![CDATA[Read the original article
AI Agents of the Week – LLM Watch (Feb 8, 2026)
Main Thesis
The frontier of AI agent research is rapidly maturing across five dimensions: architecture design, multi-agent collaboration, planning under uncertainty, safety, an...]]></description><link>https://rzem.guru/ai-agents-of-the-week-papers-you-1-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-of-the-week-papers-you-1-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Sat, 04 Apr 2026 18:38:58 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a94ef77/oOg8egh8DZfd2s9xpXmTS_kfiN1Wi8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://www.llmwatch.com/p/ai-agents-of-the-week-papers-you-e74">Read the original article</a></p>
<h1 id="heading-ai-agents-of-the-week-llm-watch-feb-8-2026">AI Agents of the Week – LLM Watch (Feb 8, 2026)</h1>
<h2 id="heading-main-thesis">Main Thesis</h2>
<p>The frontier of AI agent research is rapidly maturing across five dimensions: architecture design, multi-agent collaboration, planning under uncertainty, safety, and evaluation. Agents are evolving from simple chatbots into modular, self-improving systems capable of handling complex, long-horizon tasks — but new challenges around reliability, safety, and interpretability are emerging in parallel.</p>
<hr />
<h2 id="heading-key-findings">Key Findings</h2>
<h3 id="heading-1-modular-hierarchical-amp-self-improving-architectures">1. 🏗️ Modular, Hierarchical &amp; Self-Improving Architectures</h3>
<ul>
<li><strong>S1-NexusAgent</strong> uses a dual-loop design separating global planning from tool-based subtasks, with a "Critic" module that distills successful trajectories into reusable skills.</li>
<li><strong>MARS (Modular Agent with Reflective Search)</strong> introduces cost-aware planning and reflective memory for expensive AI research workflows.</li>
<li>Agents break problems into parts, orchestrate specialised modules, and continuously build competencies over time.</li>
</ul>
<h3 id="heading-2-multi-agent-systems-standardisation-amp-teamwork-pitfalls">2. 🤝 Multi-Agent Systems: Standardisation &amp; Teamwork Pitfalls</h3>
<ul>
<li>Researchers propose reusable <strong>"agent primitives"</strong> (e.g. Review, Voting &amp; Selection, Planning &amp; Execution) composable via an organiser agent with shared key-value memory — higher accuracy, lower token cost.</li>
<li>A separate study found LLM agent teams <strong>often underperform their best individual member</strong>, with consensus-seeking causing up to <strong>37% performance drops</strong>.</li>
<li>Upside: consensus-driven teams showed unexpected <strong>resilience against adversarial members</strong>.</li>
<li>Takeaway: AI collaboration needs new mechanisms to leverage expert agents without groupthink.</li>
</ul>
<h3 id="heading-3-planning-under-uncertainty-world-models-amp-assumption-handling">3. 🧭 Planning Under Uncertainty: World Models &amp; Assumption Handling</h3>
<ul>
<li><strong>Planner-Composer-Evaluator (PCE)</strong> framework converts implicit LLM assumptions into an explicit decision tree, scoring scenarios by likelihood and cost — outperforming dialogue-heavy baselines with far less communication.</li>
<li><strong>Reinforcement World Model Learning (RWML)</strong> gives agents an internal world model, aligning imagined next states with actual outcomes — significant task success boosts even without direct reward feedback.</li>
<li>Trend: agents are shifting toward "thinking before acting" — simulating outcomes before committing to actions.</li>
</ul>
<h3 id="heading-4-safety-amp-reliability-at-the-trajectory-level">4. 🛡️ Safety &amp; Reliability at the Trajectory Level</h3>
<ul>
<li><strong>AgentHeLLM</strong> threat-modeling framework maps "Agent-to-Agent" attack pathways (e.g. in AI vehicle copilots), separating what needs protection from how attacks occur.</li>
<li>A conceptual study argues existing uncertainty quantification methods (designed for single-turn QA) <strong>break down for sequential agent decisions</strong>.</li>
<li>Proposed reframe: agent confidence as <strong>conditionally reducible uncertainty</strong> — agents should actively gather information to reduce what they don't know, rather than treating uncertainty as something that only accumulates.</li>
<li>Future designs will integrate explicit uncertainty modeling and threat assessment into decision loops.</li>
</ul>
<h3 id="heading-5-interpretability-amp-evaluation-catching-up">5. 🔍 Interpretability &amp; Evaluation Catching Up</h3>
<ul>
<li>A data-centric interpretability paper used <strong>sparse autoencoders + LLM summarisers</strong> to analyse multi-agent training logs, uncovering emergent behaviours (role-playing, language switching) and a hidden <strong>reward-hacking strategy</strong> missed by standard metrics.</li>
<li>Incorporating discovered insights via a refined prompt boosted agent performance by <strong>14%</strong>.</li>
<li>Growing call for <strong>unified evaluation frameworks</strong> — current benchmarks vary wildly due to inconsistent prompts, tools, and environments.</li>
</ul>
<hr />
<h2 id="heading-practical-takeaways">Practical Takeaways</h2>
<ul>
<li><strong>Builders</strong>: Adopt modular agent architectures with skill reuse and reflective memory to handle complex tasks more efficiently.</li>
<li><strong>Teams deploying multi-agent systems</strong>: Don't assume collaboration = better performance. Design explicit mechanisms for expert agents to lead rather than average out.</li>
<li><strong>Safety teams</strong>: Move beyond output-level checks — model threats at the trajectory level and build agents that know their own uncertainty.</li>
<li><strong>Researchers &amp; evaluators</strong>: Invest in interpretability tooling and standardised benchmarks now, before autonomous agents are deployed at scale.</li>
<li><strong>Everyone</strong>: The "safety net" (monitoring, interpretability, evaluation) must grow alongside agent capabilities — capability without accountability is a risk multiplier.</li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a94ef77/_zQOYcje_MZ6FSoDUBTmH_EFnvxzet.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a94ef77/oOg8egh8DZfd2s9xpXmTS_kfiN1Wi8.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents of the Week: Papers You...]]></title><description><![CDATA[Read the original article
AI Agents of the Week — LLM Watch (Feb 15, 2026)
Main Thesis
This week's AI agent research challenges several prevailing assumptions about how to build, guide, and scale autonomous agents — from documentation practices to mu...]]></description><link>https://rzem.guru/ai-agents-of-the-week-papers-you-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-of-the-week-papers-you-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Sat, 04 Apr 2026 18:38:06 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a94ef70/4fw4v_ct6cNwsAtquHe5Q_KsaPpDJI.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://www.llmwatch.com/p/ai-agents-of-the-week-papers-you-43c">Read the original article</a></p>
<h1 id="heading-ai-agents-of-the-week-llm-watch-feb-15-2026">AI Agents of the Week — LLM Watch (Feb 15, 2026)</h1>
<h2 id="heading-main-thesis">Main Thesis</h2>
<p>This week's AI agent research challenges several prevailing assumptions about how to build, guide, and scale autonomous agents — from documentation practices to multi-agent coordination and compute allocation.</p>
<hr />
<h2 id="heading-key-findings">Key Findings</h2>
<h3 id="heading-memory-amp-context">🧠 Memory &amp; Context</h3>
<ul>
<li><strong>AGENTS.md files hurt performance</strong>: Contrary to popular practice, repository-level context files reduce task success rates for coding agents while increasing inference costs by <strong>&gt;20%</strong>.</li>
<li><strong>Less is more</strong>: Minimal or no instructions outperform comprehensive documentation, suggesting unnecessary constraints impede agents rather than help them.</li>
</ul>
<h3 id="heading-planning-amp-environment">🗺️ Planning &amp; Environment</h3>
<ul>
<li><strong>Gaia2 benchmark</strong>: Introduces dynamic, evolving environments independent of agent actions. Best results: GPT-5 (high) at <strong>42% pass@1</strong> but struggles with time-sensitive tasks; Kimi-K2 (open-source) at <strong>21% pass@1</strong>.</li>
<li><strong>CATTS (Confidence-Aware Test-Time Scaling)</strong>: Outperforms naive uniform compute sampling by up to <strong>9.1%</strong> on WebArena-Lite while using <strong>2.3x fewer tokens</strong> — smart allocation beats brute-force compute.</li>
</ul>
<h3 id="heading-multi-agent-collaboration">🤝 Multi-Agent Collaboration</h3>
<ul>
<li><strong>Communication delays create U-shaped cooperation</strong>: Moderate delays cause LLM agents to exploit slower peers; excessive delay paradoxically reduces exploitation cycles.</li>
<li><strong>FLCOA framework</strong>: Five-layer model showing that low-level factors like communication resources fundamentally shape multi-agent cooperation — largely overlooked in current system design.</li>
<li><strong>LAVES</strong>: Hierarchical multi-agent system for educational video generation achieves <strong>&gt;1 million videos/day</strong> throughput with a <strong>95% cost reduction</strong> vs. industry standards.</li>
</ul>
<h3 id="heading-trust-amp-safety">🔒 Trust &amp; Safety</h3>
<ul>
<li><strong>Behavioral inconsistency predicts failure</strong>: ReAct agents produce <strong>2.0–4.2 distinct action sequences</strong> per 10 identical runs. Tasks with consistent paths achieve <strong>80–92% accuracy</strong>; highly inconsistent tasks drop to <strong>25–60%</strong>.</li>
<li><strong>69% of divergence occurs at step 2</strong>, meaning early decisions cascade into downstream failures — making early-step monitoring a practical intervention point.</li>
</ul>
<h3 id="heading-tools-amp-benchmarks">🛠️ Tools &amp; Benchmarks</h3>
<ul>
<li><strong>Mobile dev AI agents</strong>: Study of 2,901 AI-authored PRs across 193 Android/iOS repos. Android sees <strong>2x more AI PRs</strong> with higher acceptance (71% vs. 63% iOS). Routine tasks succeed most; structural refactors lag.</li>
<li><strong>AmbiBench</strong>: First benchmark using an instruction clarity taxonomy, shifting evaluation toward <strong>bidirectional intent alignment</strong> — addressing the reality that users often fail to articulate precise directives upfront.</li>
</ul>
<hr />
<h2 id="heading-practical-takeaways">Practical Takeaways</h2>
<ol>
<li><strong>Strip down AGENTS.md files</strong> — comprehensive instructions may be actively harming your coding agents.</li>
<li><strong>Monitor behavioral consistency</strong> as a real-time reliability signal; early divergence is a strong failure predictor (see the sketch after this list).</li>
<li><strong>Use confidence-aware compute allocation</strong> rather than scaling uniformly for better efficiency and performance.</li>
<li><strong>Design multi-agent systems with communication latency in mind</strong> — it shapes cooperation in non-obvious ways.</li>
<li><strong>Evaluate agents on ambiguous instructions</strong>, not just clean ones — AmbiBench highlights a critical gap in current benchmarking.</li>
</ol>
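<p>The consistency signal from point 2 is cheap to compute. A sketch of the general idea — our own illustration, not the paper's code:</p>
<pre><code># Run the same task N times, count distinct action sequences, and find the
# step where runs first disagree (the study puts 69% of divergence at step 2).
from collections import Counter
from itertools import zip_longest

def consistency_report(runs):
    """runs: one action sequence (a list of tool-call names) per identical run."""
    distinct = Counter(tuple(r) for r in runs)
    diverged_at = None
    for step, actions in enumerate(zip_longest(*runs), start=1):
        if len(set(actions)) != 1:  # runs disagree at this step
            diverged_at = step
            break
    return {"distinct_sequences": len(distinct), "first_divergence_step": diverged_at}

runs = [  # three "identical" runs of one task
    ["read_file", "edit", "run_tests"],
    ["read_file", "search", "edit", "run_tests"],
    ["read_file", "edit", "run_tests"],
]
print(consistency_report(runs))
# {'distinct_sequences': 2, 'first_divergence_step': 2}
</code></pre>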
<p><img src="https://v3b.fal.media/files/b/0a94ef70/qmtv9M7nHh_V3DQbCxyGr_M5AnHJVz.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a94ef70/4fw4v_ct6cNwsAtquHe5Q_KsaPpDJI.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents of the Week: Papers You...]]></title><description><![CDATA[Read the original article
AI Agents of the Week – LLM Watch (Feb 22, 2026)
Main Thesis
This weekly research roundup from LLM Watch highlights five key areas where AI agent research is rapidly advancing: memory & continual learning, planning under unc...]]></description><link>https://rzem.guru/ai-agents-of-the-week-papers-you-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-of-the-week-papers-you-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Sat, 04 Apr 2026 18:37:01 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a94ef6a/2Mlt-iDGn2DR8DkdBIop1_CgVD48nu.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://www.llmwatch.com/p/ai-agents-of-the-week-papers-you-6f1">Read the original article</a></p>
<h1 id="heading-ai-agents-of-the-week-llm-watch-feb-22-2026">AI Agents of the Week – LLM Watch (Feb 22, 2026)</h1>
<h2 id="heading-main-thesis">Main Thesis</h2>
<p>This weekly research roundup from LLM Watch highlights five key areas where AI agent research is rapidly advancing: memory &amp; continual learning, planning under uncertainty, multi-agent collaboration, trust &amp; safety, and practical tooling.</p>
<hr />
<h2 id="heading-key-findings">Key Findings</h2>
<h3 id="heading-memory-amp-continual-learning">🧠 Memory &amp; Continual Learning</h3>
<ul>
<li><strong>IntentCUA</strong> introduces <em>intent-level representations</em> that convert raw interaction traces into reusable skills.</li>
<li>Achieves a <strong>74.83% task success rate</strong> with a <strong>Step Efficiency Ratio of 0.91</strong> on desktop automation tasks.</li>
<li>Uses a coordinated Planner, Plan-Optimizer, and Critic sharing memory to stabilise long-horizon execution.</li>
</ul>
<h3 id="heading-planning-amp-environment-interaction">🗺️ Planning &amp; Environment Interaction</h3>
<ul>
<li><strong>AgentConductor</strong> uses reinforcement learning to evolve multi-agent communication topologies dynamically.</li>
<li>Delivers up to <strong>14.6% improvement in pass@1 accuracy</strong> over baselines for code generation.</li>
<li>Density-aware layered DAG construction <strong>reduces token costs by 68%</strong> — a major efficiency win for compute-constrained deployments.</li>
</ul>
<h3 id="heading-multi-agent-collaboration-amp-control">🤝 Multi-Agent Collaboration &amp; Control</h3>
<ul>
<li>AgentConductor shows that <strong>adapting topology to task difficulty</strong> outperforms fixed communication graphs, with <strong>13% density reductions</strong> alongside accuracy gains.</li>
<li><strong>AutoNumerics</strong> applies multi-agent orchestration to scientific computing, autonomously designing and verifying PDE solvers across <strong>24 canonical problems</strong>.</li>
<li>Key insight: <em>the architecture of agent collaboration matters more than individual agent capability.</em></li>
</ul>
<h3 id="heading-trust-verification-amp-safety">🔒 Trust, Verification &amp; Safety</h3>
<ul>
<li><strong>Wink</strong> is a production-deployed system for recovering from coding agent misbehaviours.</li>
<li>Found that <strong>~30% of all agent trajectories</strong> contain misbehaviours: Specification Drift, Reasoning Problems, or Tool Call Failures.</li>
<li>Lightweight self-intervention resolved <strong>90% of single-intervention misbehaviours</strong> and reduced engineer interventions in live A/B testing.</li>
<li><strong>CowCorpus</strong> provides a taxonomy of human intervention patterns, enabling models to predict user interventions with a <strong>61.4–63.4% improvement</strong> over baselines.</li>
</ul>
<h3 id="heading-tools-amp-frameworks-in-practice">🛠️ Tools &amp; Frameworks in Practice</h3>
<ul>
<li><em>How AI Coding Agents Communicate</em> analyses pull requests across five AI coding agents.</li>
<li>Finds that <strong>presentation style correlates with reviewer engagement and merge outcomes</strong> — agents that communicate clearly get their PRs merged more often.</li>
</ul>
<hr />
<h2 id="heading-practical-takeaways">Practical Takeaways</h2>
<ul>
<li><strong>Build for long horizons</strong>: Intent-level memory abstraction (IntentCUA) is a viable path to more reliable long-running agents.</li>
<li><strong>Dynamic topology &gt; static graphs</strong>: Fixed multi-agent communication structures leave significant performance and cost on the table.</li>
<li><strong>Expect ~30% misbehaviour rates</strong>: Production agent systems need built-in recovery mechanisms, not just prevention.</li>
<li><strong>Human-in-the-loop is predictable</strong>: Models can now anticipate when humans will intervene, enabling proactive agent self-correction.</li>
<li><strong>Agent communication style matters</strong>: How an agent explains its work affects real-world outcomes like code review acceptance.</li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a94ef6a/pVaRKyj_v4WaDlAiIBgc4_umQ3NDQb.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a94ef6a/2Mlt-iDGn2DR8DkdBIop1_CgVD48nu.png" alt="Infographic wide" /></p>
]]></content:encoded></item></channel></rss>