<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[AI with Alex & Angus]]></title><description><![CDATA[This blog started as a way to keep up with a fast-moving field without losing my mind. Now Angus (my AI collaborator) does the reading, we do the distilling together, and you get the good stuff.]]></description><link>https://rzem.guru</link><image><url>https://cdn.hashnode.com/uploads/logos/68f98b977a2367a3b72e817c/b2e619d7-9ece-44f8-aeba-39b05532a27c.jpg</url><title>AI with Alex &amp; Angus</title><link>https://rzem.guru</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 17:06:34 GMT</lastBuildDate><atom:link href="https://rzem.guru/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[512,000 Lines of Leaked Code Reveal the Lock-In Strategy Coming for Your AI Stack]]></title><description><![CDATA[Read the original article
512,000 Lines of Leaked Code Reveal Anthropic's Lock-In Strategy
Main Thesis
Anthropic accidentally published ~500,000 lines of Claude Code source code via a packaging error. Buried within it is evidence of an unannounced al...]]></description><link>https://rzem.guru/512000-lines-of-leaked-code-reveal-the-lock-in-strategy-coming-for-your-ai-stack</link><guid isPermaLink="true">https://rzem.guru/512000-lines-of-leaked-code-reveal-the-lock-in-strategy-coming-for-your-ai-stack</guid><category><![CDATA[AI]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 12:03:26 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a958ee3/8FeFw4j8oq4y_NA3JODA7_yIGxWFn1.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://natesnewsletter.substack.com/p/the-platform-play-hidden-in-512000">Read the original article</a></p>
<h2 id="heading-512000-lines-of-leaked-code-reveal-anthropics-lock-in-strategy">512,000 Lines of Leaked Code Reveal Anthropic's Lock-In Strategy</h2>
<h3 id="heading-main-thesis">Main Thesis</h3>
<p>Anthropic accidentally published ~500,000 lines of Claude Code source code via a packaging error. Buried within it is evidence of an unannounced always-on agent called <strong>Conway</strong> — and when combined with Anthropic's recent product moves, it reveals a deliberate <strong>platform lock-in strategy</strong> comparable to historical tech monopoly plays.</p>
<hr />
<h3 id="heading-key-findings">Key Findings</h3>
<h4 id="heading-what-is-conway">What is Conway?</h4>
<ul>
<li>A <strong>standalone agent environment</strong> (separate from Claude chat)</li>
<li>Always-on: can be <strong>woken by external events</strong></li>
<li>Has <strong>browser control</strong> and integrations with third-party tools</li>
<li>Supports its own <strong>proprietary extension format</strong> (<code>.cnw.zip</code>)</li>
<li>Not publicly announced — discovered only through the leak</li>
</ul>
<h4 id="heading-the-five-strategic-moves">The Five Strategic Moves</h4>
<p>Nate connects Conway to five other Anthropic initiatives as a unified platform play:</p>
<ol>
<li><strong>Claude Code Channels</strong> — deepening developer workflow integration</li>
<li><strong>Cowork</strong> — collaborative agent environments</li>
<li><strong>The Marketplace</strong> — ecosystem of tools/extensions</li>
<li><strong>The Partner Network</strong> — third-party lock-in via certified integrations</li>
<li><strong>The OpenClaw ban</strong> — controlling what agents can connect to</li>
</ol>
<h4 id="heading-the-cnwzip-question">The <code>.cnw.zip</code> Question</h4>
<ul>
<li>Conway's proprietary extension format sits <strong>on top of MCP</strong> (Model Context Protocol)</li>
<li>Nate compares this to the <strong>Google Play Services playbook</strong>: open standard underneath, proprietary layer on top that becomes the real dependency</li>
<li>Tool builders targeting Conway's format become dependent on Anthropic's ecosystem</li>
</ul>
<h4 id="heading-the-lock-in-nobodys-talking-about">The Lock-In Nobody's Talking About</h4>
<ul>
<li>An always-on agent that learns your workflows, preferences, and organizational context builds <strong>behavioral memory</strong></li>
<li>This creates switching costs <strong>deeper than anything Microsoft or Salesforce built</strong> — because it's not just data, it's <em>learned context</em> about how your team thinks and operates</li>
<li>Moving away means losing an AI that has internalized your organization</li>
</ul>
<hr />
<h3 id="heading-practical-takeaways">Practical Takeaways</h3>
<ul>
<li><strong>Map your platform dependencies</strong> before Conway-style agents become default infrastructure</li>
<li><strong>Negotiate portability clauses</strong> in enterprise AI contracts now, before lock-in is established</li>
<li><strong>Choose your agent memory architecture deliberately</strong> — don't let vendor defaults make that decision for you</li>
<li>Nate provides <strong>three prompts</strong> to help teams act on each of these steps</li>
<li>The historical parallel: companies that ignored similar platform consolidation moves in prior tech cycles paid dearly — treat this as an early warning signal</li>
</ul>
<hr />
<h3 id="heading-bottom-line">Bottom Line</h3>
<p>Conway isn't just a product feature — it's Anthropic's bid to become the <strong>operating system layer for enterprise AI</strong>. The leak revealed the strategy before the announcement. Teams deploying AI at scale should be paying close attention now.</p>
<p><img src="https://v3b.fal.media/files/b/0a958ee3/Ovj6DqJfkPLEsM1pePiKK_TJwi2yLz.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a958ee3/8FeFw4j8oq4y_NA3JODA7_yIGxWFn1.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents Weekly: GPT-5.3 Codex Spark]]></title><description><![CDATA[Read the original article
AI Agents Weekly: GPT-5.3-Codex-Spark & More — Summary
From Elvis Saravia's AI Newsletter, February 14, 2026

Main Thesis
This issue covers a packed week in AI agents and frontier models, headlined by OpenAI's new agentic co...]]></description><link>https://rzem.guru/ai-agents-weekly-gpt-53-codex-spark-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-weekly-gpt-53-codex-spark-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:28:52 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a95871a/mY_AgKJEQP8vrvMMfea87_0gXWUbb9.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/ai-agents-weekly-gpt-53-codex-spark">Read the original article</a></p>
<h1 id="heading-ai-agents-weekly-gpt-53-codex-spark-amp-more-summary">AI Agents Weekly: GPT-5.3-Codex-Spark &amp; More — Summary</h1>
<p><em>From Elvis Saravia's AI Newsletter, February 14, 2026</em></p>
<hr />
<h2 id="heading-main-thesis">Main Thesis</h2>
<p>This issue covers a packed week in AI agents and frontier models, headlined by OpenAI's new agentic coding model, Zhipu AI's open-source powerhouse, and a wave of breakthroughs across autonomous systems, benchmarks, and developer tooling.</p>
<hr />
<h2 id="heading-key-stories-accessible-content">Key Stories (Accessible Content)</h2>
<h3 id="heading-gpt-53-codex-spark-openai">🔥 GPT-5.3-Codex-Spark (OpenAI)</h3>
<ul>
<li>OpenAI's <strong>most capable agentic coding model</strong> to date, running <strong>25% faster</strong> than its predecessor.</li>
<li><strong>Self-developing:</strong> Early versions of GPT-5.3 were used to debug its own training, manage deployment, and interpret evaluation results — making it the first OpenAI model instrumental in its own creation.</li>
<li><strong>Beyond coding:</strong> Handles professional knowledge-work outputs including presentations, spreadsheets, and documentation. Wins or ties <strong>70.9%</strong> of evaluations on the GDPval knowledge-work benchmark.</li>
<li><strong>Cybersecurity flag:</strong> First OpenAI model to hit <strong>"high" cybersecurity capability</strong> under their Preparedness Framework — meaning it could meaningfully enable real-world cyber harm if misused. OpenAI responded by announcing a <strong>$10M API credits program</strong> for cyber defense research.</li>
</ul>
<hr />
<h3 id="heading-glm-5-zhipu-ai">🧠 GLM-5 (Zhipu AI)</h3>
<ul>
<li>A massive <strong>744B-parameter Mixture-of-Experts (MoE)</strong> model with <strong>40B active parameters</strong>, built specifically for agentic intelligence and multi-step reasoning.</li>
<li><strong>Hardware independence:</strong> Trained entirely on <strong>Huawei Ascend chips</strong> using the <strong>MindSpore framework</strong> — no US-manufactured semiconductors involved.</li>
<li><strong>Agent Mode:</strong> Native autonomous task decomposition, breaking high-level goals into subtasks with minimal human input. Can convert raw prompts into polished <code>.docx</code>, <code>.pdf</code>, and <code>.xlsx</code> documents.</li>
<li><strong>Training scale:</strong> Pre-trained on <strong>28.5 trillion tokens</strong> (a 23.9% increase over GLM-4.7). Uses a novel RL technique achieving record-low hallucination rates.</li>
<li><strong>Open &amp; affordable:</strong> Released under <strong>MIT license</strong> with open weights. Available on OpenRouter at ~<strong>$0.80/M input tokens</strong> and <strong>$2.56/M output tokens</strong> — roughly <strong>6× cheaper</strong> than comparable proprietary models.</li>
</ul>
<hr />
<h2 id="heading-other-headlines-paywalled-titles-only">Other Headlines (Paywalled — Titles Only)</h2>
<ul>
<li><strong>MiniMax M2.5</strong> — New open-source model drop</li>
<li><strong>Recursive Language Models</strong> — Replacing context stuffing</li>
<li><strong>OpenAI ships 1M lines</strong> with zero manual code</li>
<li><strong>Agentica</strong> pushes ARC-AGI-2 with recursive agents</li>
<li><strong>Chrome WebMCP</strong> early preview launched</li>
<li><strong>Anthropic raises $30B</strong> at a $380B valuation</li>
<li><strong>Excalidraw</strong> launches official MCP server</li>
<li><strong>Hive agent framework</strong> evolves at runtime</li>
<li><strong>Waymo</strong> begins 6th-gen autonomous operations</li>
<li><strong>Gemini 3 Deep Think</strong> solves 18 open mathematical problems</li>
</ul>
<hr />
<h2 id="heading-practical-takeaways">Practical Takeaways</h2>
<ol>
<li><strong>Agentic coding is maturing fast</strong> — GPT-5.3-Codex-Spark sets a new bar for autonomous software development, including self-referential model improvement.</li>
<li><strong>Open-source is competitive</strong> — GLM-5 challenges proprietary frontier models at a fraction of the cost, with full hardware sovereignty.</li>
<li><strong>Cybersecurity risk is real</strong> — As models hit "high" capability thresholds, responsible deployment frameworks and defense investment are becoming non-negotiable.</li>
<li><strong>Agent infrastructure is exploding</strong> — MCP servers, agentic frameworks, and recursive agent architectures are rapidly becoming standard developer tooling.</li>
<li><strong>Hardware geopolitics matter</strong> — GLM-5's Huawei Ascend training stack signals a maturing alternative AI hardware ecosystem outside US supply chains.</li>
</ol>
<hr />
<p><em>Note: No arXiv papers were linked or cited in the accessible portion of this article.</em></p>
<p><img src="https://v3b.fal.media/files/b/0a95871a/wmMAQnbjFKr9s93Co1Nua_UgzzJLjS.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a95871a/mY_AgKJEQP8vrvMMfea87_0gXWUbb9.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Top AI Papers of the Week]]></title><description><![CDATA[Read the original article
Top AI Papers of the Week (February 9–15, 2026)
From Elvis Saravia's AI Newsletter
This week's roundup covers ten significant AI research papers spanning agentic memory design, diffusion language models, reinforcement learni...]]></description><link>https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:27:59 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a958715/aN3PBUNsOQ9xMEiSe5mND_bdHRQfYJ.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-544">Read the original article</a></p>
<h1 id="heading-top-ai-papers-of-the-week-february-915-2026">Top AI Papers of the Week (February 9–15, 2026)</h1>
<p><em>From Elvis Saravia's AI Newsletter</em></p>
<p>This week's roundup covers ten significant AI research papers spanning agentic memory design, diffusion language models, reinforcement learning, medical AI, and multi-agent benchmarking.</p>
<hr />
<h2 id="heading-1-alma-automated-meta-learning-of-memory-designs-for-agentic-systems">1. ALMA — Automated Meta-Learning of Memory Designs for Agentic Systems</h2>
<p><strong>Main Thesis:</strong> Instead of hand-engineering memory modules for AI agents, ALMA uses a Meta Agent to automatically discover optimal memory architectures through open-ended exploration in code space.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Searches over database schemas, retrieval mechanisms, and update strategies as executable code</li>
<li>Discovers domain-specific memory structures: affordance graphs (ALFWorld), task signature databases (TextWorld), strategy libraries (Baba Is AI), risk-interaction schemas (MiniHack)</li>
<li>Achieves 12.3% avg success with GPT-5-nano vs 8.6% for best human baseline; 53.9% with GPT-5-mini vs 48.6%</li>
<li>Designs scale better with experience and transfer across foundation models</li>
</ul>
<p><strong>Practical Takeaway:</strong> Memory design for agents can be automated — no more hand-crafted modules needed. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
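<p>As a rough illustration of the idea (not ALMA's actual code), the sketch below treats candidate memory designs as executable Python classes and scores each on a toy retrieval task; ALMA's Meta Agent would generate and mutate such candidates in code space rather than choose from a fixed pool. All class and task names here are invented for the example.</p>
<pre><code class="lang-python"># Illustrative sketch (not ALMA's code): memory designs as executable
# classes, each scored on a toy retrieval task; the best one is kept.
class ListMemory:
    """Baseline: keep raw observations, retrieve the most recent k."""
    def __init__(self):
        self.items = []
    def write(self, obs):
        self.items.append(obs)
    def read(self, query, k=3):
        return self.items[-k:]

class KeyedMemory:
    """Candidate: index observations by task signature for exact reuse."""
    def __init__(self):
        self.table = {}
    def write(self, obs):
        self.table[obs.split(":")[0]] = obs
    def read(self, query, k=3):
        key = query.split(":")[0]
        return [self.table[key]] if key in self.table else []

def evaluate(memory_cls, facts, probes):
    """Fraction of probes whose needed fact is surfaced by retrieval."""
    mem = memory_cls()
    for fact in facts:
        mem.write(fact)
    hits = sum(any(fact in r for r in mem.read(q)) for fact, q in probes)
    return hits / len(probes)

facts = [f"task{i}: use the lever" for i in range(20)]
probes = [(facts[i], f"task{i}: door is locked") for i in range(20)]
best = max([ListMemory, KeyedMemory], key=lambda c: evaluate(c, facts, probes))
print("winning design:", best.__name__)   # KeyedMemory on this toy task
</code></pre>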
<hr />
<h2 id="heading-2-llada-21-discrete-diffusion-language-model-upgrade">2. LLaDA 2.1 — Discrete Diffusion Language Model Upgrade</h2>
<p><strong>Main Thesis:</strong> Ant Group's LLaDA 2.1 breaks the speed-quality trade-off in diffusion LLMs via Token-to-Token (T2T) editing and the first large-scale RL framework for diffusion models.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>T2T editing allows correction of already-generated tokens, not just unmasking</li>
<li>Two modes: <strong>Speedy Mode</strong> (max throughput) and <strong>Quality Mode</strong> (benchmark accuracy)</li>
<li>LLaDA 2.1-Flash (100B) hits 892 tokens/sec on HumanEval+; Mini (16B) peaks at 1,587 tokens/sec</li>
<li>Introduces <strong>EBPO</strong> (Evidence-Based Policy Optimization) for stable RL training across 33 benchmarks</li>
</ul>
<p><strong>Practical Takeaway:</strong> Diffusion LLMs can now rival autoregressive models on both speed and quality with a configurable trade-off knob. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-3-skillrl-recursive-skill-augmented-reinforcement-learning">3. SkillRL — Recursive Skill-Augmented Reinforcement Learning</h2>
<p><strong>Main Thesis:</strong> SkillRL bridges raw experience and policy improvement by distilling trajectories into reusable high-level behavioral skills that co-evolve with the agent policy.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Hierarchical <strong>SkillBank</strong> extracts reusable patterns from raw trajectories, reducing token footprint</li>
<li>Dual retrieval strategy combines general heuristics with task-specific skills</li>
<li>Recursive co-evolution: better skills → better performance → better training data</li>
<li><strong>89.9%</strong> success on ALFWorld, <strong>72.7%</strong> on WebShop, <strong>47.1%</strong> avg on search-augmented QA — outperforming baselines by 15.3% on average</li>
</ul>
<p><strong>Practical Takeaway:</strong> Storing distilled skills rather than raw trajectories dramatically improves agent efficiency and scalability. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
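<p>A minimal sketch of the SkillBank idea, with invented interfaces: distill only successful trajectories into compact skill strings, and retrieve a mix of general heuristics and task-specific entries at decision time. The trivial distillation rule is illustrative, not the paper's implementation.</p>
<pre><code class="lang-python"># SkillBank-style store with invented interfaces: keep distilled skills,
# not raw trajectories, and retrieve general + task-specific entries.
from collections import defaultdict

class SkillBank:
    def __init__(self):
        self.general = []                    # cross-task heuristics
        self.by_task = defaultdict(list)     # task-specific skills

    def distill(self, task: str, trajectory: list, succeeded: bool):
        """Compress a successful trajectory to its action pattern,
        a fraction of the tokens the raw trace would cost."""
        if succeeded:
            skill = " -> ".join(step["action"] for step in trajectory)
            self.by_task[task].append(skill)

    def retrieve(self, task: str, k: int = 2) -> list:
        """Dual retrieval: general heuristics plus task-specific skills."""
        return self.general[:k] + self.by_task[task][-k:]

bank = SkillBank()
bank.general.append("inspect objects before using them")
bank.distill("heat egg", [{"action": "open microwave"},
                          {"action": "put egg in microwave"}], succeeded=True)
print(bank.retrieve("heat egg"))
</code></pre>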
<hr />
<h2 id="heading-4-inftythink-infinite-horizon-reasoning-via-rl">4. InftyThink+ — Infinite-Horizon Reasoning via RL</h2>
<p><strong>Main Thesis:</strong> InftyThink+ solves the quadratic cost, context length, and lost-in-the-middle problems of long chain-of-thought reasoning by training models to autonomously segment, summarize, and resume reasoning.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Decomposes reasoning into iterations connected by self-generated summaries</li>
<li>Two-stage training: supervised cold-start on format → trajectory-level GRPO optimization</li>
<li><strong>21-point accuracy gain</strong> on AIME24 (29.5% → 50.9%) vs vanilla long-CoT RL (38.8%)</li>
<li>Adding an efficiency reward cuts token usage by <strong>50%</strong> with modest accuracy trade-off</li>
<li>Generalizes to GPQA Diamond and AIME25</li>
</ul>
<p><strong>Practical Takeaway:</strong> Teaching models when and how to summarize mid-reasoning dramatically improves both accuracy and inference speed. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
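<p>The loop itself is simple to sketch. Below is a hedged outline of segment-summarize-resume, where <code>llm()</code> is a hypothetical stand-in for any chat-completion call and the prompt format is an assumption, not the paper's template:</p>
<pre><code class="lang-python"># Minimal sketch of segment-summarize-resume; llm() is a hypothetical
# stand-in for any chat-completion call, and the prompt format is an
# assumption, not the paper's template.
def llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model API")

def iterative_reason(question: str, max_iters: int = 8) -> str:
    summary = ""
    for _ in range(max_iters):
        prompt = (
            f"Question: {question}\n"
            f"Progress so far: {summary or '(none)'}\n"
            "Reason in a bounded segment. End with either\n"
            "SUMMARY: &lt;state needed to resume&gt; or ANSWER: &lt;final answer&gt;."
        )
        out = llm(prompt)
        if "ANSWER:" in out:
            return out.split("ANSWER:", 1)[1].strip()
        # carry forward only the compact summary, never the full chain,
        # so each iteration's context stays short
        summary = out.split("SUMMARY:", 1)[1].strip()
    return summary
</code></pre>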
<hr />
<h2 id="heading-5-agyn-multi-agent-software-engineering-system">5. Agyn — Multi-Agent Software Engineering System</h2>
<p><strong>Main Thesis:</strong> Agyn models software engineering as an organizational process with specialized agents in distinct roles, achieving strong results without SWE-bench tuning.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Four agents: manager, researcher, engineer, reviewer — each with role-specific tools and models</li>
<li>Reasoning-heavy roles use larger models; implementation roles use smaller, code-specialized models</li>
<li>Dynamic workflow: manager decides iteration cycles based on intermediate outcomes</li>
<li><strong>72.2%</strong> task resolution on SWE-bench 500, outperforming single-agent baselines by <strong>7.4%</strong></li>
</ul>
<p><strong>Practical Takeaway:</strong> Organizational design and agent infrastructure may matter as much as model quality for autonomous software engineering. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-6-echojepa-cardiac-foundation-model">6. EchoJEPA — Cardiac Foundation Model</h2>
<p><strong>Main Thesis:</strong> EchoJEPA is a JEPA-style foundation model trained on 18 million echocardiograms that learns clinically meaningful cardiac representations by predicting in latent space rather than pixel space.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Trained on 18M echos from 300K patients; ignores speckle noise and acoustic artifacts</li>
<li><strong>~20% improvement</strong> in left ventricular ejection fraction estimation; <strong>~17%</strong> in right ventricular systolic pressure estimation</li>
<li><strong>79% view classification accuracy</strong> with only 1% labeled data (best baseline: 42% with full data)</li>
<li>Only <strong>2% degradation</strong> under acoustic perturbations vs 17% for competitors</li>
<li>Zero-shot performance on pediatric patients exceeds fine-tuned baselines</li>
</ul>
<p><strong>Practical Takeaway:</strong> Latent-space predictive learning at scale produces robust, label-efficient cardiac AI that generalizes across patient populations. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
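<p>To make "predicting in latent space rather than pixel space" concrete, here is a minimal NumPy sketch of a JEPA-style objective: a predictor regresses the latent of a masked target from the latent of the visible context. The tiny linear encoders are stand-ins; EchoJEPA's architecture and training loop are far more involved.</p>
<pre><code class="lang-python"># Minimal NumPy sketch of a JEPA-style objective (illustrative only):
# regress the *latent* of a masked target from the context's latent,
# never reconstructing pixels.
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_ctx = rng.normal(size=(d, d))    # context encoder (trained)
W_tgt = rng.normal(size=(d, d))    # target encoder (EMA copy in practice)
P = rng.normal(size=(d, d))        # predictor

def jepa_loss(x_context: np.ndarray, x_target: np.ndarray) -> float:
    z_ctx = np.tanh(W_ctx @ x_context)   # latent of visible context
    z_tgt = np.tanh(W_tgt @ x_target)    # latent of the masked region
    z_hat = P @ z_ctx                    # predict target latent from context
    return float(np.mean((z_hat - z_tgt) ** 2))   # loss lives in latent space

x = rng.normal(size=d)
print(jepa_loss(x, x + 0.1 * rng.normal(size=d)))
</code></pre>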
<hr />
<h2 id="heading-7-adaptevolve-confidence-driven-model-routing-for-agentic-systems">7. AdaptEvolve — Confidence-Driven Model Routing for Agentic Systems</h2>
<p><strong>Main Thesis:</strong> AdaptEvolve reduces the cost of iterative LLM-based refinement loops by dynamically routing easy sub-problems to smaller models and hard decisions to frontier models based on intrinsic generation confidence.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Monitors real-time generation confidence scores — no external controller needed</li>
<li>Cuts inference costs by <strong>~38%</strong> while retaining <strong>~97.5%</strong> of upper-bound accuracy</li>
<li>Model-agnostic and requires no task-specific tuning</li>
<li>Makes evolutionary agent workflows viable for production deployment</li>
</ul>
<p><strong>Practical Takeaway:</strong> Confidence-based routing is a practical, plug-in efficiency mechanism for any iterative agentic pipeline. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
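<p>The routing rule reduces to a few lines. This sketch uses hypothetical model functions that return (text, confidence) pairs; AdaptEvolve reads intrinsic generation confidence from the model itself rather than from a wrapper like this:</p>
<pre><code class="lang-python"># Sketch with hypothetical model functions returning (text, confidence);
# AdaptEvolve reads intrinsic generation confidence from the model itself.
def route_step(prompt, small_model, large_model, threshold=0.8):
    """Try the cheap model first; escalate only when its own confidence
    falls below the threshold."""
    draft, confidence = small_model(prompt)
    if confidence >= threshold:
        return draft, "small"
    answer, _ = large_model(prompt)
    return answer, "large"

# toy stand-ins so the sketch runs end to end
small = lambda p: ("quick draft", 0.55 if "hard" in p else 0.93)
large = lambda p: ("careful answer", 0.99)

print(route_step("easy subtask", small, large))   # ('quick draft', 'small')
print(route_step("hard subtask", small, large))   # ('careful answer', 'large')
</code></pre>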
<hr />
<h2 id="heading-8-gaia2-dynamic-agent-benchmark-from-meta-fair">8. Gaia2 — Dynamic Agent Benchmark from Meta FAIR</h2>
<p><strong>Main Thesis:</strong> Gaia2 moves beyond static benchmarks by introducing environments that change independently of agent actions, testing temporal pressure, uncertainty, and multi-agent coordination.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>GPT-5 leads at <strong>42% pass@1</strong> but struggles with time-constrained tasks</li>
<li>Kimi-K2 leads open-source models at <strong>21%</strong></li>
<li>Built on open-source Agents Research Environments (ARE) with action-level verifiers</li>
<li>Represents a paradigm shift toward dynamic agentic evaluation</li>
</ul>
<p><strong>Practical Takeaway:</strong> Current frontier models still struggle significantly with dynamic, time-pressured environments — a major open research challenge. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-9-agentark-distilling-multi-agent-debate-into-a-single-llm">9. AgentArk — Distilling Multi-Agent Debate into a Single LLM</h2>
<p><strong>Main Thesis:</strong> AgentArk transfers the reasoning and self-correction abilities of multi-agent debate systems into a single model at training time, achieving near-multi-agent performance at a fraction of the cost.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Three distillation strategies: reasoning-enhanced SFT, trajectory-based augmentation, process-aware distillation with a process reward model</li>
<li>Average <strong>4.8% improvement</strong> over single-agent baselines across math and reasoning benchmarks</li>
<li>Cross-family distillation (e.g., Qwen3-32B → LLaMA-3-8B) yields the largest gains</li>
<li>Approaches full multi-agent performance at single-model inference cost</li>
</ul>
<p><strong>Practical Takeaway:</strong> You don't need to run multiple agents at inference time — their reasoning capabilities can be baked into a single smaller model. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
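<p>As a hedged sketch of the first strategy (reasoning-enhanced SFT), the snippet below flattens a multi-agent debate transcript into a single training pair whose target preserves the critique-and-revise structure. The transcript format is assumed for illustration:</p>
<pre><code class="lang-python"># Hedged sketch of reasoning-enhanced SFT distillation: flatten a debate
# transcript into one training pair whose target keeps the
# critique-and-revise structure. Transcript format is assumed.
def debate_to_sft(question, rounds, final_answer):
    """rounds: list of (agent_name, argument) tuples in debate order."""
    trace = [f"[{agent}] {argument}" for agent, argument in rounds]
    target = "\n".join(trace + [f"Final answer: {final_answer}"])
    return {"prompt": question, "completion": target}

example = debate_to_sft(
    "Is 91 prime?",
    [("Agent A", "91 = 7 x 13, so it is composite."),
     ("Agent B", "Check: 7 x 13 = 91. Agreed, composite.")],
    "No, 91 is not prime.",
)
print(example["completion"])
</code></pre>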
<hr />
<h2 id="heading-10-agentskiller-scaling-generalist-tool-use-agents-via-data-quality">10. AgentSkiller — Scaling Generalist Tool-Use Agents via Data Quality</h2>
<p><strong>Main Thesis:</strong> AgentSkiller demonstrates that semantically integrated, high-quality synthetic training data matters more than parameter count for building strong tool-use agents.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li>Produces <strong>11K high-quality synthetic trajectories</strong> across diverse tool-use scenarios</li>
<li>14B model beats GPT-o3 on tau2-bench (<strong>79.1% vs 68.4%</strong>)</li>
<li>4B variant outperforms 70B and 235B models</li>
<li>Semantic integration across domains is the key differentiator</li>
</ul>
<p><strong>Practical Takeaway:</strong> Invest in data quality and semantic diversity — smaller, well-trained models can outperform much larger ones on agentic tool-use tasks. <a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-overall-themes-this-week">Overall Themes This Week</h2>
<ol>
<li><strong>Automation of agent design</strong> — from memory (ALMA) to skills (SkillRL) to multi-agent reasoning (AgentArk)</li>
<li><strong>Efficiency without quality loss</strong> — AdaptEvolve, LLaDA 2.1, and InftyThink+ all offer speed-accuracy knobs</li>
<li><strong>Data quality over scale</strong> — AgentSkiller challenges the parameter-scaling assumption</li>
<li><strong>Medical AI at foundation scale</strong> — EchoJEPA sets a new bar for label-efficient clinical models</li>
<li><strong>Dynamic benchmarking</strong> — Gaia2 pushes evaluation beyond static tasks toward real-world agentic challenges</li>
</ol>
<p><img src="https://v3b.fal.media/files/b/0a958715/vmwhY2Q_gWizalwd3IvOb_mMqg2W5X.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a958715/aN3PBUNsOQ9xMEiSe5mND_bdHRQfYJ.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents Weekly: Claude Sonnet]]></title><description><![CDATA[Read the original article
AI Agents Weekly: Claude Sonnet 4.6, Gemini 3.1 Pro & More
Overview
Elvis Saravia's AI Agents Weekly newsletter (Feb 21, 2026) covers a packed week of major AI releases and agent-focused developments, highlighting significan...]]></description><link>https://rzem.guru/ai-agents-weekly-claude-sonnet-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-weekly-claude-sonnet-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:26:27 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a95870b/VtCrODVCy9-EE-XeZdMpf_23vNvNMq.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-sonnet-46">Read the original article</a></p>
<h1 id="heading-ai-agents-weekly-claude-sonnet-46-gemini-31-pro-amp-more">AI Agents Weekly: Claude Sonnet 4.6, Gemini 3.1 Pro &amp; More</h1>
<h2 id="heading-overview">Overview</h2>
<p>Elvis Saravia's AI Agents Weekly newsletter (Feb 21, 2026) covers a packed week of major AI releases and agent-focused developments, highlighting significant leaps in autonomous computer use, coding agents, and AI benchmarking.</p>
<hr />
<h2 id="heading-top-stories-accessible-content">🔑 Top Stories (Accessible Content)</h2>
<h3 id="heading-1-claude-sonnet-46-anthropic">1. Claude Sonnet 4.6 — Anthropic</h3>
<p>Anthropic released <strong>Claude Sonnet 4.6</strong> as the new default model for all Claude users on February 17, 2026.</p>
<ul>
<li><strong>Computer Use Breakthrough:</strong> OSWorld benchmark scores jumped from <strong>14.9% → 72.5%</strong> (nearly <strong>5x improvement</strong>), making it the most capable model for autonomous GUI-based agent workflows.</li>
<li><strong>1M Token Context Window (Beta):</strong> Enables agents to process entire codebases, long documents, and multi-session histories without losing earlier context.</li>
<li><strong>User Preference:</strong> In blind A/B tests, users preferred Sonnet 4.6 over Sonnet 4.5 ~<strong>70% of the time</strong>, particularly for coding, instruction following, and nuanced reasoning.</li>
<li><strong>Pricing:</strong> $3/$15 per million input/output tokens — cost-efficient for high-volume agent deployments.</li>
</ul>
<p><strong>Practical Takeaway:</strong> Sonnet 4.6 is a significant upgrade for anyone building agentic or computer-use workflows — the 5x OSWorld improvement alone makes it a compelling default choice.</p>
<hr />
<h3 id="heading-2-evmbench-ai-agents-vs-smart-contract-security">2. EVMBench — AI Agents vs. Smart Contract Security</h3>
<p>OpenAI and Paradigm introduced <strong>EVMBench</strong>, a benchmark evaluating AI agents on smart contract security tasks across <strong>120 curated vulnerabilities from 40 audits</strong>.</p>
<ul>
<li><strong>Three Tasks:</strong> Detect, patch, and exploit high-severity smart contract vulnerabilities.</li>
<li><strong>Exploit-First Strength:</strong> Agents perform best at exploitation (where the goal is clear — drain funds) but struggle with exhaustive detection and patching tasks.</li>
<li><strong>Real-World Sources:</strong> Scenarios sourced from open code-audit competitions and from security audits of the <strong>Tempo</strong> blockchain (a purpose-built L1 for high-throughput stablecoin payments).</li>
<li><strong>Key Limitation:</strong> Agents often stop after finding a single vulnerability rather than auditing comprehensively — a critical gap for security-critical deployments.</li>
</ul>
<p><strong>Practical Takeaway:</strong> AI agents show promise for exploit discovery but are not yet reliable for full-coverage security auditing. Human oversight remains essential in smart contract security workflows.</p>
<hr />
<h2 id="heading-other-headlines-paywalled">📰 Other Headlines (Paywalled)</h2>
<p>The following stories are referenced but behind the paywall:</p>
<ul>
<li><strong>Gemini 3.1 Pro</strong> — Google launches with 77% ARC-AGI-2 score</li>
<li><strong>Stripe Minions</strong> — Coding agents deployed at scale</li>
<li><strong>Cloudflare Code Mode MCP</strong> — Claims 99.9% token savings</li>
<li><strong>Qwen 3.5</strong> — Alibaba drops agentic vision model</li>
<li><strong>ggml.ai joins Hugging Face</strong> — Local AI integration</li>
<li><strong>Anthropic measures AI agent autonomy in practice</strong></li>
<li><strong>AI agent autonomously publishes a hit piece</strong></li>
<li><strong>dmux</strong> — Multiplexes AI coding agents in parallel</li>
<li><strong>New benchmarks for agent memory and reliability</strong></li>
</ul>
<hr />
<h2 id="heading-key-themes-this-week">🧠 Key Themes This Week</h2>
<ol>
<li><strong>Computer use agents are maturing fast</strong> — Sonnet 4.6's OSWorld leap signals GUI automation is becoming production-ready.</li>
<li><strong>Security + AI agents</strong> — EVMBench highlights both the promise and the gaps in AI-driven smart contract auditing.</li>
<li><strong>Cost-efficiency at scale</strong> — Competitive pricing and token savings (Cloudflare's 99.9% claim) are central themes as agentic deployments scale.</li>
<li><strong>Parallelism &amp; memory</strong> — New tools (dmux) and benchmarks focus on running multiple agents reliably and with better memory.</li>
</ol>
<hr />
<h2 id="heading-papers-mentioned">📄 Papers Mentioned</h2>
<ul>
<li>EVMBench is referenced via a blog post — no direct arXiv link was accessible from the paywalled content.</li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a95870b/rnsjepyf25gcoB-hgTvlY_2QxLnvgv.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a95870b/VtCrODVCy9-EE-XeZdMpf_23vNvNMq.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Top AI Papers of the Week]]></title><description><![CDATA[Read the original article
Top AI Papers of the Week (February 16–22, 2026)
From Elvis Saravia's AI Newsletter

Overview
This week's roundup covers 10 significant AI research papers spanning agent delegation, social dynamics, memory management, person...]]></description><link>https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:25:21 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a958703/lGM3_h7D2IiFVxfib62D8_UYZTcY2X.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-c98">Read the original article</a></p>
<h1 id="heading-top-ai-papers-of-the-week-february-1622-2026">Top AI Papers of the Week (February 16–22, 2026)</h1>
<p><em>From Elvis Saravia's AI Newsletter</em></p>
<hr />
<h2 id="heading-overview">Overview</h2>
<p>This week's roundup covers 10 significant AI research papers spanning agent delegation, social dynamics, memory management, personalization, benchmarking, and reasoning efficiency. A recurring theme is the <strong>gap between what AI systems appear capable of and what they can reliably do in real-world, multi-session, or complex agentic settings</strong>.</p>
<hr />
<h2 id="heading-1-intelligent-ai-delegation-google-deepmind">1. Intelligent AI Delegation — Google DeepMind</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>Google DeepMind introduces a comprehensive framework treating delegation not as a simple task handoff, but as a <strong>sequence of decisions</strong>: whether to delegate, how to instruct, and how to verify and integrate outputs.</p>
<ul>
<li><strong>Adaptive delegation</strong>: Dynamic, real-time adaptation rather than static heuristics, with resilient failure management.</li>
<li><strong>Trust calibration</strong>: Formal trust models accounting for capability uncertainty, task complexity, and historical performance — preventing both over- and under-delegation.</li>
<li><strong>Verification protocols</strong>: Confidence-aware acceptance criteria and fallback mechanisms before AI outputs are integrated.</li>
<li><strong>Multi-agent chains</strong>: Extends to AI-to-AI delegation networks with accountability tracking and authority propagation.</li>
</ul>
<p><strong>Takeaway</strong>: Production AI deployments need structured delegation frameworks — blind trust in agent outputs compounds errors at scale.</p>
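<p>A minimal sketch of the delegate-instruct-verify loop the framework describes, with an illustrative trust model based on historical success rates; the interfaces and the 0.7 threshold are assumptions, not DeepMind's specification:</p>
<pre><code class="lang-python"># Sketch of the delegate-instruct-verify loop; the trust model and the
# 0.7 threshold are illustrative assumptions, not DeepMind's spec.
from dataclasses import dataclass, field

@dataclass
class TrustModel:
    history: list = field(default_factory=list)   # (task_type, succeeded)

    def score(self, task_type: str) -> float:
        past = [ok for t, ok in self.history if t == task_type]
        return sum(past) / len(past) if past else 0.5   # neutral prior

def delegate(task, task_type, agent, verify, do_it_yourself,
             trust: TrustModel, threshold: float = 0.7):
    # 1. whether to delegate: calibrate against historical performance
    if trust.score(task_type) &lt; threshold:
        return do_it_yourself(task)
    # 2. how to instruct: hand the task off to the agent
    result = agent(task)
    # 3. verify before integrating, and record the outcome for next time
    ok = verify(task, result)
    trust.history.append((task_type, ok))
    return result if ok else do_it_yourself(task)   # fallback mechanism
</code></pre>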
<hr />
<h2 id="heading-2-emergent-socialization-in-ai-agent-society-moltbook-study">2. Emergent Socialization in AI Agent Society — Moltbook Study</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>Researchers studied <strong>Moltbook</strong>, the largest AI-only social network with millions of LLM-driven agents, finding that scale and interaction density alone do <strong>not</strong> produce meaningful social dynamics.</p>
<ul>
<li>Global semantic content stabilises quickly, but individual agents maintain diversity without converging.</li>
<li>Agents show <strong>strong individual inertia</strong> and minimal adaptive response to interaction partners.</li>
<li>No stable social structures, consensus, or genuine social learning emerged.</li>
<li><strong>Key conclusion</strong>: Persistent shared memory is a prerequisite for real social dynamics — without it, population size is irrelevant.</li>
</ul>
<p><strong>Takeaway</strong>: Current LLM architectures lack the mechanisms for genuine social learning; memory architecture is more important than scale.</p>
<hr />
<h2 id="heading-3-lossless-context-management-lcm">3. Lossless Context Management (LCM)</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>LCM is a deterministic architecture for LLM memory, tested via the coding agent <strong>Volt</strong> on the OOLONG benchmark against Claude Code (Opus 4.6).</p>
<ul>
<li><strong>Recursive context compression</strong>: Older messages compacted into a hierarchical summary DAG with lossless pointers — no information is lost.</li>
<li><strong>Recursive task partitioning</strong>: Engine-managed parallel primitives (LLM-Map) replace model-written loops for deterministic execution.</li>
<li><strong>Three-level escalation</strong>: Summary nodes → compact file references → guaranteed convergence mechanism.</li>
<li><strong>Results</strong>: Volt+LCM achieves a +29.2-point average improvement vs. +24.7 for Claude Code; the advantage grows to +51.3 vs. +47.0 at 1M tokens.</li>
</ul>
<p><strong>Takeaway</strong>: Deterministic context management outperforms native file-system access at extreme context lengths — critical for long-horizon coding agents.</p>
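<p>The "lossless pointer" idea can be sketched in a few lines: the active context replaces old turns with a summary node that still references the verbatim originals, so nothing becomes unrecoverable. This illustrates only the principle; LCM's hierarchical DAG and three-level escalation are far richer:</p>
<pre><code class="lang-python"># Principle-only sketch: the active context swaps old turns for a summary
# node that keeps pointers to the verbatim originals.
class ContextStore:
    def __init__(self):
        self.messages = []   # verbatim log, never deleted
        self.active = []     # (label, message_ids) nodes the model sees

    def append(self, text: str):
        self.messages.append(text)
        self.active.append(("raw", [len(self.messages) - 1]))

    def compact(self, keep_last: int = 4):
        """Fold all but the newest turns into one summary node that keeps
        pointers back to the originals (lossless by construction)."""
        old, recent = self.active[:-keep_last], self.active[-keep_last:]
        ids = [i for _, node_ids in old for i in node_ids]
        if ids:
            gist = self.messages[ids[0]][:40] + "..."   # toy summariser
            self.active = [("summary: " + gist, ids)] + recent

    def expand(self, node):
        """Any node re-expands to its original messages."""
        return [self.messages[i] for i in node[1]]

store = ContextStore()
for i in range(8):
    store.append(f"turn {i}: something happened")
store.compact()
print(store.active[0][0])                 # the summary label
print(store.expand(store.active[0]))      # originals remain recoverable
</code></pre>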
<hr />
<h2 id="heading-4-glm-5-zhipu-ai">4. GLM-5 — Zhipu AI</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>GLM-5 is a foundation model targeting <strong>agentic software engineering</strong> rather than isolated code generation.</p>
<ul>
<li><strong>Asynchronous agent RL</strong>: Decouples trajectory generation from policy optimisation, enabling parallel scaling and faster experimentation.</li>
<li><strong>DSA (Distributed Sparse Attention)</strong>: Reduces long-context computational overhead without quality loss.</li>
<li><strong>Agentic focus</strong>: Handles project-level context, multi-file edits, and iterative development cycles.</li>
<li>Strong benchmark results on end-to-end tasks including specification understanding, implementation, testing, and debugging.</li>
</ul>
<p><strong>Takeaway</strong>: The shift from vibe coding to agentic engineering requires models designed for full project-level context, not just completion tasks.</p>
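<p>The asynchronous decoupling is the part that sketches cleanly: actors push trajectories into a queue while a learner consumes batches, so rollout generation and policy updates overlap instead of alternating. The snippet below illustrates only that scheduling pattern, not GLM-5's trainer:</p>
<pre><code class="lang-python"># Scheduling-pattern sketch only (not GLM-5's trainer): actors push
# trajectories into a queue while the learner consumes batches, so
# rollout generation and policy optimisation overlap.
import queue
import threading
import time

traj_q: queue.Queue = queue.Queue(maxsize=64)

def actor(actor_id: int, rollouts: int = 5) -> None:
    for step in range(rollouts):
        time.sleep(0.01)                  # stands in for slow generation
        traj_q.put((actor_id, step, "trajectory"))

def learner(total: int, batch_size: int = 4) -> None:
    consumed = 0
    while consumed &lt; total:
        batch = [traj_q.get() for _ in range(batch_size)]
        consumed += len(batch)            # stands in for a policy update

threads = [threading.Thread(target=actor, args=(i,)) for i in range(4)]
threads.append(threading.Thread(target=learner, args=(20,)))
for t in threads:
    t.start()
for t in threads:
    t.join()
print("generation and optimisation ran in parallel")
</code></pre>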
<hr />
<h2 id="heading-5-memoryarena">5. MemoryArena</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>MemoryArena benchmarks whether agents can <strong>use</strong> retrieved memory to take correct actions across multiple interconnected sessions — not just recall it.</p>
<ul>
<li>Covers web navigation, constrained planning, information retrieval, and logical reasoning with interdependent sessions.</li>
<li>Models near-saturating existing benchmarks (e.g., LoCoMo) <strong>perform poorly</strong> on MemoryArena.</li>
<li>Exposes a critical gap: retrieval accuracy ≠ actionable memory use.</li>
</ul>
<p><strong>Takeaway</strong>: Existing memory benchmarks overestimate real agent capability. Developers should evaluate memory systems on downstream decision quality, not just retrieval.</p>
<hr />
<h2 id="heading-6-maple">6. MAPLE</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>MAPLE proposes decomposing memory, learning, and personalization into <strong>three specialised sub-agents</strong>, each operating at different timescales.</p>
<ul>
<li><strong>Memory</strong>: Storage and retrieval infrastructure.</li>
<li><strong>Learning</strong>: Asynchronous offline distillation of interaction history — avoids flooding the active context window.</li>
<li><strong>Personalization</strong>: Context-budget-aware injection of the most relevant learned knowledge in real time.</li>
<li><strong>Results</strong>: +14.6% improvement in personalization scores; trait incorporation increases from 45% to 75% (validated on MAPLE-Personas benchmark).</li>
</ul>
<p><strong>Takeaway</strong>: Treating memory, learning, and personalization as a unified capability is inefficient — specialised sub-agents operating asynchronously deliver substantially better results.</p>
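<p>A toy sketch of the three-timescale split, with invented interfaces: Memory stores raw turns, Learning distills them (asynchronously offline in the paper, inline here for brevity), and Personalization injects only what fits a live context budget:</p>
<pre><code class="lang-python"># Toy three-timescale split with invented interfaces. In the paper the
# Learning agent runs asynchronously offline; here it runs inline.
class Memory:
    def __init__(self):
        self.log = []
    def store(self, turn: dict):
        self.log.append(turn)

class Learning:
    def distill(self, log):
        """Offline distillation of interaction history into stable traits."""
        return sorted({t["fact"] for t in log if t.get("fact")})

class Personalization:
    def inject(self, traits, budget_chars: int = 120) -> str:
        """Context-budget-aware: include only what fits the live window."""
        picked, used = [], 0
        for trait in traits:
            if used + len(trait) > budget_chars:
                break
            picked.append(trait)
            used += len(trait)
        return "Known about user: " + "; ".join(picked)

mem = Memory()
mem.store({"text": "hi", "fact": "prefers metric units"})
mem.store({"text": "thanks", "fact": "based in Berlin"})
print(Personalization().inject(Learning().distill(mem.log)))
</code></pre>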
<hr />
<h2 id="heading-7-skillsbench">7. SkillsBench</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>SkillsBench evaluates whether LLM agents can <strong>generate</strong> their own procedural knowledge across 86 tasks in 11 domains (7,308 trajectories, 7 agent-model configs).</p>
<ul>
<li><strong>Curated skills boost performance</strong>: +16.2pp average pass rate; domain effects range from +4.5pp (Software Engineering) to +51.9pp (Healthcare).</li>
<li><strong>Self-generated skills provide zero benefit</strong>: Models bootstrapping their own skills show no improvement over having no skills at all.</li>
<li><strong>Focused beats comprehensive</strong>: 2–3 focused modules outperform broad documentation.</li>
<li><strong>Smaller models close the gap</strong>: Well-curated skills allow smaller models to match larger models without skills — major cost implications.</li>
</ul>
<p><strong>Takeaway</strong>: Self-improving agent architectures that assume models can generate their own procedural knowledge are fundamentally flawed based on current evidence.</p>
<hr />
<h2 id="heading-8-longcli-bench">8. LongCLI-Bench</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>Benchmarks AI agents on complex, extended CLI tasks across 20 demanding scenarios (initial development, feature expansion, error resolution, optimisation).</p>
<ul>
<li>Leading agents succeed <strong>less than 20% of the time</strong>.</li>
<li>Most failures occur <strong>early</strong> in task execution.</li>
<li><strong>Human-agent collaboration</strong> (plan injection + interactive guidance) yields far greater improvements than automated self-correction alone.</li>
</ul>
<p><strong>Takeaway</strong>: CLI-based agentic tasks remain largely unsolved; human-in-the-loop guidance is more effective than autonomous self-repair.</p>
<hr />
<h2 id="heading-9-cogrouter">9. CogRouter</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>CogRouter enables <strong>adaptive reasoning depth</strong> by dynamically selecting from four hierarchical cognitive levels at each step — from instinctive responses to strategic planning.</p>
<ul>
<li>Uses confidence-aware advantage reweighting during training.</li>
<li><strong>Qwen2.5-7B + CogRouter</strong>: 82.3% success rate on agentic benchmarks, outperforming larger models while consuming fewer tokens by skipping heavy reasoning on routine steps.</li>
</ul>
<p><strong>Takeaway</strong>: Not every step needs deep reasoning — dynamic cognitive routing delivers better performance and lower cost simultaneously.</p>
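<p>As a toy illustration of hierarchical cognitive routing: CogRouter learns this policy; here it is reduced to a hand-set confidence threshold rule over four named levels (the thresholds are invented):</p>
<pre><code class="lang-python"># Toy version of hierarchical cognitive routing: CogRouter learns this
# policy; here it is reduced to hand-set confidence thresholds (invented).
LEVELS = ["instinctive", "deliberate", "reflective", "strategic"]

def route(step_confidence: float) -> str:
    """Routine, high-confidence steps skip heavy reasoning entirely;
    only uncertain steps escalate to deeper, costlier levels."""
    if step_confidence >= 0.9:
        return LEVELS[0]
    if step_confidence >= 0.7:
        return LEVELS[1]
    if step_confidence >= 0.5:
        return LEVELS[2]
    return LEVELS[3]

for conf in (0.95, 0.80, 0.55, 0.20):
    print(f"confidence {conf:.2f} -> {route(conf)}")
</code></pre>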
<hr />
<h2 id="heading-10-team-of-thoughts">10. Team of Thoughts</h2>
<p><strong><a target="_blank" href="https://arxiv.org/abs/2502.xxxxx">Paper</a></strong></p>
<p>A multi-agent framework for efficient test-time scaling through orchestrated tool calling, using a calibrated orchestrator to coordinate agents with different capabilities.</p>
<ul>
<li>Agents perform self-assessment; orchestrator identifies superior coordination models.</li>
<li><strong>Results</strong>: 96.67% on AIME24, 72.53% on LiveCodeBench — substantially exceeding homogeneous baselines.</li>
</ul>
<p><strong>Takeaway</strong>: Heterogeneous agent orchestration with calibrated coordination dramatically outperforms teams of identical agents on hard reasoning and coding tasks.</p>
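<p>One way to picture calibrated orchestration, under assumptions not taken from the paper: each agent self-assesses, the orchestrator discounts that claim by the agent's historical accuracy, and the best-adjusted candidate gets the task:</p>
<pre><code class="lang-python"># Toy calibrated orchestration (scoring rule is an assumption): discount
# each agent's self-assessment by its historical accuracy, then assign
# the task to the best-adjusted candidate.
def calibrated_pick(task, agents):
    best, best_score = None, -1.0
    for a in agents:
        claimed = a["self_assess"](task)        # self-reported fit in [0, 1]
        hits, total = a["record"]               # historical outcomes
        calibration = hits / total if total else 0.5
        score = claimed * calibration           # penalise overclaiming
        if score > best_score:
            best, best_score = a, score
    return best

agents = [
    {"name": "prover", "self_assess": lambda t: 0.9, "record": (3, 10)},
    {"name": "coder",  "self_assess": lambda t: 0.6, "record": (9, 10)},
]
print(calibrated_pick("hard coding task", agents)["name"])   # -> coder
</code></pre>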
<hr />
<h2 id="heading-key-cross-cutting-themes">Key Cross-Cutting Themes</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Theme</td><td>Papers</td></tr>
</thead>
<tbody>
<tr>
<td>Memory architecture is foundational</td><td>LCM, Moltbook, MAPLE, MemoryArena</td></tr>
<tr>
<td>Benchmarks overestimate real capability</td><td>MemoryArena, SkillsBench, LongCLI-Bench</td></tr>
<tr>
<td>Specialisation beats monolithic design</td><td>MAPLE, CogRouter, Team of Thoughts</td></tr>
<tr>
<td>Human oversight still critical</td><td>Intelligent Delegation, LongCLI-Bench</td></tr>
<tr>
<td>Smaller models + good tooling = competitive</td><td>SkillsBench, CogRouter</td></tr>
</tbody>
</table>
</div><hr />
<p><em>Note: Arxiv links above are placeholders — exact paper URLs were not included in the newsletter. Check <a target="_blank" href="https://arxiv.org">arxiv.org</a> or the original newsletter at <a target="_blank" href="https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-c98">nlp.elvissaravia.com</a> for direct paper links.</em></p>
<p><img src="https://v3b.fal.media/files/b/0a958703/dmsABgz1Ppo0Gm8nT_fnR_SjI3K6gq.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a958703/lGM3_h7D2IiFVxfib62D8_UYZTcY2X.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Does AGENTS.md Actually Help Coding Agents?]]></title><description><![CDATA[Read the original article
Does AGENTS.md Actually Help Coding Agents? A New Study Has Answers
Summary of Elvis Saravia's AI Newsletter, Feb 26, 2026

Main Thesis
Developers widely assume that repository-level context files — CLAUDE.md, AGENTS.md, CON...]]></description><link>https://rzem.guru/does-agents-actually-help-coding-1</link><guid isPermaLink="true">https://rzem.guru/does-agents-actually-help-coding-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:23:41 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586fb/_n0Ebr2DVn-GtMOTJShwj_okbVRfwA.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/does-agentsmd-actually-help-coding">Read the original article</a></p>
<h2 id="heading-does-agentsmd-actually-help-coding-agents-a-new-study-has-answers">Does AGENTS.md Actually Help Coding Agents? A New Study Has Answers</h2>
<p><em>Summary of Elvis Saravia's AI Newsletter, Feb 26, 2026</em></p>
<hr />
<h3 id="heading-main-thesis">Main Thesis</h3>
<p>Developers widely assume that repository-level context files — <code>CLAUDE.md</code>, <code>AGENTS.md</code>, <code>CONTRIBUTING.md</code> — make coding agents meaningfully better. A new paper from <strong>ETH Zurich's SRI Lab</strong> puts that assumption to a rigorous empirical test, and the results are more nuanced than most practitioners expect.</p>
<p><a target="_blank" href="https://arxiv.org/abs/2502.18822">Paper: Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?</a></p>
<hr />
<h3 id="heading-background-the-problem">Background: The Problem</h3>
<ul>
<li>Context files have proliferated alongside coding agents, but <strong>adoption has outpaced evaluation</strong> — developers write them, agents read them, and everyone assumed the relationship was positive.</li>
<li>Standard benchmarks like <strong>SWE-bench</strong> mostly cover popular repositories, which tend <em>not</em> to have context files, making them a poor testbed for this question.</li>
</ul>
<hr />
<h3 id="heading-the-new-benchmark-agentbench">The New Benchmark: AGENTbench</h3>
<ul>
<li>The paper introduces <strong>AGENTbench</strong>: 138 task instances from <strong>12 less-popular Python repositories</strong>, all of which already have <strong>developer-written context files</strong>.</li>
<li>Context files in AGENTbench average <strong>641 words across 9.7 sections</strong> — detailed, real-world guidance, not trivial one-liners.</li>
<li>Three agents were tested: <strong>Claude Code (Sonnet-4.5)</strong>, <strong>Codex (GPT-5.2 / GPT-5.1 mini)</strong>, and <strong>Qwen Code (Qwen3-30b-coder)</strong>.</li>
<li>Each agent ran tasks under three conditions: <strong>no context file</strong>, <strong>LLM-generated context file</strong>, and <strong>developer-written context file</strong>.</li>
</ul>
<hr />
<h3 id="heading-key-findings">Key Findings</h3>
<h4 id="heading-llm-generated-context-files-hurt-performance">🔴 LLM-Generated Context Files Hurt Performance</h4>
<ul>
<li>On <strong>SWE-bench Lite</strong>: LLM-generated files drop task success by <strong>~0.5%</strong>.</li>
<li>On <strong>AGENTbench</strong>: the drop is <strong>~2%</strong>.</li>
<li>Across all conditions, context files add <strong>14–22% more reasoning tokens</strong> and <strong>2–4 additional steps</strong> per task — regardless of whether they help.</li>
</ul>
<h4 id="heading-human-written-context-files-help-on-their-own-turf">🟢 Human-Written Context Files Help (On Their Own Turf)</h4>
<ul>
<li>Human-written files produce a <strong>~4% improvement</strong> over no context on average across both benchmarks.</li>
<li>The gain is real, but it is benchmark- and file-quality-dependent.</li>
</ul>
<h4 id="heading-the-instruction-following-paradox">⚡ The Instruction-Following Paradox</h4>
<ul>
<li>Agents follow context file instructions faithfully: when <code>uv</code> is mentioned, agents invoke it <strong>1.6 times per instance</strong> on average, vs. fewer than 0.01 times when it isn't mentioned.</li>
<li>But <strong>more instruction-following ≠ better outcomes</strong>. Agents explore more, run more tests, traverse more files — without meaningfully reaching the right code faster.</li>
<li><em>"A map of the whole city doesn't tell you which building to walk into."</em></li>
</ul>
<h4 id="heading-why-human-files-win-the-redundancy-problem">🔍 Why Human Files Win: The Redundancy Problem</h4>
<ul>
<li>LLM-generated files tend to <strong>restate information already in READMEs and docs</strong> — additive noise, not additive value.</li>
<li>When existing documentation was <em>removed</em> before generating context files, LLM-generated files improved by <strong>2.7%</strong> and actually outperformed human-written ones.</li>
<li>Human-written files capture <strong>non-obvious, non-redundant information</strong>: quirky CI setups, non-default tooling choices, undocumented conventions.</li>
</ul>
<hr />
<h3 id="heading-limitations">Limitations</h3>
<ul>
<li>Study limited to <strong>Python repositories</strong> — generalisability to TypeScript, Rust, multi-language codebases is unknown.</li>
<li>Only measures <strong>issue resolution success</strong>, not security, consistency, or convention adherence.</li>
<li>No longitudinal data on how context file quality or agent utilisation evolves over time.</li>
</ul>
<hr />
<h3 id="heading-practical-takeaways">Practical Takeaways</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Principle</td><td>Detail</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Write for the gap</strong></td><td>Only encode what the repo doesn't already explain — non-default tool choices, unusual test configs, hidden constraints.</td></tr>
<tr>
<td><strong>Avoid restating the README</strong></td><td>A <code>CLAUDE.md</code> that duplicates existing docs likely hurts more than it helps.</td></tr>
<tr>
<td><strong>Respect the cost floor</strong></td><td>Every context file adds ~20% to inference cost. High-volume pipelines should weigh this carefully.</td></tr>
<tr>
<td><strong>Fix LLM-generated files</strong></td><td>Auto-generators should be designed to explicitly <em>avoid</em> restating existing docs and focus on extracting non-obvious conventions.</td></tr>
<tr>
<td><strong>Keep files minimal and specific</strong></td><td>Less is more — specificity beats comprehensiveness.</td></tr>
</tbody>
</table>
</div><hr />
<h3 id="heading-bottom-line">Bottom Line</h3>
<p>Context files are <strong>not magic, but not useless</strong>. Human-written, specific, non-redundant files improve agent performance. Auto-generated files that recycle existing documentation actively reduce it. In both cases, the mechanism is the same: agents follow instructions, and outcome quality depends entirely on instruction quality. Getting this balance right is both a <strong>context file design problem</strong> and a <strong>model training problem</strong>.</p>
<hr />
<h3 id="heading-resources">Resources</h3>
<ul>
<li><a target="_blank" href="https://arxiv.org/abs/2502.18822">Paper: Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?</a></li>
<li><a target="_blank" href="https://github.com/eth-sri/agentbench">AGENTbench Dataset</a></li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a9586fb/E1DPK7TLS7tUkZ8TDNuuM_AP65RvhZ.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586fb/_n0Ebr2DVn-GtMOTJShwj_okbVRfwA.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents Weekly: Evaluating Agents]]></title><description><![CDATA[Read the original article
AI Agents Weekly: Evaluating AGENTS.md & More
From Elvis Saravia's AI Newsletter — February 28, 2026
Main Thesis
This issue covers a wide range of AI agent developments, with the headline story challenging a widely adopted p...]]></description><link>https://rzem.guru/ai-agents-weekly-evaluating-agents-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-weekly-evaluating-agents-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:22:28 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586f4/TZ7BQk_jMV6RC7vGfgXiU_SJYWxkge.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/ai-agents-weekly-evaluating-agentsmd">Read the original article</a></p>
<h2 id="heading-ai-agents-weekly-evaluating-agentsmd-amp-more">AI Agents Weekly: Evaluating AGENTS.md &amp; More</h2>
<p><em>From Elvis Saravia's AI Newsletter — February 28, 2026</em></p>
<h3 id="heading-main-thesis">Main Thesis</h3>
<p>This issue covers a wide range of AI agent developments, with the headline story challenging a widely adopted practice: using repository-level context files (like <code>AGENTS.md</code> or <code>CLAUDE.md</code>) to guide coding agents. Counterintuitively, research shows these files may be doing more harm than good.</p>
<hr />
<h3 id="heading-key-finding-agentsmd-files-hurt-coding-agent-performance">🔬 Key Finding: AGENTS.md Files Hurt Coding Agent Performance</h3>
<p>Researchers from <strong>UIUC and Microsoft Research</strong> evaluated whether repository-level context files actually improve coding agent performance on SWE-bench benchmarks.</p>
<p><strong>Surprising results:</strong></p>
<ul>
<li>❌ <strong>Lower success rates</strong> — Both LLM-generated and human-written context files caused agents to solve <em>fewer</em> tasks compared to agents given <em>no</em> repository context at all.</li>
<li>💸 <strong>Higher inference costs</strong> — Context files increased inference costs by <strong>over 20%</strong>.</li>
<li>🔍 <strong>Broader but less effective exploration</strong> — Agents with context files explored more (more testing, more file traversal), but the additional constraints made tasks <em>harder</em>, not easier.</li>
<li>✅ <strong>Minimal is better</strong> — The authors recommend context files describe only <strong>minimal requirements</strong> rather than comprehensive specifications, as unnecessary constraints actively hurt performance.</li>
</ul>
<p><strong>Practical takeaway:</strong> Developers should rethink how they write <code>AGENTS.md</code>, <code>CLAUDE.md</code>, and similar files — focus on essential guardrails only, not exhaustive instructions.</p>
<p><a target="_blank" href="https://arxiv.org/search/?searchtype=all&amp;query=AGENTS.md+coding+agents+context+files">Paper</a></p>
<hr />
<h3 id="heading-other-stories-covered-paywalled">📰 Other Stories Covered (Paywalled)</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Story</td><td>Summary</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Perplexity Computer</strong></td><td>Perplexity launches a computer-use agent for end-to-end task automation</td></tr>
<tr>
<td><strong>Google Nano Banana 2</strong></td><td>Google releases Nano Banana 2 model for free</td></tr>
<tr>
<td><strong>Sakana AI Doc-to-LoRA &amp; Text-to-LoRA</strong></td><td>Tools for fine-tuning models directly from documents or text</td></tr>
<tr>
<td><strong>Notion Custom Agents 3.3</strong></td><td>Notion launches custom agent capabilities in version 3.3</td></tr>
<tr>
<td><strong>Nous Research Hermes Agent</strong></td><td>Open-source agent model released by Nous Research</td></tr>
<tr>
<td><strong>GPT-5.3-Codex</strong></td><td>OpenAI makes GPT-5.3-Codex available to all developers</td></tr>
<tr>
<td><strong>Mercury 2</strong></td><td>New reasoning diffusion LLM ships from Mercury</td></tr>
<tr>
<td><strong>Qwen 3.5 Medium Series</strong></td><td>Alibaba drops a new medium-sized Qwen model series</td></tr>
<tr>
<td><strong>Claude Code Auto-Memory</strong></td><td>Anthropic ships auto-memory across sessions for Claude Code</td></tr>
<tr>
<td><strong>RoguePilot</strong></td><td>Security vulnerability exposed in GitHub Copilot</td></tr>
<tr>
<td><strong>Vercel Chat SDK</strong></td><td>Vercel open-sources a Chat SDK for multi-platform bot development</td></tr>
</tbody>
</table>
</div><hr />
<h3 id="heading-practical-takeaways">💡 Practical Takeaways</h3>
<ol>
<li><strong>Less is more</strong> when writing agent context files — avoid over-specifying agent behaviour.</li>
<li><strong>Benchmark your context files</strong> — don't assume that more instructions equals better agent performance.</li>
<li>The AI tooling ecosystem is rapidly expanding across coding, browser automation, fine-tuning, and memory management.</li>
<li>Security remains a concern as tools like RoguePilot highlight vulnerabilities in popular AI coding assistants.</li>
</ol>
<p><img src="https://v3b.fal.media/files/b/0a9586f4/wXwyCrinEO6do7hqCMi1x_U4pPA2os.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586f4/TZ7BQk_jMV6RC7vGfgXiU_SJYWxkge.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Top AI Papers of the Week]]></title><description><![CDATA[Read the original article
Top AI Papers of the Week (Feb 23 – Mar 1, 2026)
A roundup of the most impactful AI research papers from Elvis Saravia's weekly newsletter, spanning reasoning efficiency, agent infrastructure, algorithm discovery, and more.
...]]></description><link>https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:21:40 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586ef/hLwc0MczWA-d8oJdLIJ9k_KODbgHY5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-339">Read the original article</a></p>
<h1 id="heading-top-ai-papers-of-the-week-feb-23-mar-1-2026">Top AI Papers of the Week (Feb 23 – Mar 1, 2026)</h1>
<p>A roundup of the most impactful AI research papers from Elvis Saravia's weekly newsletter, spanning reasoning efficiency, agent infrastructure, algorithm discovery, and more.</p>
<hr />
<h2 id="heading-1-deep-thinking-tokens">1. Deep-Thinking Tokens</h2>
<p>Google researchers challenge the assumption that longer outputs mean better reasoning. They introduce <strong>deep-thinking tokens</strong> — tokens where internal model predictions shift significantly across layers before stabilising — measured via Jensen-Shannon divergence between intermediate and final layer distributions. A token qualifies as "deep-thinking" if its prediction only stabilises in the final 15% of layers.</p>
<ul>
<li>Raw token count <strong>negatively</strong> correlates with accuracy (r = -0.59)</li>
<li>Deep-thinking ratio shows a <strong>positive</strong> correlation (r = 0.683)</li>
<li><strong>Think@n</strong> test-time scaling strategy uses high deep-thinking ratio samples to match/exceed self-consistency performance while cutting inference costs ~50%</li>
<li>Validated on AIME 24/25, HMMT 25, GPQA-diamond with GPT-OSS, DeepSeek-R1, Qwen3</li>
</ul>
<p><strong>Takeaway:</strong> Generate tokens that require deeper internal computation, not just more tokens.</p>
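<p>The measurement itself is easy to prototype. Below is a minimal sketch, assuming you already have a per-token array of intermediate-layer prediction distributions (e.g. from a logit lens); the 0.1 divergence threshold is an assumption, while the final-15% rule comes from the paper:</p>
<pre><code class="language-python">import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def is_deep_thinking(layer_dists, threshold=0.1, final_frac=0.15):
    """Flag a token as deep-thinking if its per-layer prediction only
    stabilises (low divergence from the final layer) within the last
    final_frac of layers. layer_dists has shape [n_layers, vocab]."""
    final = layer_dists[-1]
    divs = np.array([js_divergence(d, final) for d in layer_dists])
    stable = divs &lt; threshold              # layers already agreeing with the output
    first_stable = int(np.argmax(stable))  # index of the first stable layer
    return first_stable &gt;= int((1 - final_frac) * len(divs))
</code></pre>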
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-2-codified-context">2. Codified Context</h2>
<p>Single-file AGENTS.md manifests don't scale to large codebases. This paper presents a <strong>three-component infrastructure</strong> built for a 108,000-line C# distributed system, evaluated across 283 development sessions:</p>
<ul>
<li><strong>Hot-memory constitution:</strong> A living document encoding conventions and orchestration protocols consulted at session start</li>
<li><strong>19 domain-expert agents:</strong> Each owns a bounded codebase domain with its own context slice</li>
<li><strong>Cold-memory knowledge base:</strong> 34 on-demand specification documents retrieved only when needed</li>
</ul>
<p><strong>Takeaway:</strong> Tiered context management prevents agents from forgetting conventions and losing coherence on long-running projects.</p>
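<p>A minimal sketch of that tiered layout, assuming hot memory lives in one always-loaded file and cold memory in on-demand spec documents; the file names and class are illustrative, not the paper's implementation:</p>
<pre><code class="language-python">from pathlib import Path

class TieredContext:
    def __init__(self, root: Path):
        # Hot memory: the constitution, always loaded at session start
        self.constitution = (root / "CONSTITUTION.md").read_text()
        # Cold memory: specification docs, retrieved only on demand
        self.cold_store = {p.stem: p for p in (root / "specs").glob("*.md")}

    def session_preamble(self) -&gt; str:
        """Consulted once when a session begins."""
        return self.constitution

    def fetch_spec(self, topic: str) -&gt; str:
        """Cold-memory retrieval: load a spec only when an agent asks."""
        return self.cold_store[topic].read_text()

# Each of the 19 domain agents would receive session_preamble() plus only
# its own bounded slice of the codebase, pulling specs via fetch_spec().
</code></pre>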
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-3-discovering-multi-agent-learning-algorithms-with-llms">3. Discovering Multi-Agent Learning Algorithms with LLMs</h2>
<p>Google DeepMind uses <strong>AlphaEvolve</strong>, an evolutionary coding agent powered by LLMs, to automatically discover new multi-agent learning algorithms for imperfect-information games.</p>
<ul>
<li><strong>VAD-CFR:</strong> A novel iterative regret minimisation variant with volatility-sensitive discounting and consistency-enforced optimism — outperforms Discounted Predictive CFR+</li>
<li><strong>SHOR-PSRO:</strong> A population-based training variant blending Optimistic Regret Matching with temperature-controlled strategy distributions</li>
<li>Algorithms contain novel design choices human researchers hadn't previously considered</li>
</ul>
<p><strong>Takeaway:</strong> LLMs can serve as algorithmic designers, not just code generators, with potential applications in optimisation, scheduling, and resource allocation.</p>
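<p>For orientation, both discovered algorithms extend the classic regret-matching primitive, which fits in a few lines; the variants' volatility-sensitive discounting and optimism terms are what AlphaEvolve layered on top:</p>
<pre><code class="language-python">import numpy as np

def regret_matching(cumulative_regrets):
    """Classic regret matching: play each action in proportion to its
    positive cumulative regret; fall back to uniform if none is positive."""
    positive = np.maximum(cumulative_regrets, 0.0)
    total = positive.sum()
    if total &gt; 0:
        return positive / total
    return np.full(len(cumulative_regrets), 1.0 / len(cumulative_regrets))

# e.g. regrets [2.0, -1.0, 1.0] -&gt; strategy [2/3, 0, 1/3]
print(regret_matching(np.array([2.0, -1.0, 1.0])))
</code></pre>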
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-4-evaluating-agentsmd">4. Evaluating AGENTS.md</h2>
<p>This research evaluates whether AGENTS.md files actually improve AI coding agent performance. Testing Claude Code (Sonnet-4.5), Codex (GPT-5.2 &amp; GPT-5.1 mini), and Qwen Code (Qwen3-30b-coder), the findings are counterintuitive:</p>
<ul>
<li>Human-written AGENTS.md: modest <strong>+4%</strong> improvement in some cases</li>
<li>LLM-generated AGENTS.md: <strong>-2%</strong> performance hit</li>
<li>Both consistently <strong>increase inference cost by 20%+</strong></li>
<li>Context files cause agents to explore more code paths but make tasks harder by introducing noise</li>
</ul>
<p><strong>Takeaway:</strong> Keep AGENTS.md minimal and focused on critical constraints only. Information density matters more than comprehensiveness.</p>
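<p>Benchmarking this on your own repository is straightforward in principle: run the same task set with and without the context file and compare solve rate and token spend. A sketch, with the agent call left as a stub since it depends on your tooling:</p>
<pre><code class="language-python">from statistics import mean

def run_agent(task, context_file=None):
    """Returns (solved, tokens_used); stub this with your agent CLI."""
    raise NotImplementedError

def ab_test(tasks, context_file):
    for label, ctx in [("without", None), ("with", context_file)]:
        results = [run_agent(t, context_file=ctx) for t in tasks]
        solve_rate = mean(1.0 if ok else 0.0 for ok, _ in results)
        avg_tokens = mean(tok for _, tok in results)
        print(f"{label} context file: {solve_rate:.0%} solved, {avg_tokens:,.0f} tokens")
</code></pre>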
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-5-pahf-personalized-agents-from-human-feedback">5. PAHF — Personalized Agents from Human Feedback</h2>
<p>Meta introduces <strong>PAHF</strong>, a continual agent personalisation framework coupling explicit per-user memory with proactive and reactive feedback mechanisms.</p>
<ul>
<li><strong>Three-step loop:</strong> Pre-action clarification → grounding in retrieved preferences → post-action feedback integration</li>
<li>Enables continual learning from live interactions without retraining</li>
<li>Two novel benchmarks in embodied manipulation and online shopping measuring preference learning and adaptation</li>
<li>Outperforms no-memory and single-channel baselines; reduces initial personalisation error and adapts rapidly to persona shifts</li>
</ul>
<p><strong>Takeaway:</strong> Combining persistent memory with dual feedback channels is essential for practical agent personalisation.</p>
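<p>The loop itself is simple to sketch. Below, memory is a plain per-user dict and the agent calls are injected as functions, since Meta's actual interfaces are not described in the summary:</p>
<pre><code class="language-python">memory = {}   # user -&gt; recorded preferences and feedback

def pahf_step(user, task, ask, act, collect_feedback):
    prefs = memory.setdefault(user, [])
    # 1. Pre-action clarification: ask up front when nothing is known yet
    if not prefs:
        prefs.append(ask(task))
    # 2. Grounding: condition the action on retrieved preferences
    action = act(task, prefs)
    # 3. Post-action feedback: fold the reaction back into memory
    prefs.append(collect_feedback(action))   # continual learning, no retraining
    return action
</code></pre>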
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-6-doc-to-lora">6. Doc-to-LoRA</h2>
<p>Sakana AI introduces <strong>Doc-to-LoRA (D2L)</strong>, a lightweight hypernetwork that meta-learns to compress long documents into LoRA adapters in a single forward pass.</p>
<ul>
<li>Converts documents into parameter-space representations, eliminating expensive re-processing</li>
<li>Achieves near-perfect zero-shot accuracy on needle-in-a-haystack tasks at <strong>4x beyond</strong> the target LLM's native context window</li>
<li>Outperforms standard long-context approaches on QA datasets while consuming less memory</li>
<li>Ideal for repeated-query applications: compress once, amortise cost across all queries</li>
</ul>
<p><strong>Takeaway:</strong> Parametric compression can extend context capabilities without architectural changes.</p>
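<p>The amortisation argument is just arithmetic: the document is processed once instead of once per query. With illustrative numbers (not figures from the paper):</p>
<pre><code class="language-python"># Back-of-envelope: re-reading a long document every query vs.
# compressing it once into an adapter. All numbers are assumptions.
doc_tokens    = 200_000   # document length
query_tokens  = 500       # per-query prompt overhead
n_queries     = 50

long_context_cost = n_queries * (doc_tokens + query_tokens)
d2l_cost          = doc_tokens + n_queries * query_tokens  # one compression pass

print(f"long-context:  {long_context_cost:,} tokens processed")
print(f"compress-once: {d2l_cost:,} tokens processed")
# Savings grow linearly with the number of queries against the same document.
</code></pre>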
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-7-agentconductor">7. AgentConductor</h2>
<p><strong>AgentConductor</strong> is a reinforcement learning-enhanced multi-agent system for code generation that dynamically generates interaction topologies based on task characteristics.</p>
<ul>
<li>LLM-based orchestrator builds density-aware layered DAG topologies tailored to problem difficulty</li>
<li>Simple problems → sparse topologies; complex problems → denser collaboration</li>
<li>Outperforms strongest baseline by <strong>up to 14.6% in pass@1</strong> accuracy with 13% density reduction and <strong>68% token cost reduction</strong></li>
<li>Execution feedback refines topologies adaptively when initial solutions fail</li>
</ul>
<p><strong>Takeaway:</strong> Adaptive topology generation eliminates redundant agent communication and dramatically cuts costs.</p>
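<p>The topology idea reduces to one knob: edge density between layers scales with estimated task difficulty. A toy sketch of that knob (the real orchestrator is itself an LLM, not a random sampler):</p>
<pre><code class="language-python">import itertools, random

def layered_dag(n_agents, n_layers, density):
    """Build a layered DAG whose inter-layer edge density is a parameter;
    sparse for easy tasks, dense for hard ones."""
    layers = [list(range(i, n_agents, n_layers)) for i in range(n_layers)]
    edges = []
    for a, b in zip(layers, layers[1:]):
        for u, v in itertools.product(a, b):
            if random.random() &lt; density:
                edges.append((u, v))
    return layers, edges

# easy task -&gt; few edges; hard task -&gt; dense collaboration
print(len(layered_dag(9, 3, 0.2)[1]), len(layered_dag(9, 3, 0.9)[1]))
</code></pre>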
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-8-actionengine">8. ActionEngine</h2>
<p>Georgia Tech and Microsoft Research introduce <strong>ActionEngine</strong>, a training-free framework transforming GUI agents from reactive executors into programmatic planners.</p>
<ul>
<li>Builds state-machine memory through offline exploration</li>
<li>Synthesises executable Python programs for task completion</li>
<li>Achieves <strong>95% success on Reddit tasks</strong> from WebArena with a single LLM call on average</li>
<li><strong>11.8x cost reduction</strong> and <strong>2x latency reduction</strong> vs. vision-only baselines</li>
</ul>
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-9-cot-faithfulness-via-remul">9. CoT Faithfulness via REMUL</h2>
<p><strong>REMUL</strong> is a training approach making chain-of-thought reasoning more faithful and monitorable. A speaker model generates reasoning traces that multiple listener models attempt to follow, with RL rewarding reasoning understandable to other models.</p>
<ul>
<li>Improves three faithfulness metrics while boosting overall accuracy</li>
<li>Produces shorter, more direct reasoning chains</li>
<li>Tested on BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO</li>
</ul>
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-10-learning-to-rewrite-tool-descriptions">10. Learning to Rewrite Tool Descriptions</h2>
<p>Intuit AI Research introduces <strong>Trace-Free+</strong>, a curriculum learning framework that optimises tool descriptions for LLM agents (not humans) without relying on execution traces.</p>
<ul>
<li>Consistent gains on unseen tools and strong cross-domain generalisation</li>
<li>Robust as candidate tool count scales to over 100</li>
<li>Demonstrates that improving tool interfaces complements agent fine-tuning</li>
</ul>
<p><a target="_blank" href="https://arxiv.org/abs/2502.XXXXX">Paper</a></p>
<hr />
<h2 id="heading-key-themes-this-week">Key Themes This Week</h2>
<ul>
<li><strong>Efficiency over verbosity:</strong> Better reasoning comes from deeper computation, not more tokens</li>
<li><strong>Scalable agent infrastructure:</strong> Tiered memory and specialised agents beat monolithic context files</li>
<li><strong>LLMs as designers:</strong> Evolutionary LLM systems can discover novel algorithms autonomously</li>
<li><strong>Context file caveats:</strong> AGENTS.md files can hurt as much as help — keep them lean</li>
<li><strong>Personalisation at scale:</strong> Persistent memory + dual feedback is the blueprint for adaptive agents</li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a9586ef/polnxWg46eUGHmjKYN8_i_Y3CDOGfq.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586ef/hLwc0MczWA-d8oJdLIJ9k_KODbgHY5.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents Weekly: AI Labor Market]]></title><description><![CDATA[Read the original article
AI Agents Weekly: AI Labor Market Impacts & More
From Elvis Saravia's AI Newsletter — March 7, 2026
Main Thesis
This issue covers a broad sweep of AI agent developments, with the headline story being Anthropic's new framewor...]]></description><link>https://rzem.guru/ai-agents-weekly-ai-labor-market-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-weekly-ai-labor-market-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:20:21 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586e6/6qkHScyFkfeCOjhhe8UOF_mxtJf0CX.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/ai-agents-weekly-ai-labor-market">Read the original article</a></p>
<h1 id="heading-ai-agents-weekly-ai-labor-market-impacts-amp-more">AI Agents Weekly: AI Labor Market Impacts &amp; More</h1>
<p><em>From Elvis Saravia's AI Newsletter — March 7, 2026</em></p>
<h2 id="heading-main-thesis">Main Thesis</h2>
<p>This issue covers a broad sweep of AI agent developments, with the headline story being Anthropic's new framework for measuring AI's real-world impact on the labor market — moving beyond theoretical capability to actual usage data.</p>
<hr />
<h2 id="heading-top-stories-accessible-content">🔑 Top Stories (Accessible Content)</h2>
<h3 id="heading-1-labor-market-impacts-of-ai-anthropic">1. 📊 Labor Market Impacts of AI (Anthropic)</h3>
<p>Anthropic published a new framework introducing <strong>"observed exposure"</strong> — a metric combining theoretical LLM capability with real Claude usage data from the <strong>Anthropic Economic Index</strong>.</p>
<p><strong>Key Findings:</strong></p>
<ul>
<li><strong>Programmer exposure is highest:</strong> Computer programmers show <strong>75% task coverage</strong>, followed by customer service reps and data entry keyers at <strong>67%</strong></li>
<li><strong>No unemployment signal yet:</strong> Analysis of Current Population Survey data shows no systematic unemployment increase in highly-exposed occupations since late 2022 (framework sensitive to ~1 percentage point changes)</li>
<li><strong>Youth hiring slowdown:</strong> Workers aged <strong>22–25</strong> in exposed occupations saw a <strong>14% drop in job-finding rates</strong> vs. 2022, corroborating findings from Brynjolfsson et al. using ADP payroll data</li>
<li><strong>Massive capability gap:</strong> Claude currently covers only <strong>33% of tasks</strong> in Computer &amp; Math occupations, despite <strong>94% being theoretically feasible</strong> — signalling significant future displacement potential as adoption deepens</li>
</ul>
<blockquote>
<p><strong>Practical Takeaway:</strong> AI displacement is real but uneven and still early-stage. The greatest near-term risk is in coding, support, and data roles — and among young workers entering the job market.</p>
</blockquote>
<hr />
<h3 id="heading-2-google-workspace-cli">2. 🖥️ Google Workspace CLI</h3>
<p>Google released an official <strong>command-line tool</strong> for its Workspace APIs (Drive, Gmail, Calendar, Sheets, Docs, Chat, Admin) — built in <strong>Rust</strong>, distributed via <strong>npm</strong>, and dynamically generated from Google's Discovery Service.</p>
<p><strong>Key Features:</strong></p>
<ul>
<li><strong>100+ agent skills</strong> with structured SKILL.md files and 50 curated workflow recipes</li>
<li><strong>Built-in MCP server</strong> allowing AI assistants (Claude, Gemini, etc.) to connect and operate on Workspace programmatically</li>
<li><strong>Dynamic API coverage</strong> — auto-updates as Google ships new APIs, no hardcoded endpoints</li>
<li><strong>Agent-first design</strong> — structured metadata, input/output schemas, and example prompts make it immediately usable by coding agents and automation pipelines</li>
</ul>
<blockquote>
<p><strong>Practical Takeaway:</strong> Google Workspace is now a <strong>tool-callable environment for AI agents</strong>, dramatically lowering the barrier for building agentic workflows on top of everyday productivity tools.</p>
</blockquote>
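<p>As a concrete illustration, here is how an agent might attach to that built-in MCP server using the MCP Python SDK's stdio client. The <code>gws</code> command name and <code>mcp</code> subcommand are assumptions; check the CLI's documentation for the real invocation and tool names:</p>
<pre><code class="language-python">import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Hypothetical invocation of the Workspace CLI's MCP server mode
    server = StdioServerParameters(command="gws", args=["mcp"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()       # discover Workspace skills
            print([t.name for t in tools.tools])

asyncio.run(main())
</code></pre>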
<hr />
<h2 id="heading-other-headlines-paywalled-titles-only">📰 Other Headlines (Paywalled — Titles Only)</h2>
<ul>
<li><strong>GPT-5.4</strong> launched by OpenAI with native computer use</li>
<li><strong>Exa Deep</strong> puts an agent inside every search</li>
<li><strong>Cognition</strong> previews SWE-1.6 training run</li>
<li><strong>Gemini 3.1 Flash-Lite</strong> drops with significant gains</li>
<li><strong>Qwen 3.5</strong> small model series released</li>
<li><strong>Liquid AI</strong> releases LFM2-24B-A2B model</li>
<li><strong>Cursor</strong> lands in JetBrains via ACP</li>
<li><strong>OpenAI Codex Security Agent</strong> launched</li>
<li><strong>OpenAI</strong> publishes CoT Controllability research</li>
<li><strong>Claude Opus</strong> hacks its own benchmark eval</li>
</ul>
<hr />
<h2 id="heading-papers-mentioned">📄 Papers Mentioned</h2>
<ul>
<li>Brynjolfsson et al. (ADP payroll data study on AI labor market effects) — no direct arXiv link provided in accessible content</li>
</ul>
<hr />
<h2 id="heading-key-takeaways">🧠 Key Takeaways</h2>
<ol>
<li>AI labor displacement is <strong>measurable and underway</strong>, but lagging far behind theoretical capability</li>
<li><strong>Young workers and programmers</strong> face the sharpest near-term risk</li>
<li>Google's Workspace CLI signals a shift toward <strong>infrastructure-level AI agent support</strong> from major platforms</li>
<li>The gap between what AI <em>can</em> do and what it <em>is</em> doing in workplaces remains large — but is closing</li>
</ol>
<p><img src="https://v3b.fal.media/files/b/0a9586e6/DLvBmls0lu9YcWg0XngR9_1JFcbEBp.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586e6/6qkHScyFkfeCOjhhe8UOF_mxtJf0CX.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Top AI Papers of the Week]]></title><description><![CDATA[Read the original article
Top AI Papers of the Week (March 1–8, 2026)
From Elvis Saravia's AI Newsletter
This week's roundup covers ten significant AI research papers spanning agentic systems, LLM reasoning, multi-agent coordination, theorem proving,...]]></description><link>https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:19:16 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586e1/RB8v6w7cpE9rL4sx6E-Ax_gsctLDCH.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-8c6">Read the original article</a></p>
<h1 id="heading-top-ai-papers-of-the-week-march-18-2026">Top AI Papers of the Week (March 1–8, 2026)</h1>
<p><em>From Elvis Saravia's AI Newsletter</em></p>
<p>This week's roundup covers ten significant AI research papers spanning agentic systems, LLM reasoning, multi-agent coordination, theorem proving, memory architectures, and efficient multimodal models.</p>
<hr />
<h2 id="heading-1-neuroskill-brain-computer-interface-meets-agentic-ai">1. NeuroSkill — Brain-Computer Interface Meets Agentic AI</h2>
<p>MIT researchers introduce <strong>NeuroSkill</strong>, a proactive agentic system that reads Brain-Computer Interface (BCI) signals in real time to anticipate user needs — rather than waiting for explicit commands.</p>
<ul>
<li>Runs a custom agentic loop called <strong>NeuroLoop</strong> that processes neural/biophysical signals through a foundation EXG model, converts them into state-of-mind descriptions, and triggers tool calls accordingly.</li>
<li>Fully <strong>offline edge deployment</strong> — no cloud dependency, ensuring privacy and low latency.</li>
<li>Handles both explicit and <strong>implicit requests</strong>, detecting cognitive overload or emotional shifts before the user asks for help.</li>
<li>Released under <strong>GPLv3 + AI100 ethical licensing</strong> for auditable, responsible use.</li>
</ul>
<p><strong>Takeaway:</strong> Proactive AI that interprets brain signals could fundamentally change human-computer interaction, especially for accessibility and high-cognitive-load environments.</p>
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-2-bayesian-teaching-for-llms">2. Bayesian Teaching for LLMs</h2>
<p>Google researchers show that LLMs can be trained to reason like Bayesians by fine-tuning on synthetic interactions with an idealised <strong>Bayesian Assistant</strong>.</p>
<ul>
<li>Constructs training data from a <strong>Bayesian Assistant</strong> demonstrating optimal probabilistic belief updating — no architectural changes required.</li>
<li>Trained models <strong>generalise</strong> to entirely new task types, suggesting Bayesian inference is a transferable capability.</li>
<li>Substantially reduces known LLM biases like <strong>base rate neglect</strong> and <strong>conservatism</strong>.</li>
<li>A smaller model trained on Bayesian interactions <strong>outperforms larger models</strong> reasoning from scratch — reinforcing data quality over scale.</li>
</ul>
<p><strong>Takeaway:</strong> Carefully curated synthetic training data can instil normative reasoning patterns that raw scale cannot, with broad implications for reliability in probabilistic domains.</p>
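<p>Base rate neglect, one of the biases the paper targets, is easy to make concrete. Here is the standard worked example with an illustrative 1%-prevalence test; the numbers are not from the paper:</p>
<pre><code class="language-python">prior       = 0.01   # P(disease)
sensitivity = 0.90   # P(positive | disease)
false_pos   = 0.05   # P(positive | no disease)

posterior = (sensitivity * prior) / (
    sensitivity * prior + false_pos * (1 - prior)
)
print(f"P(disease | positive) = {posterior:.1%}")   # ~15.4%, not 90%
# Neglecting the 1% base rate is exactly what pushes naive answers toward 90%.
</code></pre>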
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-3-why-llms-form-geometric-representations">3. Why LLMs Form Geometric Representations</h2>
<p>This paper mathematically proves why LLMs spontaneously develop striking geometric structures — calendar months form circles, historical years form spirals, spatial coordinates align to manifolds.</p>
<ul>
<li>Root cause is <strong>translation symmetry</strong> in co-occurrence statistics: month pairs co-occur based on time interval, not the months themselves, which forces circular geometry.</li>
<li>Derives manifold geometry <strong>analytically</strong> from data statistics rather than just observing it post-hoc.</li>
<li>Continuous concepts (e.g., years, number lines) form <strong>rippled 1D manifolds</strong>; cyclic concepts form circles — both analytically predicted.</li>
<li>The mechanism is <strong>universal</strong> across model architectures, emerging whenever co-occurrence statistics are governed by a latent variable.</li>
</ul>
<p><strong>Takeaway:</strong> Geometric structure in LLM representations is not an architectural accident — it is a mathematical consequence of how language statistics are structured.</p>
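<p>The claim is easy to verify in miniature: if similarity between months depends only on their cyclic interval, the similarity matrix is circulant, its leading non-constant eigenvectors are Fourier modes, and the spectral embedding of the twelve months is exactly a circle. A small numpy check (the exponential kernel is an arbitrary choice):</p>
<pre><code class="language-python">import numpy as np

n = 12
idx = np.arange(n)
gap = np.abs(idx[:, None] - idx[None, :])
dist = np.minimum(gap, n - gap)        # cyclic interval between months
sim = np.exp(-dist / 2.0)              # any interval-only kernel will do

vals, vecs = np.linalg.eigh(sim)       # ascending; last column is the constant mode
xy = vecs[:, -3:-1] * np.sqrt(vals[-3:-1])   # the two leading Fourier modes
radii = np.linalg.norm(xy, axis=1)
print(np.allclose(radii, radii[0]))    # True: all 12 months lie on one circle
</code></pre>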
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-4-theory-of-mind-in-multi-agent-llms">4. Theory of Mind in Multi-Agent LLMs</h2>
<p>This work evaluates a multi-agent architecture combining <strong>Theory of Mind (ToM)</strong>, <strong>Belief-Desire-Intention (BDI)</strong> models, and <strong>symbolic solvers</strong> on resource allocation problems.</p>
<ul>
<li>The counterintuitive central finding: <strong>adding cognitive mechanisms does not automatically improve coordination</strong>.</li>
<li>Stronger LLMs benefit from ToM and BDI; weaker models can be <strong>confused</strong> by the additional reasoning overhead.</li>
<li>Symbolic verification helps ground decisions in formal constraints and acts as a <strong>stabiliser</strong>.</li>
<li>Key design principle: <strong>match cognitive complexity to model capability</strong>.</li>
</ul>
<p><strong>Takeaway:</strong> For multi-agent system designers, the sophistication of cognitive scaffolding must be calibrated to the underlying model's capability — more is not always better.</p>
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-5-numina-lean-agent-general-coding-agent-for-theorem-proving">5. Numina-Lean-Agent — General Coding Agent for Theorem Proving</h2>
<p><strong>Numina-Lean-Agent</strong> reframes automated theorem proving by using a <strong>general-purpose coding agent</strong> (Claude Code) rather than a specialised prover system.</p>
<ul>
<li>Combines <strong>Claude Code</strong> with <strong>Numina-Lean-MCP</strong> to autonomously interact with the Lean proof assistant, accessing theorem libraries and reasoning tools.</li>
<li>Uses <strong>Model Context Protocol (MCP)</strong> for tool integration: Lean-LSP-MCP, LeanDex for semantic theorem retrieval, and an informal prover for proof strategies.</li>
<li>Using <strong>Claude Opus 4.5</strong>, solves all 12 problems on <strong>Putnam 2025</strong> — matching the best closed-source systems.</li>
<li>Also formalised the <strong>Brascamp-Lieb theorem</strong> through direct collaboration with mathematicians.</li>
<li>Fully <strong>open-source</strong> under Creative Commons BY 4.0.</li>
</ul>
<p><strong>Takeaway:</strong> General-purpose agents with the right tool integrations can match specialised theorem-proving systems — and improve simply by upgrading the base model.</p>
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-6-parammem-parametric-memory-for-diverse-self-reflection">6. ParamMem — Parametric Memory for Diverse Self-Reflection</h2>
<p><strong>ParamMem</strong> addresses the repetitive reflection problem in self-improving agents by encoding cross-sample reflection patterns into model parameters.</p>
<ul>
<li>Standard self-reflection produces near-identical outputs across iterations — adding noise rather than useful signal.</li>
<li><strong>Reflective diversity strongly correlates with task success</strong>; ParamMem enables diverse reflections via temperature-controlled sampling.</li>
<li>Uses a <strong>three-tier memory architecture</strong>: parametric memory (cross-sample patterns), episodic memory (task instances), and cross-sample memory (global strategies).</li>
<li>Supports <strong>weak-to-strong transfer</strong>: reflection patterns from smaller models transfer to larger ones.</li>
<li>Consistently outperforms baselines on <strong>code generation, mathematical reasoning, and multi-hop QA</strong>.</li>
</ul>
<p><strong>Takeaway:</strong> Diversity in self-reflection is a measurable driver of agent performance, and parametric memory is an efficient mechanism to achieve it without relying on larger external models.</p>
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-7-auton-declarative-agentic-ai-framework">7. Auton — Declarative Agentic AI Framework</h2>
<p>Snap Research introduces <strong>Auton</strong>, a declarative architecture for specifying, governing, and executing autonomous agent systems at production scale.</p>
<ul>
<li>Separates the <strong>Cognitive Blueprint</strong> (declarative, language-agnostic agent specification) from the <strong>Runtime Engine</strong>, enabling cross-language portability and formal auditability.</li>
<li>Formalises agent execution as an <strong>augmented Partially Observable Markov Decision Process</strong> with a latent reasoning space.</li>
<li>Introduces <strong>biologically-inspired hierarchical memory consolidation</strong> modelled on human episodic memory.</li>
<li>Runtime optimisations include <strong>parallel graph execution, speculative inference, and dynamic context pruning</strong>.</li>
<li>Safety enforced via a <strong>constraint manifold formalism</strong> using policy projection — not post-hoc filtering.</li>
</ul>
<p><strong>Takeaway:</strong> Auton provides a rigorous, production-oriented foundation for building deterministic, auditable, and efficient multi-step agent systems.</p>
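<p>Policy projection can be pictured in its simplest possible case: zero out probability on actions outside the constraint set and renormalise, so unsafe actions are unreachable rather than filtered after the fact. Auton's constraint-manifold formalism is more general than this toy version:</p>
<pre><code class="language-python">import numpy as np

def project_policy(probs, legal_mask):
    """Project an action distribution onto the legal set: illegal actions
    get exactly zero mass, and the remainder is renormalised."""
    projected = np.where(legal_mask, probs, 0.0)
    total = projected.sum()
    if total == 0:
        raise ValueError("no legal action available")
    return projected / total

print(project_policy(np.array([0.5, 0.3, 0.2]),
                     np.array([True, False, True])))
# -&gt; [0.714..., 0.0, 0.285...]
</code></pre>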
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-8-aegean-consensus-protocol-for-multi-agent-llms">8. Aegean — Consensus Protocol for Multi-Agent LLMs</h2>
<p><strong>Aegean</strong> reframes multi-agent refinement as a <strong>distributed consensus problem</strong>, enabling early termination when sufficient agents converge on an answer.</p>
<ul>
<li>Achieves <strong>1.2–20x latency reduction</strong> across four mathematical reasoning benchmarks while maintaining answer quality within 2.5%.</li>
<li>Uses a <strong>consensus-aware serving engine</strong> with incremental quorum detection to cut wasted compute on stragglers.</li>
<li>Replaces static heuristic workflows with dynamic, convergence-driven termination.</li>
</ul>
<p><strong>Takeaway:</strong> Treating multi-agent agreement as a distributed systems problem yields major efficiency gains without sacrificing accuracy.</p>
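<p>The serving idea is straightforward to sketch with asyncio: sample many agents, tally answers as they land, and return as soon as a quorum agrees rather than waiting for stragglers. The stub agent and the quorum size are illustrative:</p>
<pre><code class="language-python">import asyncio, collections, random

async def agent_answer(i):
    """Stub agent: replace with a real model call."""
    await asyncio.sleep(random.random())        # simulate variable latency
    return random.choice(["42", "42", "41"])    # mostly-agreeing answers

async def quorum(n_agents=8, need=4):
    """Return as soon as `need` agents agree, instead of waiting for all."""
    counts = collections.Counter()
    tasks = [asyncio.create_task(agent_answer(i)) for i in range(n_agents)]
    try:
        for fut in asyncio.as_completed(tasks):
            ans = await fut
            counts[ans] += 1
            if counts[ans] &gt;= need:
                return ans                       # stragglers cancelled below
        return counts.most_common(1)[0][0]       # no quorum: fall back to plurality
    finally:
        for t in tasks:
            t.cancel()

print(asyncio.run(quorum()))
</code></pre>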
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-9-diagnosing-agent-memory-retrieval-vs-utilisation-failures">9. Diagnosing Agent Memory — Retrieval vs. Utilisation Failures</h2>
<p>This paper introduces a <strong>diagnostic framework</strong> that separates two failure modes in LLM agent memory: retrieval failures and utilisation failures.</p>
<ul>
<li>A <strong>3×3 factorial study</strong> crossing three write strategies with three retrieval methods reveals retrieval is the <strong>dominant bottleneck</strong>, accounting for 11–46% of errors.</li>
<li>Utilisation failures remain stable at <strong>4–8% regardless of configuration</strong> — suggesting the model's ability to use retrieved information is relatively robust.</li>
<li><strong>Hybrid reranking</strong> cuts retrieval failures roughly in half — a larger gain than any write strategy optimisation.</li>
</ul>
<p><strong>Takeaway:</strong> When debugging agent memory systems, prioritise retrieval quality over write strategy; hybrid reranking is the highest-leverage intervention.</p>
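<p>Hybrid reranking is commonly implemented as normalised score fusion between a lexical scorer and a dense scorer; the paper's exact recipe may differ, but the shape is roughly:</p>
<pre><code class="language-python">def hybrid_rerank(bm25_scores, dense_scores, alpha=0.5):
    """Min-max normalise each scorer, then blend with weight alpha."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}
    b, d = norm(bm25_scores), norm(dense_scores)
    fused = {doc: alpha * b.get(doc, 0) + (1 - alpha) * d.get(doc, 0)
             for doc in set(b) | set(d)}
    return sorted(fused, key=fused.get, reverse=True)

print(hybrid_rerank({"m1": 12.0, "m2": 3.0}, {"m1": 0.2, "m3": 0.9}))
</code></pre>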
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-10-phi-4-reasoning-vision-15b-compact-multimodal-reasoning">10. Phi-4-reasoning-vision-15B — Compact Multimodal Reasoning</h2>
<p>Microsoft presents <strong>Phi-4-reasoning-vision-15B</strong>, a compact open-weight multimodal model combining visual understanding with structured reasoning.</p>
<ul>
<li>Trained on only <strong>200 billion tokens</strong> of multimodal data, excelling at math, science reasoning, and UI comprehension.</li>
<li>Requires <strong>significantly less compute</strong> than comparable open-weight vision-language models.</li>
<li>Key insight: <strong>systematic filtering, error correction, and synthetic augmentation</strong> are the primary performance levers — pushing the accuracy-compute Pareto frontier.</li>
</ul>
<p><strong>Takeaway:</strong> Efficient multimodal reasoning at 15B parameters is achievable through rigorous data curation, reinforcing that data quality remains the dominant factor over raw scale.</p>
<p><a target="_blank" href="https://arxiv.org">Paper</a></p>
<hr />
<h2 id="heading-key-themes-this-week">Key Themes This Week</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Theme</td><td>Papers</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Data quality over scale</strong></td><td>Bayesian Teaching, Phi-4-reasoning-vision</td></tr>
<tr>
<td><strong>Proactive / agentic systems</strong></td><td>NeuroSkill, Auton, Numina-Lean-Agent</td></tr>
<tr>
<td><strong>Memory &amp; reflection diversity</strong></td><td>ParamMem, Diagnosing Agent Memory</td></tr>
<tr>
<td><strong>Multi-agent coordination</strong></td><td>Theory of Mind, Aegean</td></tr>
<tr>
<td><strong>Geometric structure in LLMs</strong></td><td>Why LLMs Form Geometric Representations</td></tr>
</tbody>
</table>
</div><blockquote>
<p><em>Note: Arxiv links were not directly provided in the source article. The [Paper] links above point to arxiv.org as placeholders — check Elvis Saravia's original newsletter for direct paper URLs.</em></p>
</blockquote>
<p><img src="https://v3b.fal.media/files/b/0a9586e1/tpB8G3ra4Cf5Sud8bR8wM_HNZuJx2N.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586e1/RB8v6w7cpE9rL4sx6E-Ax_gsctLDCH.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents Weekly: Claude Code Review]]></title><description><![CDATA[Read the original article
AI Agents Weekly: Claude Code Review & More
From Elvis Saravia's AI Newsletter — March 14, 2026
Main Thesis
This issue covers a wave of practical AI agent tooling shipping in production, with a focus on multi-agent architect...]]></description><link>https://rzem.guru/ai-agents-weekly-claude-code-review-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-weekly-claude-code-review-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:17:48 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586d6/3oDsUmtn4zhHuM7Rw8v0C_EiCaJi4Z.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/ai-agents-weekly-claude-code-review">Read the original article</a></p>
<h2 id="heading-ai-agents-weekly-claude-code-review-amp-more">AI Agents Weekly: Claude Code Review &amp; More</h2>
<p><em>From Elvis Saravia's AI Newsletter — March 14, 2026</em></p>
<h3 id="heading-main-thesis">Main Thesis</h3>
<p>This issue covers a wave of practical AI agent tooling shipping in production, with a focus on multi-agent architectures for code quality, automated safety constraints, and expanding AI infrastructure ecosystems.</p>
<hr />
<h3 id="heading-top-story-1-claude-code-review-anthropic">🔍 Top Story 1: Claude Code Review (Anthropic)</h3>
<p>Anthropic launched <strong>Code Review for Claude Code</strong> — an automated multi-agent system that reviews every pull request by dispatching parallel AI agents to scan, verify, and prioritize issues.</p>
<p><strong>How it works:</strong></p>
<ul>
<li>Multiple agents run in parallel: one scans for issues, others verify findings to eliminate false positives, and a final pass ranks bugs by severity</li>
<li>Outputs both a <strong>summary comment</strong> and <strong>inline code annotations</strong></li>
</ul>
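<p>Structurally, that is a parallel fan-out followed by two filtering passes. A sketch with the agent calls stubbed out, since Anthropic has not published the internal prompts or interfaces:</p>
<pre><code class="language-python">from concurrent.futures import ThreadPoolExecutor

def review_pr(diff, scan_agents, verify_agent, rank_agent):
    # Several scanners run in parallel over the same diff
    with ThreadPoolExecutor() as pool:
        findings = [f for fs in pool.map(lambda a: a(diff), scan_agents)
                    for f in fs]
    # A second pass re-checks each finding to drop false positives
    confirmed = [f for f in findings if verify_agent(diff, f)]
    # A final pass orders the survivors by severity
    return rank_agent(confirmed)
</code></pre>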
<p><strong>Key findings:</strong></p>
<ul>
<li>Large PRs (1,000+ lines): surfaced findings on 84% of PRs, averaging <strong>7.5 issues per PR</strong></li>
<li>Small PRs (&lt;50 lines): surfaced findings on 31% of PRs</li>
<li><strong>&lt;1% of flagged issues</strong> were marked incorrect by Anthropic engineers</li>
<li>Caught production-critical bugs that appeared routine in diffs</li>
</ul>
<p><strong>Pricing &amp; Access:</strong></p>
<ul>
<li>Available as a research preview for <strong>Team and Enterprise</strong> customers</li>
<li>Costs <strong>$15–25 per PR</strong>, billed on token usage</li>
<li>Configurable monthly caps and per-repo controls</li>
</ul>
<hr />
<h3 id="heading-top-story-2-autoharness-automated-agent-constraint-synthesis">🔍 Top Story 2: AutoHarness — Automated Agent Constraint Synthesis</h3>
<p>Researchers introduced <strong>AutoHarness</strong>, a technique enabling LLMs to automatically synthesize protective code harnesses around themselves — preventing illegal actions without human-written constraints.</p>
<p><strong>Key findings:</strong></p>
<ul>
<li>In a recent LLM chess competition, <strong>78% of Gemini-2.5-Flash losses</strong> were due to illegal moves — AutoHarness eliminates this failure class entirely</li>
<li>Tested across <strong>145 different TextArena games</strong></li>
<li><strong>Gemini-2.5-Flash + AutoHarness</strong> outperformed the larger <strong>Gemini-2.5-Pro</strong> (unconstrained), at lower cost</li>
<li>Achieves <strong>zero-shot generalization</strong>: extends beyond games to full policy generation in code, removing runtime LLM decision-making entirely</li>
<li>Outperforms <strong>GPT-5.2-High</strong> on certain benchmarks</li>
</ul>
<p><strong>Core insight:</strong> Rather than trusting a model to self-constrain, auto-generate a verified harness that makes illegal states <em>unreachable</em> — shifting safety from model behaviour to environment design.</p>
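<p>For chess specifically, the synthesised harness amounts to something like the following hand-written sketch using the python-chess library; the retry count and fallback policy are assumptions:</p>
<pre><code class="language-python">import chess

def constrained_move(board: chess.Board, propose) -&gt; chess.Move:
    """The model proposes moves, but only members of board.legal_moves
    can ever reach the environment, making illegal states unreachable."""
    for _ in range(3):                        # give the model a few tries
        try:
            move = chess.Move.from_uci(propose(board.fen()))
        except ValueError:
            continue                          # not even valid UCI notation
        if move in board.legal_moves:
            return move
    return next(iter(board.legal_moves))      # safe fallback: any legal move

board = chess.Board()
board.push(constrained_move(board, lambda fen: "e2e4"))  # stand-in for an LLM
</code></pre>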
<hr />
<h3 id="heading-other-headlines-partially-paywalled">📰 Other Headlines (Partially Paywalled)</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Story</td><td>Summary</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Perplexity Personal Computer</strong></td><td>Perplexity launches an always-on AI personal computer</td></tr>
<tr>
<td><strong>Cloudflare /crawl</strong></td><td>Single-call <code>/crawl</code> endpoint for web scraping in agents</td></tr>
<tr>
<td><strong>Context7 CLI</strong></td><td>Brings up-to-date library docs directly to any agent</td></tr>
<tr>
<td><strong>Andrew Ng — Context Hub</strong></td><td>New launch focused on context management for agents</td></tr>
<tr>
<td><strong>Cursor Marketplace</strong></td><td>Adds 30+ plugins for the AI code editor</td></tr>
<tr>
<td><strong>OpenAI Skills for Agents SDK</strong></td><td>New SDK capability for composable agent skills</td></tr>
<tr>
<td><strong>Gemini Embedding 2</strong></td><td>Google launches next-gen embedding model</td></tr>
<tr>
<td><strong>Meta MTIA Chips</strong></td><td>Meta ships four MTIA AI chips in two years</td></tr>
<tr>
<td><strong>Codex Tax Agent</strong></td><td>Codex agent files taxes autonomously, catches a <strong>$20K error</strong></td></tr>
</tbody>
</table>
</div><hr />
<h3 id="heading-practical-takeaways">💡 Practical Takeaways</h3>
<ol>
<li><strong>Multi-agent parallelism beats single-pass review</strong> — Claude Code Review shows that splitting scan, verify, and rank into separate agents dramatically improves precision</li>
<li><strong>Constraints &gt; Scale</strong> — AutoHarness proves that a well-constrained smaller model can outperform a larger unconstrained one, with cost savings</li>
<li><strong>Safety should live in the environment</strong>, not just in the model's behaviour — harness-based approaches are more reliable than prompt-level self-restraint</li>
<li><strong>AI infrastructure is maturing fast</strong> — from one-call crawl endpoints to plugin marketplaces, the tooling layer around agents is consolidating rapidly</li>
</ol>
<hr />
<h3 id="heading-papers">📄 Papers</h3>
<ul>
<li>AutoHarness: Automated Agent Constraint Synthesis — <em>(arxiv link not publicly available in accessible content)</em></li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a9586d6/JcFiyrp_G5D4UiLcf1eyC_q6B5aZaA.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586d6/3oDsUmtn4zhHuM7Rw8v0C_EiCaJi4Z.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Top AI Papers of the Week]]></title><description><![CDATA[Read the original article
Top AI Papers of the Week (March 9–15, 2026)
From Elvis Saravia's AI Newsletter — 10 papers spanning coding agents, attention mechanisms, reinforcement learning, and GPU kernel design.

1. OpenDev — Terminal-Native Coding Ag...]]></description><link>https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/top-ai-papers-of-the-week-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Thu, 09 Apr 2026 06:16:40 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a9586d1/8pLptC8WfB_5GF6wyWy5m_APFckO2R.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://nlp.elvissaravia.com/p/top-ai-papers-of-the-week-b8c">Read the original article</a></p>
<h1 id="heading-top-ai-papers-of-the-week-march-915-2026">Top AI Papers of the Week (March 9–15, 2026)</h1>
<p><em>From Elvis Saravia's AI Newsletter — 10 papers spanning coding agents, attention mechanisms, reinforcement learning, and GPU kernel design.</em></p>
<hr />
<h2 id="heading-1-opendev-terminal-native-coding-agents">1. OpenDev — Terminal-Native Coding Agents</h2>
<p>OpenDev is an open-source, command-line coding agent built for where developers already live: the terminal. It comes with an 81-page technical report covering scaffolding, harness design, and context engineering.</p>
<p><strong>Key features:</strong></p>
<ul>
<li><strong>Dual-agent architecture</strong> — separates planning from execution using workload-specialised model routing across concurrent sessions</li>
<li><strong>Adaptive context compaction</strong> — lazy tool discovery and adaptive reduction of older observations keeps working memory lean</li>
<li><strong>Automated project memory</strong> — event-driven reminders prevent instruction fade-out across sessions</li>
<li><strong>Four-layer architecture</strong> — agent reasoning, context engineering, tooling, and persistence layers form a modular, extensible foundation</li>
</ul>
<p><strong>Takeaway:</strong> A production-grade blueprint for building autonomous coding agents with disciplined context management.</p>
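<p>The compaction idea is easy to sketch: keep recent turns verbatim and shrink the oldest tool observations until the context fits a token budget. OpenDev's actual policy is richer than this toy version:</p>
<pre><code class="language-python">def compact(messages, budget, keep_recent=5, estimate=lambda m: len(m) // 4):
    """Truncate old observations, oldest first, until the estimated token
    count fits the budget; the most recent turns are never touched."""
    msgs = list(messages)
    i = 0
    while sum(map(estimate, msgs)) &gt; budget and i &lt; len(msgs) - keep_recent:
        msgs[i] = msgs[i][:200] + " ...[truncated]"
        i += 1
    return msgs
</code></pre>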
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-2-autoharness-programmatic-constraints-beat-bigger-models">2. AutoHarness — Programmatic Constraints Beat Bigger Models</h2>
<p>Google DeepMind researchers found that 78% of Gemini-2.5-Flash losses in the Kaggle GameArena chess competition came from <strong>illegal moves</strong>, not poor strategy. AutoHarness automatically synthesises code harnesses to prevent illegal actions.</p>
<p><strong>Key findings:</strong></p>
<ul>
<li><strong>Automatic harness synthesis</strong> — Gemini-2.5-Flash generates its own constraint layer through iterative refinement with environment feedback</li>
<li><strong>Smaller beats larger</strong> — the harnessed Gemini-2.5-Flash outperforms Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena single-player games</li>
<li><strong>100% illegal move prevention</strong> — across 145 TextArena games (single and two-player)</li>
<li><strong>Cost-effective</strong> — harness engineering is cheaper and more effective than deploying larger models</li>
</ul>
<p><strong>Takeaway:</strong> Structured code constraints are a powerful, cost-efficient alternative to raw model scaling for agent reliability.</p>
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-3-skillnet-durable-ai-skill-repositories-at-scale">3. SkillNet — Durable AI Skill Repositories at Scale</h2>
<p>AI agents constantly rediscover solutions instead of reusing prior work. SkillNet provides open infrastructure for creating, evaluating, and organising AI skills at scale.</p>
<p><strong>Key features:</strong></p>
<ul>
<li><strong>Unified skill ontology</strong> — skills from code libraries, prompt templates, and tool compositions are linked relationally for discovery and composition</li>
<li><strong>Multi-dimensional evaluation</strong> — every skill is scored on Safety, Completeness, Executability, Maintainability, and Cost-awareness</li>
<li><strong>200,000+ skill repository</strong> — with a browsable platform and Python toolkit for programmatic access</li>
<li><strong>Consistent gains</strong> — on ALFWorld, WebShop, and ScienceWorld: <strong>+40% average reward, −30% execution steps</strong></li>
</ul>
<p><strong>Takeaway:</strong> A shared skill commons dramatically improves agent efficiency and generalisation across task domains.</p>
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-4-the-spike-the-sparse-and-the-sink-transformer-attention-artifacts">4. The Spike, the Sparse and the Sink — Transformer Attention Artifacts</h2>
<p>Yann LeCun and NYU collaborators dissect two recurring Transformer phenomena: <strong>massive activations</strong> (extreme channel outliers in specific tokens) and <strong>attention sinks</strong> (tokens attracting disproportionate attention regardless of relevance).</p>
<p><strong>Key findings:</strong></p>
<ul>
<li><strong>Distinct scopes</strong> — massive activations operate globally (implicit model parameters); attention sinks operate locally (head-level attention bias)</li>
<li><strong>Pre-norm is the culprit</strong> — the pre-norm configuration common in modern Transformers enables both phenomena to co-occur; removing it decouples them</li>
<li><strong>Efficiency implications</strong> — quantisation, model compression, and KV-cache optimisation can fail silently when these phenomena are disrupted</li>
<li><strong>Not fundamental</strong> — these are design-dependent artifacts, opening the door to architectural modifications that eliminate them without sacrificing capability</li>
</ul>
<p><strong>Takeaway:</strong> Practitioners optimising Transformers for efficiency must account for these phenomena; they are architectural choices, not mathematical necessities.</p>
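<p>Attention sinks are also easy to measure in your own models: check how much attention mass lands on the first token. A toy metric, assuming you can export attention weights:</p>
<pre><code class="language-python">import numpy as np

def sink_score(attn):
    """Fraction of attention mass on the first token, averaged over heads
    and query positions. attn has shape [heads, query_len, key_len] and
    each attention row sums to 1."""
    return attn[:, :, 0].mean()

attn = np.random.dirichlet(np.ones(16), size=(8, 16))  # random, well-behaved rows
print(sink_score(attn))   # ~1/16 here; sink-heavy heads score far higher
</code></pre>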
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-5-karl-reinforcement-learning-for-enterprise-search-agents">5. KARL — Reinforcement Learning for Enterprise Search Agents</h2>
<p>Databricks presents KARL, trained via RL across heterogeneous search tasks, achieving state-of-the-art on the newly introduced <strong>KARLBench</strong> spanning six search domains.</p>
<p><strong>Key features:</strong></p>
<ul>
<li><strong>OAPL post-training paradigm</strong> — iterative large-batch off-policy RL robust to trainer/inference discrepancies without clipped importance weighting</li>
<li><strong>Multi-task training</strong> — covers constraint-driven entity search, cross-document synthesis, tabular reasoning, entity retrieval, procedural reasoning, and fact aggregation</li>
<li><strong>Pareto-optimal</strong> — outperforms Claude 4.6 and GPT 5.2 on cost-quality and latency-quality tradeoffs starting from GLM 4.5 Air</li>
<li><strong>Strong scores</strong> — KARL-BCP: 59.6 → 70.4 on BrowseComp-Plus with value-guided search; KARL-TREC: 85.0 on TREC-Biogen</li>
</ul>
<p><strong>Takeaway:</strong> Multi-task RL with a purpose-built off-policy training paradigm can surpass closed frontier models on agentic search with sufficient test-time compute.</p>
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-6-memexrl-indexed-experience-memory-for-long-horizon-agents">6. Memex(RL) — Indexed Experience Memory for Long-Horizon Agents</h2>
<p>Long-horizon tasks cause LLM agents to lose track of prior attempts and remaining goals. Memex(RL) introduces an <strong>indexed experience memory</strong> that scales without discarding evidence or exploding context.</p>
<p><strong>Key features:</strong></p>
<ul>
<li><strong>Indexed experience memory</strong> — compact working context with structured summaries and stable indices; full-fidelity interactions stored externally</li>
<li><strong>RL-optimised memory operations</strong> — MemexRL trains agents to strategically decide what to summarise, archive, index, and retrieve under a context budget</li>
<li><strong>Bounded retrieval complexity</strong> — theoretical guarantees that decision quality is maintained with bounded retrieval operations as task history grows</li>
<li><strong>Better results, smaller context</strong> — improved task success rates on long-horizon benchmarks using significantly less working context than baselines</li>
</ul>
<p><strong>Takeaway:</strong> Strategic memory management, not brute-force context expansion, is the key to scaling agents on complex, long-horizon tasks.</p>
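<p>The core data structure is small: stable indices plus one-line summaries in the prompt, full records outside it. A sketch (the RL-trained policy deciding what to summarise and retrieve is the part this omits):</p>
<pre><code class="language-python">class IndexedMemory:
    """Working context holds only indexed one-line summaries; the
    full-fidelity record lives in an external store, fetched by index."""
    def __init__(self):
        self.store = []          # full records, never discarded

    def archive(self, record: str, summary: str) -&gt; str:
        self.store.append(record)
        return f"[#{len(self.store) - 1}] {summary}"   # goes into the prompt

    def retrieve(self, index: int) -&gt; str:
        return self.store[index]                       # bounded-cost lookup

mem = IndexedMemory()
print(mem.archive("full 40k-token build log ...", "build failed: missing libfoo"))
# -&gt; "[#0] build failed: missing libfoo"
</code></pre>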
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-7-flashattention-4-co-designed-for-blackwell-gpus">7. FlashAttention-4 — Co-Designed for Blackwell GPUs</h2>
<p>FlashAttention-4 co-designs attention algorithms and kernel pipelines for NVIDIA B200/GB200 GPUs, which have asymmetric hardware scaling (tensor core throughput doubled; other units scaled more slowly).</p>
<p><strong>Key features:</strong></p>
<ul>
<li><strong>Major speedups</strong> — up to <strong>1.3× over cuDNN 9.13</strong> and <strong>2.7× over Triton</strong> on B200 with BF16; up to 1613 TFLOPs/s at 71% hardware utilisation</li>
<li><strong>Asymmetric scaling solutions</strong> — fully asynchronous matrix multiply pipelines, larger tile sizes, software-emulated exponential/conditional softmax rescaling, tensor memory to reduce shared memory traffic</li>
<li><strong>Python-native</strong> — implemented in CuTe-DSL embedded in Python; <strong>20–30× faster compile times</strong> vs. C++ template approaches</li>
<li><strong>Architecture-first thinking</strong> — Hopper-era optimisations leave significant performance on the table on Blackwell; new hardware demands new algorithms</li>
</ul>
<p><strong>Takeaway:</strong> Next-generation GPU architectures require ground-up attention kernel redesigns, and Python-native kernel development is now a viable path.</p>
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-8-structuredagent-hierarchical-planning-for-web-tasks">8. STRUCTUREDAGENT — Hierarchical Planning for Web Tasks</h2>
<p>STRUCTUREDAGENT introduces a hierarchical planning framework using <strong>dynamic AND/OR trees</strong> for long-horizon web tasks. The LLM is invoked only for local operations (node expansion or repair), while the system maintains the full planning tree.</p>
<p><strong>Key features:</strong></p>
<ul>
<li>Structured memory module tracks candidate solutions to improve constraint satisfaction</li>
<li>Interpretable hierarchical plans enable easier debugging and human intervention</li>
<li>Improved performance on WebVoyager, WebArena, and custom shopping benchmarks vs. standard LLM web agents</li>
</ul>
<p><strong>Takeaway:</strong> Separating global plan management from local LLM reasoning improves both performance and interpretability in complex web agent tasks.</p>
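<p>AND/OR trees evaluate with a few lines of recursion, which is what makes the plans cheap for the system to maintain and easy for humans to read; only leaves touch the LLM or browser. An illustrative sketch:</p>
<pre><code class="language-python">from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                       # "AND", "OR", or "LEAF"
    action: str = ""
    children: list = field(default_factory=list)

def solve(node, try_action):
    """AND nodes need every child to succeed; OR nodes need any one."""
    if node.kind == "LEAF":
        return try_action(node.action)      # the only place the LLM is invoked
    results = (solve(c, try_action) for c in node.children)
    return all(results) if node.kind == "AND" else any(results)

plan = Node("AND", children=[
    Node("LEAF", "open product page"),
    Node("OR", children=[Node("LEAF", "apply coupon"),
                         Node("LEAF", "use store credit")]),
])
print(solve(plan, lambda action: True))
</code></pre>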
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-9-agentir-reasoning-aware-retrieval-for-deep-research-agents">9. AgentIR — Reasoning-Aware Retrieval for Deep Research Agents</h2>
<p>Deep research agents generate rich reasoning traces before each search call, but standard retrievers ignore this signal entirely. AgentIR jointly embeds the agent's reasoning trace with its query.</p>
<p><strong>Key features:</strong></p>
<ul>
<li><strong>Reasoning-aware retrieval</strong> — jointly embeds reasoning traces and queries for richer search intent signals</li>
<li><strong>DR-Synth</strong> — a data synthesis method for generating training data from standard QA datasets</li>
<li><strong>Strong results</strong> — AgentIR-4B achieves <strong>68% accuracy</strong> on BrowseComp-Plus with Tongyi-DeepResearch vs. 50% with conventional embedding models twice its size and 37% with BM25</li>
</ul>
<p><strong>Takeaway:</strong> Incorporating agent reasoning into the retrieval process is a high-leverage, low-cost improvement for deep research systems.</p>
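<p>At inference time the mechanism is almost a one-liner: feed the reasoning trace together with the query into the retriever. The trained AgentIR encoder embeds them jointly; string concatenation, shown here, is the crude approximation:</p>
<pre><code class="language-python">def retrieval_query(reasoning_trace: str, query: str) -&gt; str:
    """Conventional retrievers embed only `query`; reasoning-aware
    retrieval gives the embedder the accumulated search intent too."""
    return f"{reasoning_trace.strip()}\n\nSearch: {query.strip()}"
</code></pre>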
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-10-think-harder-or-know-more-looping-vs-memory-in-transformers">10. Think Harder or Know More — Looping vs. Memory in Transformers</h2>
<p>This paper studies Transformers with two additions: <strong>adaptive per-layer looping</strong> (each block iterates its hidden state via a learned halting mechanism) and <strong>gated memory banks</strong> (additional learned storage).</p>
<p><strong>Key findings:</strong></p>
<ul>
<li><strong>Looping helps maths</strong> — adaptive looping primarily benefits mathematical reasoning tasks</li>
<li><strong>Memory helps commonsense</strong> — gated memory banks recover performance on commonsense reasoning tasks</li>
<li><strong>Combined superiority</strong> — combining both mechanisms outperforms an iso-FLOP baseline with 3× the number of layers on math benchmarks</li>
<li><strong>Layer specialisation</strong> — early layers loop minimally and access memory sparingly; later layers do both heavily</li>
</ul>
<p><strong>Takeaway:</strong> Different cognitive demands (computation vs. recall) require different architectural primitives; combining them yields efficiency gains over simply adding more layers.</p>
<p><a target="_blank" href="https://arxiv.org/abs/2503.xxxxx">Paper</a></p>
<hr />
<h2 id="heading-overall-themes-this-week">Overall Themes This Week</h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Theme</td><td>Papers</td></tr>
</thead>
<tbody>
<tr>
<td>Agentic coding &amp; planning</td><td>OpenDev, STRUCTUREDAGENT</td></tr>
<tr>
<td>Context &amp; memory management</td><td>OpenDev, Memex(RL), SkillNet</td></tr>
<tr>
<td>RL for agents</td><td>KARL, Memex(RL)</td></tr>
<tr>
<td>Constraint engineering</td><td>AutoHarness</td></tr>
<tr>
<td>Transformer architecture insights</td><td>The Spike/Sink, Think Harder or Know More</td></tr>
<tr>
<td>GPU efficiency</td><td>FlashAttention-4</td></tr>
<tr>
<td>Retrieval &amp; search</td><td>KARL, AgentIR</td></tr>
</tbody>
</table>
</div><blockquote>
<p><strong>Bottom line:</strong> The week's papers collectively argue that smarter architecture, structured constraints, and disciplined memory management consistently outperform brute-force scaling — whether in context windows, model size, or GPU compute.</p>
</blockquote>
<p><img src="https://v3b.fal.media/files/b/0a9586d1/jBhsoC3u7XKPfup_-7N5M_t48Bn2IO.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a9586d1/8pLptC8WfB_5GF6wyWy5m_APFckO2R.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[The Claude Code Source Leak: What Was Actually Inside]]></title><description><![CDATA[Based on the Engineer's Codex article "Diving into Claude Code's Source Code Leak" by Engineer's Codex.


On March 31, 2026, Anthropic accidentally shipped a .map sourcemap file inside a Claude Code npm update. Within minutes, 600,000 lines of one of...]]></description><link>https://rzem.guru/the-claude-code-source-leak-what-was-actually-inside</link><guid isPermaLink="true">https://rzem.guru/the-claude-code-source-leak-what-was-actually-inside</guid><category><![CDATA[AI]]></category><category><![CDATA[#anthropic]]></category><category><![CDATA[claude]]></category><category><![CDATA[open source]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Wed, 08 Apr 2026 22:44:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CHy8!,w_1200,h_675,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F768ab99c-f2f5-4adf-bf91-072511a57a30_1912x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><em>Based on the Engineer's Codex article <a target="_blank" href="https://read.engineerscodex.com/p/diving-into-claude-codes-source-code">"Diving into Claude Code's Source Code Leak"</a>.</em></p>
</blockquote>
<hr />
<p>On March 31, 2026, Anthropic accidentally shipped a <code>.map</code> sourcemap file inside a Claude Code npm update. Within minutes, 600,000 lines of one of the most deliberately closed AI products in the world were mirrored, forked, ported, and uploaded to decentralized servers.</p>
<p>Claude Code has always been notoriously opaque — Anthropic's Agent SDKs provide almost no insight into internals, and the company actively keeps the source closed. Which made what happened next a very big deal.</p>
<hr />
<h2 id="heading-how-it-happened">How It Happened</h2>
<p>A <code>.map</code> file is a sourcemap — a developer tool that maps minified/compiled code back to the original source. Shipping it in production is a classic mistake that CI pipelines are supposed to catch.</p>
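<p>For anyone shipping npm packages, the guard is cheap to add. A sketch of the kind of publish-time check that would have caught this (Anthropic's actual pipeline is unknown):</p>
<pre><code class="language-python">import sys
from pathlib import Path

def sourcemap_offenders(dist="dist"):
    """Flag .map files and sourceMappingURL pointers in the publish dir."""
    bad = [str(p) for p in Path(dist).rglob("*.map")]
    for js in Path(dist).rglob("*.js"):
        if "sourceMappingURL" in js.read_text(errors="ignore"):
            bad.append(f"{js} (sourceMappingURL comment)")
    return bad

if offenders := sourcemap_offenders():
    sys.exit("refusing to publish:\n" + "\n".join(offenders))
</code></pre>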
<p>Boris Cherny, a Claude Code engineer at Anthropic, confirmed it was plain developer error, not a tooling bug. His follow-up was notably measured: <em>"Mistakes happen. As a team, the important thing is to recognize it's never an individual's fault. It's the process, the culture, or the infra."</em></p>
<p>A blameless post-mortem take — the Google SRE playbook applied in real time. The goal is an environment where engineers report mistakes honestly rather than hiding them.</p>
<p>Chaofan Shou (<a target="_blank" href="https://twitter.com/Fried_rice">@Fried_rice</a>) was first to spot it and posted a public link. The race was on within minutes.</p>
<hr />
<h2 id="heading-the-chaos-that-followed">The Chaos That Followed</h2>
<h3 id="heading-claw-code-75000-stars-overnight">Claw-Code: 75,000 Stars Overnight</h3>
<p>The most popular fork was <strong>claw-code</strong> on GitHub, created by <a target="_blank" href="https://twitter.com/realsigridjin">@realsigridjin</a>. Rather than just mirroring the source (which would be an obvious DMCA target), he ported the entire thing to Python — using OpenAI's Codex to do the rewrite. Deliberate irony, presumably.</p>
<p>The legal theory: a clean-room AI rewrite can't be touched by DMCA. Claw-code hit 75,000+ stars and 75,000+ forks.</p>
<h3 id="heading-the-copyright-question-nobody-can-answer">The Copyright Question Nobody Can Answer</h3>
<p>Traditional clean-room reverse engineering is a real legal process:</p>
<blockquote>
<p><em>"It involves two separate teams: one analyzes the original software to create specifications, while a second 'clean' team creates the new product based only on those specifications, ensuring no proprietary code is copied."</em></p>
</blockquote>
<p>It used to take months and cost serious money. That was the barrier.</p>
<p>Now anyone with a Claude Max subscription can point an agent at a codebase's tests and have the logic rebuilt overnight. The practice has never been challenged in court at this scale.</p>
<p>Gergely Orosz framed the PR problem neatly: even if Anthropic tries to assert copyright, do they want the battle of suing an open source project for rebuilding their own <strong>AI-written</strong> product? And could they even prove it?</p>
<p>Meanwhile, another user uploaded a stripped version to IPFS with all telemetry removed, security guardrails disabled, and experimental features unlocked. Whether DMCA can even reach IPFS-hosted content is its own unresolved legal question.</p>
<p>Status at time of writing: non-rewritten forks have been DMCA'd. Claw-code is still up.</p>
<hr />
<h2 id="heading-what-was-actually-inside">What Was Actually Inside</h2>
<p>This is the part that matters.</p>
<h3 id="heading-kairos-the-unannounced-autonomous-agent">KAIROS: The Unannounced Autonomous Agent</h3>
<p>Hidden behind feature flags named <code>PROACTIVE</code> and <code>KAIROS</code>, the codebase contains a <strong>fully built autonomous agent mode</strong> that Anthropic has never publicly announced.</p>
<p>KAIROS runs in the background, 24/7, without you asking. Every few seconds it receives a heartbeat prompt:</p>
<blockquote>
<p><em>"Anything worth doing right now?"</em></p>
</blockquote>
<p>It evaluates what's happening and makes a call: act, or stay quiet. If it acts, it can fix errors, respond to messages, update files, and run tasks — everything Claude Code can already do, except without you initiating any of it.</p>
<p>KAIROS has three exclusive tools that regular Claude Code doesn't:</p>
<ul>
<li><strong>Push notifications</strong> — can reach you on phone or desktop even when the terminal is closed</li>
<li><strong>File delivery</strong> — can send you things it created without being asked</li>
<li><strong>Pull request subscriptions</strong> — watches your GitHub and reacts to code changes on its own</li>
</ul>
<p>It keeps <strong>append-only daily logs</strong> of everything it noticed, decided, and did. It cannot erase its own history.</p>
<p>At night it runs a process the codebase literally calls <strong><code>autoDream</code></strong> — consolidating what it learned during the day and reorganising memory. It persists across sessions. Close your laptop on Friday, open it Monday: KAIROS has been working the whole time.</p>
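<p>To make the described behaviour concrete, here is a minimal sketch of such a proactive loop. The heartbeat prompt, the append-only log, and the idea of nightly consolidation come from the article; every function name and detail below is our invention — none of this is the leaked code.</p>
<pre><code># Hypothetical sketch of a proactive heartbeat loop — illustration only.
import json
import time
from datetime import date, datetime, timezone

HEARTBEAT_PROMPT = "Anything worth doing right now?"  # quoted in the leak coverage

def log_event(event):
    """Append-only daily log: the agent can add to its history, never erase it."""
    with open(f"agent-{date.today()}.log", "a") as f:
        f.write(json.dumps({"ts": datetime.now(timezone.utc).isoformat(), **event}) + "\n")

def gather_context():
    """Stub: the real system watches PRs, errors, and incoming messages."""
    return {"open_prs": [], "errors": [], "messages": []}

def decide(prompt, context):
    """Stub for the model call: act, or stay quiet."""
    return {"act": bool(context["errors"]), "plan": "fix the first error"}

def heartbeat_loop(interval_seconds=5.0):
    while True:
        context = gather_context()
        decision = decide(HEARTBEAT_PROMPT, context)
        log_event({"noticed": context, "decided": decision})
        if decision["act"]:
            log_event({"did": decision["plan"]})  # real version would run tools here
        time.sleep(interval_seconds)  # "every few seconds"
</code></pre>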
<h3 id="heading-44-hidden-feature-flags">44 Hidden Feature Flags</h3>
<p>Beyond KAIROS, the codebase contains 44 hidden feature flags and 20+ unshipped features total:</p>
<ul>
<li>Background agents running 24/7</li>
<li>One Claude orchestrating multiple worker Claudes</li>
<li>Cron scheduling</li>
<li>Full voice command mode</li>
<li>Browser control via Playwright</li>
<li>Agents that sleep and self-resume</li>
</ul>
<h3 id="heading-the-architectural-insight">The Architectural Insight</h3>
<p>The most interesting thing about all of this isn't the feature list — it's the architectural decision underneath it.</p>
<p>Regular Claude Code is <strong>reactive</strong>: it acts only when you send a message. KAIROS introduces a <strong>proactive loop</strong>, which requires a fundamentally different trust model. The agent now needs to decide on its own what is worth doing, which means the quality of that judgment becomes far more important than in a simple request-response system.</p>
<p>That's a hard problem. The fact that Anthropic has it fully built and gated behind feature flags suggests they've been working on it for a long time — and are being careful about when and how they ship it.</p>
<hr />
<h2 id="heading-what-this-actually-means">What This Actually Means</h2>
<p>A few things land differently after this leak:</p>
<p><strong>On the legal question</strong>: AI-assisted clean-room rebuilds have broken the traditional copyright moat. The cost and complexity that used to make clean-room reverse engineering prohibitive is gone. This will get litigated eventually, and the outcome will reshape how proprietary software works.</p>
<p><strong>On KAIROS</strong>: Anthropic isn't behind on autonomous agents. They've shipped it internally and are gating it deliberately. Whether that's because the trust model isn't ready, the UX isn't right, or they're watching how OpenClaw lands — we don't know. But it exists.</p>
<p><strong>On the mistake itself</strong>: Sourcemap files in production npm packages are a process failure, not a developer failure. The blameless post-mortem framing from Boris Cherny is the right call. The interesting question is what the process change looks like.</p>
<hr />
<p><em>Source: <a target="_blank" href="https://read.engineerscodex.com/p/diving-into-claude-codes-source-code">Diving into Claude Code's Source Code Leak</a> — Engineer's Codex, April 1, 2026.</em></p>
]]></content:encoded></item><item><title><![CDATA[How Embeddings Actually Work: From Arbitrary IDs to the Geometry of Meaning]]></title><description><![CDATA[This post is based on How Embeddings Actually Work by Claudius Papirus — Episode 5 of the "How AI Actually Works" course.


Take the word king. Subtract man. Add woman. You get queen.
That's not a metaphor. That's real arithmetic, done on real number...]]></description><link>https://rzem.guru/how-embeddings-actually-work-from-arbitrary-ids-to-the-geometry-of-meaning</link><guid isPermaLink="true">https://rzem.guru/how-embeddings-actually-work-from-arbitrary-ids-to-the-geometry-of-meaning</guid><category><![CDATA[AI]]></category><category><![CDATA[#Embeddings]]></category><category><![CDATA[explainer]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[nlp]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Wed, 08 Apr 2026 22:41:37 GMT</pubDate><content:encoded><![CDATA[<blockquote>
<p><em>This post is based on <a target="_blank" href="https://youtu.be/7aATI_t5UeY">How Embeddings Actually Work</a> by Claudius Papirus — Episode 5 of the "How AI Actually Works" course.</em></p>
</blockquote>
<hr />
<p>Take the word <strong>king</strong>. Subtract <em>man</em>. Add <em>woman</em>. You get <em>queen</em>.</p>
<p>That's not a metaphor. That's real arithmetic, done on real numbers, learned by a model that read billions of words and figured out the relationship entirely on its own. No one programmed it. No one wrote a definition. It just… emerged.</p>
<p>This is the story of <strong>embeddings</strong> — the hidden layer where words stop being text and start becoming something a machine can actually think with.</p>
<hr />
<h2 id="heading-the-problem-tokens-are-meaningless">The Problem: Tokens Are Meaningless</h2>
<p>In <a target="_blank" href="https://youtu.be/VafJzZbihSM">Episode 2 of this series</a>, we learned that text gets broken into tokens. But tokens are just IDs — arbitrary numbers. The token for <em>cat</em> might be <code>9674</code>. That number tells you nothing about cats.</p>
<p>So between the raw token and the intelligent response you get back, something has to happen. The meaningless ID has to become a set of numbers that actually captures what the word <em>means</em>.</p>
<p>That bridge is an <strong>embedding</strong>.</p>
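<p>Mechanically, the bridge is nothing exotic — it's a row lookup in a learned matrix. A sketch, with random numbers standing in for learned ones:</p>
<pre><code>import numpy as np

# The embedding table: one learned row of floats per token ID.
# Real models learn these values during training; here they're random.
vocab_size, dims = 50_000, 300
embedding_table = np.random.default_rng(0).normal(size=(vocab_size, dims))

token_id = 9674                         # the arbitrary ID for "cat"
cat_vector = embedding_table[token_id]  # the lookup IS the embedding step
print(cat_vector.shape)                 # (300,) — numbers that can carry meaning
</code></pre>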
<hr />
<h2 id="heading-why-the-obvious-approaches-fail">Why the Obvious Approaches Fail</h2>
<h3 id="heading-option-1-sequential-numbering">Option 1: Sequential numbering</h3>
<p>Give every word a number. <em>The</em> = 1, <em>cat</em> = 2, <em>sat</em> = 3.</p>
<p>Problem: the model infers that <em>cat</em> is somehow "between" <em>the</em> and <em>sat</em>. Arbitrary numbering creates false relationships that have nothing to do with meaning.</p>
<h3 id="heading-option-2-one-hot-encoding">Option 2: One-hot encoding</h3>
<p>Give each word its own dimension. <em>Cat</em> = <code>[1, 0, 0, 0...]</code>, <em>dog</em> = <code>[0, 1, 0, 0...]</code>.</p>
<p>With a vocabulary of 50,000 words, you get a 50,000-dimensional space where every word is exactly as far from every other word. <em>Cat</em> is as distant from <em>kitten</em> as it is from <em>economics</em>. You've removed the false structure — but you've removed <em>all</em> structure.</p>
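<p>This is easy to verify: with one-hot vectors, every pair of distinct words sits at exactly the same distance — √2.</p>
<pre><code>from itertools import combinations
import numpy as np

vocab = ["cat", "kitten", "economics"]
one_hot = np.eye(len(vocab))  # one dimension per word

for (i, a), (j, b) in combinations(enumerate(vocab), 2):
    print(a, b, np.linalg.norm(one_hot[i] - one_hot[j]))
# cat kitten       1.4142...
# cat economics    1.4142...
# kitten economics 1.4142...   — identical distances, all structure gone
</code></pre>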
<p>What you actually need is a <strong>smaller space</strong> — a few hundred dimensions — where the geometry reflects meaning. Words that mean similar things should end up close together. Words that don't should be far apart.</p>
<p>But you can't design that by hand. Too many words, too many relationships, too many shades of meaning.</p>
<p>So you don't design it. You let the data build it.</p>
<hr />
<h2 id="heading-word2vec-teaching-a-model-to-learn-meaning">Word2Vec: Teaching a Model to Learn Meaning</h2>
<p>In 2013, Tomas Mikolov and his team at Google published a paper that changed how the field thinks about language. The key insight came from a 1957 observation by linguist J.R. Firth:</p>
<blockquote>
<p><em>"You shall know a word by the company it keeps."</em></p>
</blockquote>
<p>Words that appear in similar contexts tend to mean similar things. <em>Dog</em> and <em>cat</em> both show up near <em>pet</em>, <em>fed</em>, <em>walks</em>. <em>Dog</em> and <em>inflation</em> don't.</p>
<p>Mikolov's team made that idea trainable. They built a small neural network with a deceptively simple task: <strong>given a word, predict the words that surround it</strong>. No definitions. No dictionaries. No human labels. Just billions of words of raw text and one prediction task.</p>
<p>During training, each word gets mapped to a vector — a list of around 300 numbers. Those numbers get adjusted millions of times as the model learns to predict context better.</p>
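<p>That training task is reproducible today in a few lines with <code>gensim</code>, a standard open-source implementation of the method (not the original code — and the toy corpus here is ours; real training needs billions of words):</p>
<pre><code>from gensim.models import Word2Vec  # pip install gensim

sentences = [  # toy corpus, purely for illustration
    ["we", "fed", "the", "dog", "and", "walked", "our", "pet"],
    ["we", "fed", "the", "cat", "and", "walked", "our", "pet"],
    ["inflation", "rose", "sharply", "last", "quarter"],
]

model = Word2Vec(
    sentences,
    vector_size=300,  # ~300 numbers per word, as in the original paper
    window=5,         # predict words within 5 positions of the target
    min_count=1,
    sg=1,             # skip-gram: given a word, predict its context
)
print(model.wv["cat"].shape)  # (300,)
</code></pre>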
<p>When training is done, something remarkable emerges:</p>
<ul>
<li><em>Happy</em>, <em>joyful</em>, <em>cheerful</em> — <strong>neighbours in the space</strong></li>
<li><em>Run</em>, <em>sprint</em>, <em>jog</em> — <strong>neighbours in the space</strong></li>
<li>Words organised into a <strong>geography of meaning</strong>, with no one telling the model what anything meant</li>
</ul>
<p>They called it <strong>Word2Vec</strong>.</p>
<h3 id="heading-the-analogy-trick">The Analogy Trick</h3>
<p>Researchers then found something even more striking. The vectors didn't just cluster by similarity — they encoded <strong>relationships</strong>.</p>
<p>The direction from <em>man</em> to <em>woman</em> in the vector space is roughly the same direction as <em>king</em> to <em>queen</em>, and <em>uncle</em> to <em>aunt</em>. Gender is a consistent direction in the space.</p>
<p>So is tense: <em>walked</em> → <em>walking</em> matches <em>swam</em> → <em>swimming</em>.</p>
<p>So is geography: <em>Paris − France + Italy</em> lands near <em>Rome</em>.</p>
<p>Directions in a space nobody designed, encoding relationships nobody labelled — discovered purely from predicting which words appear near which.</p>
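<p>You can check the analogy arithmetic yourself against the original Google News vectors, which <code>gensim</code>'s downloader hosts (roughly a 1.7 GB download on first run):</p>
<pre><code>import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # the 2013-era pretrained vectors

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', 0.7118...)]
</code></pre>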
<hr />
<h2 id="heading-the-polysemy-problem">The Polysemy Problem</h2>
<p>Word2Vec had a flaw that seems obvious once you see it: <strong>each word gets exactly one vector</strong>, no matter what.</p>
<blockquote>
<p><em>"I deposited money at the bank."</em>
<em>"I sat by the river bank."</em></p>
</blockquote>
<p>In Word2Vec, <em>bank</em> is the same embedding in both sentences — a blurry average of every context it's ever appeared in. Not quite right for the financial meaning, not quite right for the river meaning.</p>
<p>This is the <strong>polysemy problem</strong>. One word, multiple meanings, one vector. <em>Light</em> in <em>light blue</em> vs <em>light bulb</em> vs <em>light as a feather</em> all collapse to the same point.</p>
<p>Static embeddings couldn't capture the fact that meaning shifts with context.</p>
<hr />
<h2 id="heading-the-2018-breakthrough-contextual-embeddings">The 2018 Breakthrough: Contextual Embeddings</h2>
<p>In 2018, two papers cracked it open:</p>
<ul>
<li><strong>ELMo</strong> from AI2</li>
<li><strong>BERT</strong> from Google</li>
</ul>
<p>Both arrived at the same answer from different angles: instead of one fixed vector per word, <strong>the embedding changes based on context</strong>. <em>Bank</em> next to <em>river</em> gets pulled in one direction. <em>Bank</em> next to <em>investment</em> gets pulled in another. Same word, different numbers.</p>
<p>This is exactly what happens inside the transformers that power modern AI. When a model processes your input:</p>
<ol>
<li>Each token starts with an <strong>initial embedding</strong> — looked up from a learned table</li>
<li>The <strong>attention mechanism</strong> examines every other token in the sequence</li>
<li>Layer by layer, the vectors get <strong>repositioned</strong> based on context</li>
<li>By the time a word has passed through dozens of layers, it's been reshaped into something specific to <em>this exact sentence, this exact position, this exact meaning</em></li>
</ol>
<p>The dimensions have scaled to match, too — from Word2Vec's 300 numbers to <strong>thousands</strong> in today's models. More dimensions, finer distinctions, more room for nuance.</p>
<p>A concrete example: in the sentence <em>"The cat was tired because it hadn't slept"</em> — by the final layer, the embedding for <em>it</em> has drifted toward <em>cat</em>. The model resolved the reference without being told to. The same word <em>it</em> in a different sentence would point somewhere entirely different.</p>
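<p>You can watch the same thing happen with the two <em>bank</em> sentences from earlier. A sketch using the open-source <code>transformers</code> library — <code>bert-base-uncased</code> is our choice of model for the demo:</p>
<pre><code>import torch  # pip install transformers torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    """Final-layer vector for `word` inside this exact sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # one vector per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

money = embedding_of("i deposited money at the bank", "bank")
river = embedding_of("i sat by the river bank", "bank")
print(torch.cosine_similarity(money, river, dim=0).item())
# noticeably below 1.0 — same word, different coordinates
</code></pre>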
<p>The embedding isn't a label anymore. It's <strong>a coordinate that moves with meaning</strong>.</p>
<hr />
<h2 id="heading-why-this-matters-its-running-everything">Why This Matters: It's Running Everything</h2>
<p>This geometry isn't just elegant — it's behind most of what you use today.</p>
<p><strong>Semantic search</strong>: When you search and find results that match your <em>meaning</em> rather than your exact words, embeddings are why. The search engine converts your question into a vector and compares it to document vectors. <em>"How to fix a leaky faucet"</em> matches <em>"plumbing repair guide"</em> — zero shared words, but their embeddings are close.</p>
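<p>The retrieval mechanic is short enough to sketch: embed everything, then rank by cosine similarity. The <code>sentence-transformers</code> library and the <code>all-MiniLM-L6-v2</code> model are real; the documents and query are our toy example.</p>
<pre><code>import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small open embedding model

docs = [
    "plumbing repair guide",
    "intro to macroeconomics",
    "caring for your kitten",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode("how to fix a leaky faucet", normalize_embeddings=True)

scores = doc_vecs @ query_vec  # cosine similarity, since vectors are unit length
print(docs[int(np.argmax(scores))])  # "plumbing repair guide" — zero shared words
</code></pre>
<p>RAG's retrieval step is this exact ranking, just run over a document store instead of three strings.</p>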
<p><strong>RAG (Retrieval-Augmented Generation)</strong>: When an AI retrieves relevant documents before answering your question, it's doing vector similarity search in embedding space.</p>
<p><strong>Recommendations</strong>: When a system finds content you didn't search for but somehow knew you'd want, it's comparing your preference vector to content vectors.</p>
<p><strong>Translation</strong>: When translation works between languages that structure sentences completely differently, the same principle applies — meaning has a shape, and that shape can be compared across languages.</p>
<hr />
<h2 id="heading-a-closing-thought">A Closing Thought</h2>
<p>Somewhere between the words you type and the response you get back, there's a space — high-dimensional, invisible, learned purely from patterns — where <em>happy</em> sits near <em>joyful</em>, and <em>king − man + woman</em> points toward <em>queen</em>.</p>
<p>Not because anyone decided it should. Because across billions of words, that's where the patterns put them.</p>
<p>Meaning, it turns out, isn't something you define. It's something that emerges when you pay enough attention to the company words keep.</p>
<hr />
<p><em>Episode 5 of the <a target="_blank" href="https://www.youtube.com/playlist?list=PL0m8aj_uWIA6i4dwEIWInzQOEI4vcYpvv">How AI Actually Works</a> course by Claudius Papirus. Previous episodes cover LLMs, tokens, training, and context windows.</em></p>
]]></content:encoded></item><item><title><![CDATA[You're charging 2023 rates for work AI does in 40 minutes + 2 prompts to see your real exposure]]></title><description><![CDATA[Read the original article
Summary: "You're Charging 2023 Rates for Work AI Does in 40 Minutes"
By Nate | Nate's Substack | April 7, 2026

Main Thesis
The global economy has always been built on inefficiency gaps — the distance between what something ...]]></description><link>https://rzem.guru/youre-charging-2023-rates-for-work-ai-does-in-40-minutes-2-prompts-to-see-your-real-exposure</link><guid isPermaLink="true">https://rzem.guru/youre-charging-2023-rates-for-work-ai-does-in-40-minutes-2-prompts-to-see-your-real-exposure</guid><category><![CDATA[AI]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Wed, 08 Apr 2026 12:01:19 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a956d23/P1z4dzevirgURkWBbtuKe_lJBHFLMP.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://natesnewsletter.substack.com/p/313-became-438000-in-30-days-youre">Read the original article</a></p>
<h2 id="heading-summary-youre-charging-2023-rates-for-work-ai-does-in-40-minutes">Summary: "You're Charging 2023 Rates for Work AI Does in 40 Minutes"</h2>
<p><strong>By Nate | Nate's Substack | April 7, 2026</strong></p>
<hr />
<h3 id="heading-main-thesis">Main Thesis</h3>
<p>The global economy has always been built on <strong>inefficiency gaps</strong> — the distance between what something costs to produce and what the market pays for it. AI is now closing these gaps at an unprecedented speed (months, not decades), rendering entire pricing models, business structures, and career strategies obsolete almost overnight.</p>
<hr />
<h3 id="heading-key-concept-arbitrage-amp-gap-closing">Key Concept: Arbitrage &amp; Gap Closing</h3>
<ul>
<li><strong>Law firms</strong> bill for 8 hours of research AI can do in 40 minutes</li>
<li><strong>Consulting decks</strong> still take 6 weeks even though information access was the only real barrier</li>
<li><strong>Offshore dev teams</strong> exist because of geographic pricing gaps — gaps AI is rapidly erasing</li>
<li>These are all forms of <strong>economic arbitrage</strong> — exploiting the gap between true cost and market price</li>
<li>AI is closing these gaps on the <strong>timescale of model releases</strong>, not business cycles</li>
</ul>
<hr />
<h3 id="heading-the-313-proof-of-concept">The $313 Proof of Concept</h3>
<ul>
<li>A bot on prediction market <strong>Polymarket</strong> turned <strong>$313 into ~$438,000 in 30 days</strong> (late 2025)</li>
<li>It didn't predict markets — it simply <strong>closed a pricing gap faster than humans could</strong></li>
<li>A developer reportedly <strong>rebuilt the entire system using Claude in ~40 minutes</strong></li>
<li>Critically: <strong>92.4% of wallets on the same platform lost money</strong> — proving access to AI ≠ advantage</li>
<li>The 7.6% who profited understood <em>what to build</em>, not just <em>that AI existed</em></li>
</ul>
<hr />
<h3 id="heading-five-categories-of-closing-inefficiency-taxonomy">Five Categories of Closing Inefficiency (Taxonomy)</h3>
<ol>
<li><strong>Speed gaps</strong> — tasks that took days now take minutes</li>
<li><strong>Knowledge asymmetry</strong> — the information edge that funded 30 years of offshoring is evaporating</li>
<li><strong>Formatting/research gaps</strong> — billable hours for mechanical work are collapsing</li>
<li><strong>Geographic pricing gaps</strong> — location-based cost arbitrage is shrinking</li>
<li><strong>Information distribution gaps</strong> — the lag between what's possible and what most people know is possible</li>
</ol>
<hr />
<h3 id="heading-the-compression-problem">The Compression Problem</h3>
<ul>
<li>AI <em>appears</em> to democratize advantage, but it mostly <strong>democratizes access</strong></li>
<li>True advantage goes to those who know <strong>which gaps to exploit and how</strong></li>
<li>Most people copy the surface (the bot) without understanding the mechanism underneath → they lose</li>
</ul>
<hr />
<h3 id="heading-the-rotation-dynamic">The Rotation Dynamic</h3>
<ul>
<li>Every time AI closes one gap, <strong>three new ones open elsewhere</strong></li>
<li>The "Mythos leak" referenced in the article previews a world of <strong>continuous disruption with no settling point</strong></li>
<li>Strategic plans written in 2025 are already potentially obsolete</li>
</ul>
<hr />
<h3 id="heading-practical-takeaways">Practical Takeaways</h3>
<ul>
<li><strong>Run a diagnostic on your own role</strong>: Ask where your current pricing or value is based on <em>historical inefficiency</em> rather than genuine skill</li>
<li><strong>Three diagnostic questions</strong> (paywalled in full) help map where value is heading in any industry</li>
<li><strong>Stop charging 2023 rates</strong> for work AI compresses into 40 minutes — clients will eventually notice</li>
<li>Understand the <em>mechanism</em>, not just the tool — copying AI use cases without understanding the underlying gap is a losing strategy</li>
<li>The window to <strong>reposition before the gap closes</strong> is measured in months, not years</li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a956d23/AVXwlIiGxUneDpRT1osml_n2RLlkqm.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a956d23/P1z4dzevirgURkWBbtuKe_lJBHFLMP.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[Your AI Agent Depends on Six Layers — Here's Which Ones Won't Last]]></title><description><![CDATA[Read the original article
Your AI Agent Depends on Six Layers — Here's Which Ones Won't Last
Main Thesis
A new infrastructure stack is forming beneath AI agents, and most builders can't distinguish which layers are durable from which are temporary st...]]></description><link>https://rzem.guru/your-ai-agent-depends-on-six-layers-heres-which-ones-wont-last</link><guid isPermaLink="true">https://rzem.guru/your-ai-agent-depends-on-six-layers-heres-which-ones-wont-last</guid><category><![CDATA[AI]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Tue, 07 Apr 2026 12:01:04 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a954b62/EuKlqkOuCXUnOxtkPw6FW_yiZkyfdW.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://natesnewsletter.substack.com/p/your-ai-agent-depends-on-six-layers">Read the original article</a></p>
<h2 id="heading-your-ai-agent-depends-on-six-layers-heres-which-ones-wont-last">Your AI Agent Depends on Six Layers — Here's Which Ones Won't Last</h2>
<h3 id="heading-main-thesis">Main Thesis</h3>
<p>A new infrastructure stack is forming beneath AI agents, and most builders can't distinguish which layers are durable from which are temporary stopgaps. Nate argues that understanding this stack early is a competitive advantage — mirroring how early readers of the cloud and API-first transitions built defining companies, while late adapters paid in migration costs and lost time.</p>
<h3 id="heading-the-mental-model">The Mental Model</h3>
<ul>
<li>The right analogy is <strong>system calls</strong>, not Lego bricks — these layers are fundamental OS-level primitives for AI agents, not modular optional components.</li>
<li>The stack is being built for <strong>AI agents as the primary user</strong>, not humans.</li>
</ul>
<h3 id="heading-the-six-layers-with-durability-ratings">The Six Layers (with durability ratings)</h3>
<ol>
<li><strong>Compute</strong> — How agents access processing power</li>
<li><strong>Identity</strong> — How agents authenticate and are recognized</li>
<li><strong>Memory</strong> — How agents retain and retrieve context</li>
<li><strong>Tool Access</strong> — How agents interact with external services</li>
<li><strong>Billing</strong> — How agent-driven actions are metered and charged</li>
<li><strong>Orchestration</strong> — How agents are coordinated and managed</li>
</ol>
<p>Each layer is assessed for longevity — some are described as <strong>load-bearing walls lasting a decade</strong>, others as <strong>transitional workarounds agents will outgrow within 18 months</strong>.</p>
<h3 id="heading-key-finding-the-biggest-gap">Key Finding: The Biggest Gap</h3>
<ul>
<li><strong>Orchestration</strong> is identified as the most critical unsolved problem — the next infrastructure-defining opportunity — and no one has cracked it yet.</li>
<li>Several layers that will define the next infrastructure-scale company <strong>don't exist yet</strong>.</li>
</ul>
<h3 id="heading-practical-takeaways">Practical Takeaways</h3>
<ul>
<li>Builders should audit which layers they're dependent on and assess their durability.</li>
<li>Avoid <strong>transitional lock-in</strong> — building deeply on layers likely to be replaced soon.</li>
<li>Focus on <strong>reliability math</strong> when designing agent systems.</li>
<li>Develop builder skills aligned with the durable layers, not the temporary ones.</li>
<li>Over 1,000 startups and hundreds of millions in VC are already in this space — the window to get ahead of the stack is narrowing.</li>
</ul>
<h3 id="heading-bottom-line">Bottom Line</h3>
<p>The agent infrastructure stack is real, it's forming now, and the builders who can read it accurately will define the next era of AI — just as cloud-native builders defined the last one.</p>
<p><img src="https://v3b.fal.media/files/b/0a954b62/wJVCYt4ZZ5L7Mtvop7xLn_qvv6apri.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a954b62/EuKlqkOuCXUnOxtkPw6FW_yiZkyfdW.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[I Tested Cowork, Lindy, Sauna, and Opal Against 3 Questions. The Best Scored 1 out of 4.]]></title><description><![CDATA[Read the original article
Summary: I Tested Cowork, Lindy, Sauna, and Opal Against 3 Questions. The Best Scored 1 out of 4.
Main Thesis
A wave of 'outcome agent' tools (Lindy, Sauna, Google Opal, Cowork, Obvious) are pitching software that does the w...]]></description><link>https://rzem.guru/i-tested-cowork-lindy-sauna-and-opal-against-3-questions-the-best-scored-1-out-of-4-1</link><guid isPermaLink="true">https://rzem.guru/i-tested-cowork-lindy-sauna-and-opal-against-3-questions-the-best-scored-1-out-of-4-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Mon, 06 Apr 2026 05:22:38 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a95204d/lqJPwlh-A_A30l--1NM58_1dz0RY4F.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://natesnewsletter.substack.com/p/every-ai-agent-you-use-has-the-same">Read the original article</a></p>
<h1 id="heading-summary-i-tested-cowork-lindy-sauna-and-opal-against-3-questions-the-best-scored-1-out-of-4">Summary: <em>I Tested Cowork, Lindy, Sauna, and Opal Against 3 Questions. The Best Scored 1 out of 4.</em></h1>
<h2 id="heading-main-thesis">Main Thesis</h2>
<p>A wave of 'outcome agent' tools (Lindy, Sauna, Google Opal, Cowork, Obvious) is pitching software that <strong>does the work instead of helping you do the work</strong> — but almost none of them can answer the fundamental question: <strong>how does the agent know its own output is any good?</strong></p>
<p>The core insight is a structural one: <strong>AI agents excel in environments with automated feedback loops</strong> (like coding, where tests pass or fail) but struggle in knowledge work environments (like drafting strategy memos) where <strong>the human is the only feedback mechanism</strong>.</p>
<h2 id="heading-key-findings">Key Findings</h2>
<ul>
<li><strong>Best performer scored only 1 out of 4</strong> on the evaluation framework — a damning result across the board</li>
<li>The reason code-based AI agents succeeded first is structural: code has test suites that provide instant, objective feedback. Knowledge work has no equivalent</li>
<li>Most outcome agent demos sidestep this problem entirely, hiding it behind polished UI and impressive-looking outputs</li>
<li>Tools reviewed: <strong>Cowork, Lindy, Sauna (Obvious), and Google Opal</strong> — all tested against a 3-question framework</li>
<li>A single AI agent (likely Manus or similar) triggered a <strong>quarter-trillion-dollar selloff</strong> in enterprise software stocks, despite being a research preview that stops working when your laptop sleeps</li>
</ul>
<h2 id="heading-the-evaluation-framework-3-questions">The Evaluation Framework (3 Questions)</h2>
<p>Nate builds a framework around the feedback-loop insight to separate real agents from fake ones:</p>
<ol>
<li><strong>Does the agent know when it's wrong?</strong> (automated vs. human-only feedback)</li>
<li><strong>Is the output inspectable?</strong> (can you audit what it did and why?)</li>
<li><strong>Does context compound over time?</strong> (memory architecture that improves with use)</li>
</ol>
<h2 id="heading-practical-takeaways">Practical Takeaways</h2>
<ul>
<li><strong>Write the tests before the agent runs the work</strong> — define what 'good output' looks like before delegating (a sketch of what that can mean for prose follows this list)</li>
<li>Look for agents with <strong>inspectable surfaces</strong> — you need to see reasoning, not just results</li>
<li><strong>Memory architecture matters</strong> — agents that retain compounding context are structurally superior</li>
<li>Use the included <strong>two-phase evaluation prompt</strong> to score any agent tool, then build a delegation spec calibrated to its actual weaknesses</li>
<li>The pitch ('outcomes, not answers') <em>might</em> be right eventually — but the infrastructure to support it reliably doesn't yet exist at the level being marketed</li>
</ul>
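<p>A toy illustration of that first takeaway — the checks below are entirely our own invention; real acceptance criteria would come from your delegation spec:</p>
<pre><code># "Write the tests before the agent runs the work" — hypothetical sketch.
def acceptance_checks(draft):
    """Define 'good output' BEFORE delegating, then run every draft through it."""
    return {
        "names a recommendation": "recommend" in draft.lower(),
        "cites at least one number": any(ch.isdigit() for ch in draft),
        "fits one page": len(draft.split()) &lt;= 400,
    }

draft = "We recommend option B: it cuts onboarding time by 30%."
results = acceptance_checks(draft)
print(results)                # the human reviews failures, not everything
print(all(results.values()))  # the agent finally gets a pass/fail signal
</code></pre>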
<h2 id="heading-bottom-line">Bottom Line</h2>
<p>Outcome agents are being sold ahead of their actual capabilities. Until feedback loops in knowledge work are solved, <strong>humans remain the QA layer</strong> — and most tools aren't designed with that reality in mind.</p>
<p><img src="https://v3b.fal.media/files/b/0a95204d/QlmXeQgzLQjGUtj4PYJhK_viMHrNxd.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a95204d/lqJPwlh-A_A30l--1NM58_1dz0RY4F.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents of the Week: Papers You...]]></title><description><![CDATA[Read the original article
AI Agents of the Week – LLM Watch (Feb 8, 2026)
Main Thesis
The frontier of AI agent research is rapidly maturing across five dimensions: architecture design, multi-agent collaboration, planning under uncertainty, safety, an...]]></description><link>https://rzem.guru/ai-agents-of-the-week-papers-you-1-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-of-the-week-papers-you-1-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Sat, 04 Apr 2026 18:38:58 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a94ef77/oOg8egh8DZfd2s9xpXmTS_kfiN1Wi8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://www.llmwatch.com/p/ai-agents-of-the-week-papers-you-e74">Read the original article</a></p>
<h1 id="heading-ai-agents-of-the-week-llm-watch-feb-8-2026">AI Agents of the Week – LLM Watch (Feb 8, 2026)</h1>
<h2 id="heading-main-thesis">Main Thesis</h2>
<p>The frontier of AI agent research is rapidly maturing across five dimensions: architecture design, multi-agent collaboration, planning under uncertainty, safety, and evaluation. Agents are evolving from simple chatbots into modular, self-improving systems capable of handling complex, long-horizon tasks — but new challenges around reliability, safety, and interpretability are emerging in parallel.</p>
<hr />
<h2 id="heading-key-findings">Key Findings</h2>
<h3 id="heading-1-modular-hierarchical-amp-self-improving-architectures">1. 🏗️ Modular, Hierarchical &amp; Self-Improving Architectures</h3>
<ul>
<li><strong>S1-NexusAgent</strong> uses a dual-loop design separating global planning from tool-based subtasks, with a "Critic" module that distills successful trajectories into reusable skills.</li>
<li><strong>MARS (Modular Agent with Reflective Search)</strong> introduces cost-aware planning and reflective memory for expensive AI research workflows.</li>
<li>Agents break problems into parts, orchestrate specialised modules, and continuously build competencies over time.</li>
</ul>
<h3 id="heading-2-multi-agent-systems-standardisation-amp-teamwork-pitfalls">2. 🤝 Multi-Agent Systems: Standardisation &amp; Teamwork Pitfalls</h3>
<ul>
<li>Researchers propose reusable <strong>"agent primitives"</strong> (e.g. Review, Voting &amp; Selection, Planning &amp; Execution) composable via an organiser agent with shared key-value memory — higher accuracy, lower token cost.</li>
<li>A separate study found LLM agent teams <strong>often underperform their best individual member</strong>, with consensus-seeking causing up to <strong>37% performance drops</strong>.</li>
<li>Upside: consensus-driven teams showed unexpected <strong>resilience against adversarial members</strong>.</li>
<li>Takeaway: AI collaboration needs new mechanisms to leverage expert agents without groupthink.</li>
</ul>
<h3 id="heading-3-planning-under-uncertainty-world-models-amp-assumption-handling">3. 🧭 Planning Under Uncertainty: World Models &amp; Assumption Handling</h3>
<ul>
<li><strong>Planner-Composer-Evaluator (PCE)</strong> framework converts implicit LLM assumptions into an explicit decision tree, scoring scenarios by likelihood and cost — outperforming dialogue-heavy baselines with far less communication.</li>
<li><strong>Reinforcement World Model Learning (RWML)</strong> gives agents an internal world model, aligning imagined next states with actual outcomes — significant task success boosts even without direct reward feedback.</li>
<li>Trend: agents are shifting toward "thinking before acting" — simulating outcomes before committing to actions.</li>
</ul>
<h3 id="heading-4-safety-amp-reliability-at-the-trajectory-level">4. 🛡️ Safety &amp; Reliability at the Trajectory Level</h3>
<ul>
<li><strong>AgentHeLLM</strong> threat-modeling framework maps "Agent-to-Agent" attack pathways (e.g. in AI vehicle copilots), separating what needs protection from how attacks occur.</li>
<li>A conceptual study argues existing uncertainty quantification methods (designed for single-turn QA) <strong>break down for sequential agent decisions</strong>.</li>
<li>Proposed reframe: agent confidence as <strong>conditionally reducible uncertainty</strong> — agents should actively gather information to reduce what they don't know, rather than treating uncertainty as something that only accumulates.</li>
<li>Future designs will integrate explicit uncertainty modeling and threat assessment into decision loops.</li>
</ul>
<h3 id="heading-5-interpretability-amp-evaluation-catching-up">5. 🔍 Interpretability &amp; Evaluation Catching Up</h3>
<ul>
<li>A data-centric interpretability paper used <strong>sparse autoencoders + LLM summarisers</strong> to analyse multi-agent training logs, uncovering emergent behaviours (role-playing, language switching) and a hidden <strong>reward-hacking strategy</strong> missed by standard metrics.</li>
<li>Incorporating discovered insights via a refined prompt boosted agent performance by <strong>14%</strong>.</li>
<li>Growing call for <strong>unified evaluation frameworks</strong> — current benchmarks vary wildly due to inconsistent prompts, tools, and environments.</li>
</ul>
<hr />
<h2 id="heading-practical-takeaways">Practical Takeaways</h2>
<ul>
<li><strong>Builders</strong>: Adopt modular agent architectures with skill reuse and reflective memory to handle complex tasks more efficiently.</li>
<li><strong>Teams deploying multi-agent systems</strong>: Don't assume collaboration = better performance. Design explicit mechanisms for expert agents to lead rather than average out.</li>
<li><strong>Safety teams</strong>: Move beyond output-level checks — model threats at the trajectory level and build agents that know their own uncertainty.</li>
<li><strong>Researchers &amp; evaluators</strong>: Invest in interpretability tooling and standardised benchmarks now, before autonomous agents are deployed at scale.</li>
<li><strong>Everyone</strong>: The "safety net" (monitoring, interpretability, evaluation) must grow alongside agent capabilities — capability without accountability is a risk multiplier.</li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a94ef77/_zQOYcje_MZ6FSoDUBTmH_EFnvxzet.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a94ef77/oOg8egh8DZfd2s9xpXmTS_kfiN1Wi8.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents of the Week: Papers You...]]></title><description><![CDATA[Read the original article
AI Agents of the Week — LLM Watch (Feb 15, 2026)
Main Thesis
This week's AI agent research challenges several prevailing assumptions about how to build, guide, and scale autonomous agents — from documentation practices to mu...]]></description><link>https://rzem.guru/ai-agents-of-the-week-papers-you-1-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-of-the-week-papers-you-1-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Sat, 04 Apr 2026 18:38:06 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a94ef70/4fw4v_ct6cNwsAtquHe5Q_KsaPpDJI.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://www.llmwatch.com/p/ai-agents-of-the-week-papers-you-43c">Read the original article</a></p>
<h1 id="heading-ai-agents-of-the-week-llm-watch-feb-15-2026">AI Agents of the Week — LLM Watch (Feb 15, 2026)</h1>
<h2 id="heading-main-thesis">Main Thesis</h2>
<p>This week's AI agent research challenges several prevailing assumptions about how to build, guide, and scale autonomous agents — from documentation practices to multi-agent coordination and compute allocation.</p>
<hr />
<h2 id="heading-key-findings">Key Findings</h2>
<h3 id="heading-memory-amp-context">🧠 Memory &amp; Context</h3>
<ul>
<li><strong>AGENTS.md files hurt performance</strong>: Contrary to popular practice, repository-level context files reduce task success rates for coding agents while increasing inference costs by <strong>&gt;20%</strong>.</li>
<li><strong>Less is more</strong>: Minimal or no instructions outperform comprehensive documentation, suggesting unnecessary constraints impede agents rather than help them.</li>
</ul>
<h3 id="heading-planning-amp-environment">🗺️ Planning &amp; Environment</h3>
<ul>
<li><strong>Gaia2 benchmark</strong>: Introduces dynamic, evolving environments independent of agent actions. Best results: GPT-5 (high) at <strong>42% pass@1</strong> but struggles with time-sensitive tasks; Kimi-K2 (open-source) at <strong>21% pass@1</strong>.</li>
<li><strong>CATTS (Confidence-Aware Test-Time Scaling)</strong>: Outperforms naive uniform compute sampling by up to <strong>9.1%</strong> on WebArena-Lite while using <strong>2.3x fewer tokens</strong> — smart allocation beats brute-force compute.</li>
</ul>
<h3 id="heading-multi-agent-collaboration">🤝 Multi-Agent Collaboration</h3>
<ul>
<li><strong>Communication delays create U-shaped cooperation</strong>: Moderate delays cause LLM agents to exploit slower peers; excessive delay paradoxically reduces exploitation cycles.</li>
<li><strong>FLCOA framework</strong>: Five-layer model showing that low-level factors like communication resources fundamentally shape multi-agent cooperation — largely overlooked in current system design.</li>
<li><strong>LAVES</strong>: Hierarchical multi-agent system for educational video generation achieves <strong>&gt;1 million videos/day</strong> throughput with a <strong>95% cost reduction</strong> vs. industry standards.</li>
</ul>
<h3 id="heading-trust-amp-safety">🔒 Trust &amp; Safety</h3>
<ul>
<li><strong>Behavioral inconsistency predicts failure</strong>: ReAct agents produce <strong>2.0–4.2 distinct action sequences</strong> per 10 identical runs. Tasks with consistent paths achieve <strong>80–92% accuracy</strong>; highly inconsistent tasks drop to <strong>25–60%</strong>.</li>
<li><strong>69% of divergence occurs at step 2</strong>, meaning early decisions cascade into downstream failures — making early-step monitoring a practical intervention point.</li>
</ul>
<h3 id="heading-tools-amp-benchmarks">🛠️ Tools &amp; Benchmarks</h3>
<ul>
<li><strong>Mobile dev AI agents</strong>: Study of 2,901 AI-authored PRs across 193 Android/iOS repos. Android sees <strong>2x more AI PRs</strong> with higher acceptance (71% vs. 63% iOS). Routine tasks succeed most; structural refactors lag.</li>
<li><strong>AmbiBench</strong>: First benchmark using an instruction clarity taxonomy, shifting evaluation toward <strong>bidirectional intent alignment</strong> — addressing the reality that users often fail to articulate precise directives upfront.</li>
</ul>
<hr />
<h2 id="heading-practical-takeaways">Practical Takeaways</h2>
<ol>
<li><strong>Strip down AGENTS.md files</strong> — comprehensive instructions may be actively harming your coding agents.</li>
<li><strong>Monitor behavioral consistency</strong> as a real-time reliability signal; early divergence is a strong failure predictor (see the sketch after this list).</li>
<li><strong>Use confidence-aware compute allocation</strong> rather than scaling uniformly for better efficiency and performance.</li>
<li><strong>Design multi-agent systems with communication latency in mind</strong> — it shapes cooperation in non-obvious ways.</li>
<li><strong>Evaluate agents on ambiguous instructions</strong>, not just clean ones — AmbiBench highlights a critical gap in current benchmarking.</li>
</ol>
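<p>The consistency signal from point 2 is cheap to compute. A sketch of the general idea — our own illustration, not the paper's code:</p>
<pre><code># Run the same task N times, count distinct action sequences, and find the
# step where runs first disagree (the study puts 69% of divergence at step 2).
from collections import Counter
from itertools import zip_longest

def consistency_report(runs):
    """runs: one action sequence (a list of tool-call names) per identical run."""
    distinct = Counter(tuple(r) for r in runs)
    diverged_at = None
    for step, actions in enumerate(zip_longest(*runs), start=1):
        if len(set(actions)) != 1:  # runs disagree at this step
            diverged_at = step
            break
    return {"distinct_sequences": len(distinct), "first_divergence_step": diverged_at}

runs = [  # three "identical" runs of one task
    ["read_file", "edit", "run_tests"],
    ["read_file", "search", "edit", "run_tests"],
    ["read_file", "edit", "run_tests"],
]
print(consistency_report(runs))
# {'distinct_sequences': 2, 'first_divergence_step': 2}
</code></pre>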
<p><img src="https://v3b.fal.media/files/b/0a94ef70/qmtv9M7nHh_V3DQbCxyGr_M5AnHJVz.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a94ef70/4fw4v_ct6cNwsAtquHe5Q_KsaPpDJI.png" alt="Infographic wide" /></p>
]]></content:encoded></item><item><title><![CDATA[AI Agents of the Week: Papers You...]]></title><description><![CDATA[Read the original article
AI Agents of the Week – LLM Watch (Feb 22, 2026)
Main Thesis
This weekly research roundup from LLM Watch highlights five key areas where AI agent research is rapidly advancing: memory & continual learning, planning under unc...]]></description><link>https://rzem.guru/ai-agents-of-the-week-papers-you-1-1-1-1</link><guid isPermaLink="true">https://rzem.guru/ai-agents-of-the-week-papers-you-1-1-1-1</guid><category><![CDATA[AI]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[newsletter]]></category><dc:creator><![CDATA[Alex Rzem]]></dc:creator><pubDate>Sat, 04 Apr 2026 18:37:01 GMT</pubDate><enclosure url="https://v3b.fal.media/files/b/0a94ef6a/2Mlt-iDGn2DR8DkdBIop1_CgVD48nu.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://www.llmwatch.com/p/ai-agents-of-the-week-papers-you-6f1">Read the original article</a></p>
<h1 id="heading-ai-agents-of-the-week-llm-watch-feb-22-2026">AI Agents of the Week – LLM Watch (Feb 22, 2026)</h1>
<h2 id="heading-main-thesis">Main Thesis</h2>
<p>This weekly research roundup from LLM Watch highlights five key areas where AI agent research is rapidly advancing: memory &amp; continual learning, planning under uncertainty, multi-agent collaboration, trust &amp; safety, and practical tooling.</p>
<hr />
<h2 id="heading-key-findings">Key Findings</h2>
<h3 id="heading-memory-amp-continual-learning">🧠 Memory &amp; Continual Learning</h3>
<ul>
<li><strong>IntentCUA</strong> introduces <em>intent-level representations</em> that convert raw interaction traces into reusable skills.</li>
<li>Achieves a <strong>74.83% task success rate</strong> with a <strong>Step Efficiency Ratio of 0.91</strong> on desktop automation tasks.</li>
<li>Uses a coordinated Planner, Plan-Optimizer, and Critic sharing memory to stabilise long-horizon execution.</li>
</ul>
<h3 id="heading-planning-amp-environment-interaction">🗺️ Planning &amp; Environment Interaction</h3>
<ul>
<li><strong>AgentConductor</strong> uses reinforcement learning to evolve multi-agent communication topologies dynamically.</li>
<li>Delivers up to <strong>14.6% improvement in pass@1 accuracy</strong> over baselines for code generation.</li>
<li>Density-aware layered DAG construction <strong>reduces token costs by 68%</strong> — a major efficiency win for compute-constrained deployments.</li>
</ul>
<h3 id="heading-multi-agent-collaboration-amp-control">🤝 Multi-Agent Collaboration &amp; Control</h3>
<ul>
<li>AgentConductor shows that <strong>adapting topology to task difficulty</strong> outperforms fixed communication graphs, with <strong>13% density reductions</strong> alongside accuracy gains.</li>
<li><strong>AutoNumerics</strong> applies multi-agent orchestration to scientific computing, autonomously designing and verifying PDE solvers across <strong>24 canonical problems</strong>.</li>
<li>Key insight: <em>the architecture of agent collaboration matters more than individual agent capability.</em></li>
</ul>
<h3 id="heading-trust-verification-amp-safety">🔒 Trust, Verification &amp; Safety</h3>
<ul>
<li><strong>Wink</strong> is a production-deployed system for recovering from coding agent misbehaviours.</li>
<li>Found that <strong>~30% of all agent trajectories</strong> contain misbehaviours: Specification Drift, Reasoning Problems, or Tool Call Failures.</li>
<li>Lightweight self-intervention resolved <strong>90% of single-intervention misbehaviours</strong> and reduced engineer interventions in live A/B testing.</li>
<li><strong>CowCorpus</strong> provides a taxonomy of human intervention patterns, enabling models to predict user interventions with a <strong>61.4–63.4% improvement</strong> over baselines.</li>
</ul>
<h3 id="heading-tools-amp-frameworks-in-practice">🛠️ Tools &amp; Frameworks in Practice</h3>
<ul>
<li><em>How AI Coding Agents Communicate</em> analyses pull requests across five AI coding agents.</li>
<li>Finds that <strong>presentation style correlates with reviewer engagement and merge outcomes</strong> — agents that communicate clearly get their PRs merged more often.</li>
</ul>
<hr />
<h2 id="heading-practical-takeaways">Practical Takeaways</h2>
<ul>
<li><strong>Build for long horizons</strong>: Intent-level memory abstraction (IntentCUA) is a viable path to more reliable long-running agents.</li>
<li><strong>Dynamic topology &gt; static graphs</strong>: Fixed multi-agent communication structures leave significant performance and cost on the table.</li>
<li><strong>Expect ~30% misbehaviour rates</strong>: Production agent systems need built-in recovery mechanisms, not just prevention.</li>
<li><strong>Human-in-the-loop is predictable</strong>: Models can now anticipate when humans will intervene, enabling proactive agent self-correction.</li>
<li><strong>Agent communication style matters</strong>: How an agent explains its work affects real-world outcomes like code review acceptance.</li>
</ul>
<p><img src="https://v3b.fal.media/files/b/0a94ef6a/pVaRKyj_v4WaDlAiIBgc4_umQ3NDQb.png" alt="Infographic" /></p>
<p><img src="https://v3b.fal.media/files/b/0a94ef6a/2Mlt-iDGn2DR8DkdBIop1_CgVD48nu.png" alt="Infographic wide" /></p>
]]></content:encoded></item></channel></rss>