AI Agents Weekly: Evaluating Agents



AI Agents Weekly: Evaluating AGENTS.md & More

From Elvis Saravia's AI Newsletter, February 28, 2026

Main Thesis

This issue covers a wide range of AI agent developments, with the headline story challenging a widely adopted practice: using repository-level context files (like AGENTS.md or CLAUDE.md) to guide coding agents. Counterintuitively, research shows these files may be doing more harm than good.


🔬 Key Finding: AGENTS.md Files Hurt Coding Agent Performance

Researchers from UIUC and Microsoft Research evaluated whether repository-level context files actually improve coding agent performance on SWE-bench benchmarks.

Surprising results:

  • โŒ Lower success rates โ€” Both LLM-generated and human-written context files caused agents to solve fewer tasks compared to agents given no repository context at all.
  • ๐Ÿ’ธ Higher inference costs โ€” Context files increased inference costs by over 20%.
  • ๐Ÿ” Broader but less effective exploration โ€” Agents with context files explored more (more testing, more file traversal), but the additional constraints made tasks harder, not easier.
  • โœ… Minimal is better โ€” The authors recommend context files describe only minimal requirements rather than comprehensive specifications, as unnecessary constraints actively hurt performance.

Practical takeaway: Developers should rethink how they write AGENTS.md, CLAUDE.md, and similar files: focus on essential guardrails only, not exhaustive instructions.



Story | Summary
Perplexity Computer | Perplexity launches a computer-use agent for end-to-end task automation
Google Nano Banana 2 | Google releases Nano Banana 2 model for free
Sakana AI Doc-to-LoRA & Text-to-LoRA | Tools for fine-tuning models directly from documents or text
Notion Custom Agents 3.3 | Notion launches custom agent capabilities in version 3.3
Nous Research Hermes Agent | Open-source agent model released by Nous Research
GPT-5.3-Codex | OpenAI makes GPT-5.3-Codex available to all developers
Mercury 2 | New reasoning diffusion LLM ships from Mercury
Qwen 3.5 Medium Series | Alibaba drops a new medium-sized Qwen model series
Claude Code Auto-Memory | Anthropic ships auto-memory across sessions for Claude Code
RoguePilot | Security vulnerability exposed in GitHub Copilot
Vercel Chat SDK | Vercel open-sources a Chat SDK for multi-platform bot development

💡 Practical Takeaways

  1. Less is more when writing agent context files: avoid over-specifying agent behaviour.
  2. Benchmark your context files: don't assume that more instructions mean better agent performance.
  3. The AI tooling ecosystem is rapidly expanding across coding, browser automation, fine-tuning, and memory management.
  4. Security remains a concern: vulnerabilities like RoguePilot show that popular AI coding assistants can be exploited.
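One way to act on the second takeaway is to run the same tasks with and without a context file and compare success rate and cost. The sketch below is only an illustration in Python: the `RunResult` record, the `summarize` and `compare` helpers, and all the numbers are hypothetical and do not come from the paper or from any particular agent framework. It assumes you already log, for each task, whether the agent solved it and what the run cost.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one agent attempt on one task (hypothetical record shape)."""
    task_id: str
    solved: bool
    cost_usd: float


def summarize(runs):
    """Return (success_rate, mean_cost_usd) for a list of RunResult."""
    if not runs:
        raise ValueError("no runs to summarize")
    success_rate = sum(r.solved for r in runs) / len(runs)
    mean_cost = sum(r.cost_usd for r in runs) / len(runs)
    return success_rate, mean_cost


def compare(baseline, with_context):
    """Compare runs without a context file against runs with one."""
    # Both arms must cover the same tasks, otherwise the comparison is unfair.
    if {r.task_id for r in baseline} != {r.task_id for r in with_context}:
        raise ValueError("both arms must cover the same task ids")
    base_rate, base_cost = summarize(baseline)
    ctx_rate, ctx_cost = summarize(with_context)
    if base_cost == 0:
        raise ValueError("baseline cost is zero; cannot compute a percentage")
    return {
        "success_rate_delta": ctx_rate - base_rate,
        "cost_increase_pct": 100.0 * (ctx_cost - base_cost) / base_cost,
    }


# Made-up illustrative data, not results from the paper.
baseline = [
    RunResult("t1", True, 1.00),
    RunResult("t2", True, 1.00),
    RunResult("t3", False, 1.00),
    RunResult("t4", False, 1.00),
]
with_context = [
    RunResult("t1", True, 1.25),
    RunResult("t2", False, 1.25),
    RunResult("t3", False, 1.25),
    RunResult("t4", False, 1.25),
]

report = compare(baseline, with_context)
print(report)
```

A negative `success_rate_delta` together with a positive `cost_increase_pct` would suggest the context file is not paying for itself on that task set. With only a handful of tasks the difference may be noise, so a larger sample is worth running before drawing conclusions.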
