AI Agents Weekly: Evaluating Agents

AI Agents Weekly: Evaluating AGENTS.md & More
From Elvis Saravia's AI Newsletter โ February 28, 2026
Main Thesis
This issue covers a wide range of AI agent developments, with the headline story challenging a widely adopted practice: using repository-level context files (like AGENTS.md or CLAUDE.md) to guide coding agents. Counterintuitively, research shows these files may be doing more harm than good.
๐ฌ Key Finding: AGENTS.md Files Hurt Coding Agent Performance
Researchers from UIUC and Microsoft Research evaluated whether repository-level context files actually improve coding agent performance on SWE-bench benchmarks.
Surprising results:
- โ Lower success rates โ Both LLM-generated and human-written context files caused agents to solve fewer tasks compared to agents given no repository context at all.
- ๐ธ Higher inference costs โ Context files increased inference costs by over 20%.
- ๐ Broader but less effective exploration โ Agents with context files explored more (more testing, more file traversal), but the additional constraints made tasks harder, not easier.
- โ Minimal is better โ The authors recommend context files describe only minimal requirements rather than comprehensive specifications, as unnecessary constraints actively hurt performance.
Practical takeaway: Developers should rethink how they write AGENTS.md, CLAUDE.md, and similar files โ focus on essential guardrails only, not exhaustive instructions.
๐ฐ Other Stories Covered (Paywalled)
| Story | Summary |
| Perplexity Computer | Perplexity launches a computer-use agent for end-to-end task automation |
| Google Nano Banana 2 | Google releases Nano Banana 2 model for free |
| Sakana AI Doc-to-LoRA & Text-to-LoRA | Tools for fine-tuning models directly from documents or text |
| Notion Custom Agents 3.3 | Notion launches custom agent capabilities in version 3.3 |
| Nous Research Hermes Agent | Open-source agent model released by Nous Research |
| GPT-5.3-Codex | OpenAI makes GPT-5.3-Codex available to all developers |
| Mercury 2 | New reasoning diffusion LLM ships from Mercury |
| Qwen 3.5 Medium Series | Alibaba drops a new medium-sized Qwen model series |
| Claude Code Auto-Memory | Anthropic ships auto-memory across sessions for Claude Code |
| RoguePilot | Security vulnerability exposed in GitHub Copilot |
| Vercel Chat SDK | Vercel open-sources a Chat SDK for multi-platform bot development |
๐ก Practical Takeaways
- Less is more when writing agent context files โ avoid over-specifying agent behaviour.
- Benchmark your context files โ don't assume that more instructions equals better agent performance.
- The AI tooling ecosystem is rapidly expanding across coding, browser automation, fine-tuning, and memory management.
- Security remains a concern as tools like RoguePilot highlight vulnerabilities in popular AI coding assistants.








