AI Agents Weekly: Claude Sonnet 4.6, Gemini 3.1 Pro & More
Overview
Elvis Saravia's AI Agents Weekly newsletter (Feb 21, 2026) covers a packed week of major AI releases and agent-focused developments, highlighting significant leaps in autonomous computer use, coding agents, and AI benchmarking.
Top Stories (Accessible Content)
1. Claude Sonnet 4.6: Anthropic
Anthropic released Claude Sonnet 4.6 as the new default model for all Claude users on February 17, 2026.
- Computer Use Breakthrough: OSWorld benchmark scores jumped from 14.9% to 72.5% (nearly 5x improvement), making it the most capable model for autonomous GUI-based agent workflows.
- 1M Token Context Window (Beta): Enables agents to process entire codebases, long documents, and multi-session histories without losing earlier context.
- User Preference: In blind A/B tests, users preferred Sonnet 4.6 over Sonnet 4.5 ~70% of the time, particularly for coding, instruction following, and nuanced reasoning.
- Pricing: $3/$15 per million input/output tokens, cost-efficient for high-volume agent deployments.
Practical Takeaway: Sonnet 4.6 is a significant upgrade for anyone building agentic or computer-use workflows; the nearly 5x OSWorld improvement alone makes it a compelling default choice.
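As a rough illustration of the figures above, the following sketch computes the OSWorld improvement ratio and estimates what a workload might cost at the quoted $3/$15 per million token rates. The function names `improvement_ratio` and `estimate_cost_usd` and the example token counts are hypothetical, chosen only for illustration; actual billing may depend on factors not covered in this summary.

```python
# Figures quoted in the newsletter summary.
OSWORLD_BEFORE = 14.9   # percent
OSWORLD_AFTER = 72.5    # percent
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens


def improvement_ratio(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate cost in USD from token counts at the quoted per-million rates."""
    input_cost = (input_tokens / 1_000_000) * INPUT_PRICE_PER_MTOK
    output_cost = (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


if __name__ == "__main__":
    ratio = improvement_ratio(OSWORLD_BEFORE, OSWORLD_AFTER)
    print(f"OSWorld improvement: {ratio:.2f}x")
    # Hypothetical workload: 2M input tokens and 500K output tokens.
    print(f"Estimated cost: ${estimate_cost_usd(2_000_000, 500_000):.2f}")
```

Because output tokens cost five times as much as input tokens at these rates, agents that produce long outputs will see costs driven mostly by the output side.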
2. EVMBench: AI Agents vs. Smart Contract Security
OpenAI and Paradigm introduced EVMBench, a benchmark evaluating AI agents on smart contract security tasks across 120 curated vulnerabilities from 40 audits.
- Three Tasks: Detect, patch, and exploit high-severity smart contract vulnerabilities.
- Exploit-First Strength: Agents perform best at exploitation (where the goal is clear: drain funds) but struggle with exhaustive detection and patching tasks.
- Real-World Sources: Scenarios sourced from open code audit competitions and from the security auditing process for the Tempo blockchain (a purpose-built L1 for high-throughput stablecoin payments).
- Key Limitation: Agents often stop after finding a single vulnerability rather than auditing comprehensively, a serious gap for security-critical deployments.
Practical Takeaway: AI agents show promise for exploit discovery but are not yet reliable for full-coverage security auditing. Human oversight remains essential in smart contract security workflows.
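The "stops after one finding" limitation can be made concrete with a simple recall calculation. This is a minimal sketch, not EVMBench's actual scoring method, which this summary does not describe; the function `detection_recall` and the vulnerability labels are hypothetical.

```python
def detection_recall(known: set[str], reported: set[str]) -> float:
    """Fraction of known vulnerabilities that the agent reported."""
    if not known:
        raise ValueError("known vulnerability set must not be empty")
    return len(known & reported) / len(known)


if __name__ == "__main__":
    # Hypothetical audit with four known high-severity issues.
    known = {"reentrancy", "unchecked-call", "access-control", "price-oracle"}
    # An agent that stops after its first finding.
    early_stop = {"reentrancy"}
    # An agent that keeps auditing and also reports one false positive.
    thorough = {"reentrancy", "unchecked-call", "access-control", "gas-griefing"}
    print(f"early-stop recall: {detection_recall(known, early_stop):.2f}")
    print(f"thorough recall:   {detection_recall(known, thorough):.2f}")
```

Recall only measures coverage of known issues; it ignores false positives, so a real evaluation would likely pair it with a precision-style measure.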
Other Headlines (Paywalled)
The following stories are referenced but behind the paywall:
- Gemini 3.1 Pro: Google launches with 77% ARC-AGI-2 score
- Stripe Minions: Coding agents deployed at scale
- Cloudflare Code Mode MCP: Claims 99.9% token savings
- Qwen 3.5: Alibaba drops agentic vision model
- ggml.ai joins Hugging Face: Local AI integration
- Anthropic measures AI agent autonomy in practice
- AI agent autonomously publishes a hit piece
- dmux: Multiplexes AI coding agents in parallel
- New benchmarks for agent memory and reliability
Key Themes This Week
- Computer use agents are maturing fast: Sonnet 4.6's OSWorld leap signals GUI automation is becoming production-ready.
- Security + AI agents: EVMBench highlights both the promise and the gaps in AI-driven smart contract auditing.
- Cost-efficiency at scale: Competitive pricing and token savings (Cloudflare's 99.9% claim) are central themes as agentic deployments scale.
- Parallelism & memory: New tools (dmux) and benchmarks focus on running multiple agents reliably and with better memory.
Papers Mentioned
- EVMBench is referenced via a blog post; no direct arXiv link was accessible from the paywalled content.







