AI Agents Weekly: Claude Sonnet 4.6, Gemini 3.1 Pro & More

Overview

Elvis Saravia's AI Agents Weekly newsletter (Feb 21, 2026) covers a packed week of major AI releases and agent-focused developments, highlighting significant leaps in autonomous computer use, coding agents, and AI benchmarking.

🔑 Top Stories (Accessible Content)

1. Claude Sonnet 4.6 — Anthropic

On February 17, 2026, Anthropic released Claude Sonnet 4.6 and made it the new default model for all Claude users.

Computer Use Breakthrough: OSWorld benchmark scores rose from 14.9% to 72.5% (nearly a 5x improvement), which the newsletter presents as making it the most capable model for autonomous GUI-based agent workflows.
1M Token Context Window (Beta): Enables agents to process entire codebases, long documents, and multi-session histories without losing earlier context.
User Preference: In blind A/B tests, users preferred Sonnet 4.6 over Sonnet 4.5 ~70% of the time, particularly for coding, instruction following, and nuanced reasoning.
Pricing: $3/$15 per million input/output tokens — cost-efficient for high-volume agent deployments.
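To make the pricing concrete, here is a minimal sketch of a monthly cost estimate at the quoted $3/$15 rates. The `Workload` class, the function name, and the workload numbers are hypothetical and chosen only for illustration; the sketch assumes flat per-token pricing and ignores any discounts or surcharges that may apply in practice.

```python
from dataclasses import dataclass

# Rates quoted in the newsletter, in USD per one million tokens.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


@dataclass
class Workload:
    """Hypothetical monthly agent workload (illustrative numbers only)."""
    runs: int
    input_tokens_per_run: int
    output_tokens_per_run: int


def estimate_cost_usd(w: Workload) -> float:
    """Estimate spend assuming flat per-token pricing (an assumption)."""
    total_input = w.runs * w.input_tokens_per_run
    total_output = w.runs * w.output_tokens_per_run
    input_cost = total_input / 1_000_000 * INPUT_USD_PER_MTOK
    output_cost = total_output / 1_000_000 * OUTPUT_USD_PER_MTOK
    return input_cost + output_cost


if __name__ == "__main__":
    # 10,000 agent runs, each reading 20k tokens and writing 2k tokens.
    example = Workload(runs=10_000, input_tokens_per_run=20_000, output_tokens_per_run=2_000)
    print(f"${estimate_cost_usd(example):,.2f}")  # prints $900.00
```

In this made-up workload, input tokens account for $600 and output tokens for $300, so trimming context can matter as much as trimming responses even though output is priced 5x higher per token.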

Practical Takeaway: Sonnet 4.6 is a significant upgrade for anyone building agentic or computer-use workflows; the nearly 5x OSWorld improvement alone makes it a compelling default choice.

2. EVMBench — AI Agents vs. Smart Contract Security

OpenAI and Paradigm introduced EVMBench, a benchmark evaluating AI agents on smart contract security tasks across 120 curated vulnerabilities from 40 audits.

Three Tasks: Detect, patch, and exploit high-severity smart contract vulnerabilities.
Exploit-First Strength: Agents perform best at exploitation (where the goal is clear — drain funds) but struggle with exhaustive detection and patching tasks.
Real-World Sources: Scenarios are drawn from open code audit competitions and from security auditing of the Tempo blockchain (a purpose-built L1 for high-throughput stablecoin payments).
Key Limitation: Agents often stop after finding a single vulnerability rather than auditing comprehensively — a critical gap for security-critical deployments.
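The early-stopping limitation can be illustrated with a simple recall calculation. With 120 vulnerabilities across 40 audits, the benchmark averages three per audit, so an agent that reports one real issue and stops would cover about a third of them. The sketch below is illustrative only: the `detection_recall` function and the vulnerability labels are hypothetical, and this is not claimed to be EVMBench's actual scoring method.

```python
def detection_recall(found: set[str], ground_truth: set[str]) -> float:
    """Fraction of known vulnerabilities that the agent reported.

    A generic recall metric used for illustration; EVMBench's real
    scoring may differ.
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    return len(found & ground_truth) / len(ground_truth)


# Hypothetical audit with three known high-severity issues
# (matching the benchmark's average of 120 / 40 = 3 per audit).
ground_truth = {
    "reentrancy-withdraw",
    "unchecked-oracle-price",
    "missing-access-control",
}

# Agent A stops after its first finding; agent B audits exhaustively.
early_stop_report = {"reentrancy-withdraw"}
full_report = set(ground_truth)

if __name__ == "__main__":
    print(f"early stop: {detection_recall(early_stop_report, ground_truth):.2f}")
    print(f"full audit: {detection_recall(full_report, ground_truth):.2f}")
```

Under these assumptions, a single correct finding is enough to succeed at an exploit task but leaves two of three issues unreported in a detection task, which is one way to read the gap between the two results.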

Practical Takeaway: AI agents show promise for exploit discovery but are not yet reliable for full-coverage security auditing. Human oversight remains essential in smart contract security workflows.

📰 Other Headlines (Paywalled)

The following stories are referenced but behind the paywall:

Gemini 3.1 Pro — Google launches with 77% ARC-AGI-2 score
Stripe Minions — Coding agents deployed at scale
Cloudflare Code Mode MCP — Claims 99.9% token savings
Qwen 3.5 — Alibaba drops agentic vision model
ggml.ai joins Hugging Face — Local AI integration
Anthropic measures AI agent autonomy in practice
AI agent autonomously publishes a hit piece
dmux — Multiplexes AI coding agents in parallel
New benchmarks for agent memory and reliability

🧠 Key Themes This Week

Computer use agents are maturing fast — Sonnet 4.6's OSWorld leap signals GUI automation is becoming production-ready.
Security + AI agents — EVMBench highlights both the promise and the gaps in AI-driven smart contract auditing.
Cost-efficiency at scale — Competitive pricing and token savings (Cloudflare's 99.9% claim) are central themes as agentic deployments scale.
Parallelism & memory — New tools (dmux) and benchmarks focus on running multiple agents reliably and with better memory.