LLM Watch Weekly: When Scale Isn't Enough (Feb 27 2026)

Originally published on LLM Watch by Pascal Biese — February 27, 2026.

Summary

This week's LLM Watch Weekly digs into three key papers that challenge conventional assumptions about scaling, fine-tuning, and RAG conversations — plus a selection of specialised benchmarks. The central theme: bigger isn't always better.

Scale Can't Overcome Pragmatics: Reporting Bias in Vision-Language Models

Scale Can't Overcome Pragmatics — The paper challenges the assumption that reasoning capabilities like counting, spatial relationships, negation, and temporal reasoning will emerge with scale.

The core problem: Humans don't caption images with information that's "visually obvious." A photo captioned "at the game today!" doesn't include counts, spatial prepositions, or temporal details — because the communicative purpose of captions is to add context, not describe the obvious. This "reporting bias" means even web-scale datasets systematically lack the annotations needed to supervise these reasoning skills.

Key findings:

Counting information appeared in fewer than 8% of captions across datasets
Spatial prepositions beyond "on" and "in" were rare
Tests across OpenCLIP, LLaVA-1.5, and Molmo confirmed that scaling data size, model size, or language diversity did not improve these reasoning skills
By contrast, collecting targeted annotations of tacit information substantially improved performance

Takeaway: If your application requires counting, spatial reasoning, or negation, expect failures regardless of model size. The fix is targeted data curation, not larger models.
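
One rough way to check whether your own caption data shows this kind of reporting bias is to measure how often captions mention counts, spatial relations, or negation at all. The sketch below is a keyword audit, not the paper's method: the word lists and the function name `audit_captions` are assumptions made for illustration, and keyword matching will both miss cases and produce false positives (for example, "one" used as a pronoun).

```python
import re

# Hypothetical word lists for illustration only; the paper's actual
# annotation scheme is not described in this digest.
COUNT_WORDS = r"\b(?:one|two|three|four|five|six|seven|eight|nine|ten|\d+)\b"
SPATIAL_WORDS = r"\b(?:above|below|behind|beside|between|under|left of|right of|in front of)\b"
NEGATION_WORDS = r"\b(?:no|not|without|never|none)\b"

PATTERNS = {
    "counting": re.compile(COUNT_WORDS, re.IGNORECASE),
    "spatial": re.compile(SPATIAL_WORDS, re.IGNORECASE),
    "negation": re.compile(NEGATION_WORDS, re.IGNORECASE),
}


def audit_captions(captions):
    """Return the fraction of captions that mention each kind of tacit detail."""
    total = len(captions)
    if total == 0:
        return {name: 0.0 for name in PATTERNS}
    return {
        name: sum(1 for caption in captions if pattern.search(caption)) / total
        for name, pattern in PATTERNS.items()
    }


# Toy captions invented for this example, not taken from any dataset.
captions = [
    "at the game today!",
    "two dogs behind the fence",
    "sunset from the balcony",
    "no filter needed",
]
coverage = audit_captions(captions)
print(coverage)
```

On these four toy captions each category is mentioned in exactly one caption, so every rate comes out at 0.25. On a real corpus, low rates would point to where targeted annotation is likely to be needed.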

Fine-Tuning Without Forgetting In-Context Learning

Fine-Tuning Without Forgetting ICL — Addresses the classic trade-off: fine-tuning improves zero-shot performance but degrades in-context learning (few-shot) ability.

The insight: attention can be decomposed into query/key projections, which control where to look, and value matrices, which control what to extract. Fine-tuning all parameters lets the optimization corrupt the representations that enable in-context learning, whereas restricting updates to the value matrices preserves ICL while still delivering zero-shot improvements.

Key findings:

Value-matrix-only fine-tuning preserved 94% of original few-shot performance
Full parameter fine-tuning reduced few-shot accuracy by 23-31%
An auxiliary few-shot loss improved ICL on the fine-tuning task by 12% but degraded held-out tasks by 8-15%

Takeaway: Freeze query and key projections; update only value matrices. This is easy to implement in standard training frameworks and requires no architectural changes.
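
The freezing step can be sketched as a filter over parameter names. This is a minimal illustration rather than the authors' code. It assumes the value projection is identifiable by the substring `v_proj`, a naming convention used by several Hugging Face decoder implementations that may differ in your model. The helper name `freeze_all_but_value` is hypothetical. Stand-in objects replace real tensors so the example runs without a deep learning framework.

```python
from types import SimpleNamespace

# Assumption: attention projections are named q_proj / k_proj / v_proj.
# Adjust the marker to match your model's parameter names.
TRAINABLE_MARKER = "v_proj"


def freeze_all_but_value(named_parameters, marker=TRAINABLE_MARKER):
    """Enable gradients only for parameters whose name contains `marker`.

    Accepts any iterable of (name, param) pairs where `param` exposes a
    `requires_grad` attribute, such as PyTorch's `model.named_parameters()`.
    Returns the names that remain trainable.
    """
    trainable = []
    for name, param in named_parameters:
        param.requires_grad = marker in name
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in parameters (hypothetical names) instead of real tensors.
params = {
    name: SimpleNamespace(requires_grad=True)
    for name in [
        "layers.0.self_attn.q_proj.weight",
        "layers.0.self_attn.k_proj.weight",
        "layers.0.self_attn.v_proj.weight",
        "layers.0.self_attn.o_proj.weight",
        "layers.0.mlp.up_proj.weight",
    ]
}
trainable_names = freeze_all_but_value(params.items())
print(trainable_names)
```

With a PyTorch model you would pass `model.named_parameters()` instead of the stand-ins and then build the optimizer only from parameters that still have `requires_grad` set. Note that this sketch freezes everything except the value projection, including the output projection and MLP weights, which is one reading of "update only value matrices".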

MTRAG-UN: Multi-Turn RAG Conversations in the Wild

MTRAG-UN — A benchmark of 666 tasks, 2,800+ conversation turns across six domains, specifically targeting real-world RAG failure modes.

The failure modes under test:

Unanswerable questions (corpus lacks the answer)
Underspecified questions (multiple valid interpretations)
Non-standalone questions (require conversational context)
Unclear responses (ambiguous or incomplete outputs)

Key findings:

Even the best models correctly identified unanswerable questions only 38% of the time; the rest of the time they hallucinated an answer
Retrieval accuracy on non-standalone questions: 41% vs. 73% on standalone — a 32 percentage point drop
Models asked for clarification on underspecified questions only 11% of the time (defaulting to one interpretation without acknowledging ambiguity in 89% of cases)
By the 5th conversation turn, retrieval accuracy had dropped 18 percentage points from turn 1
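
Rates like the abstention and clarification figures above can be tracked in your own evaluation with a simple per-category tally. The sketch below is illustrative bookkeeping, not the benchmark's scoring code: the labels (`abstain`, `clarify`, `answer`), the `DESIRED_BEHAVIOR` mapping, and the function name `behavior_rates` are all assumptions, and the records are invented toy data.

```python
from collections import defaultdict

# Hypothetical mapping from question type to the behavior we want to see.
DESIRED_BEHAVIOR = {
    "unanswerable": "abstain",
    "underspecified": "clarify",
}


def behavior_rates(records):
    """Fraction of turns per question type where the model did the desired thing.

    Each record is a (question_type, observed_behavior) pair. Question types
    without an entry in DESIRED_BEHAVIOR are ignored.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for question_type, observed in records:
        desired = DESIRED_BEHAVIOR.get(question_type)
        if desired is None:
            continue
        totals[question_type] += 1
        if observed == desired:
            hits[question_type] += 1
    return {qt: hits[qt] / totals[qt] for qt in totals}


# Toy records invented for this example, not benchmark results.
records = [
    ("unanswerable", "abstain"),
    ("unanswerable", "answer"),
    ("underspecified", "answer"),
    ("underspecified", "clarify"),
    ("underspecified", "answer"),
    ("underspecified", "answer"),
    ("standalone", "answer"),
]
rates = behavior_rates(records)
print(rates)
```

Here the model abstains on one of two unanswerable questions (0.5) and asks for clarification on one of four underspecified ones (0.25). Reporting these rates separately keeps a strong single-turn answer accuracy from masking weak abstention or clarification behavior.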

Takeaway: Single-turn benchmark performance is deeply misleading for conversational RAG. Design explicitly for unanswerable detection, clarification requests, and context compression.
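
For non-standalone questions, one simple baseline is to give the retriever some conversational context instead of the bare follow-up. The sketch below prepends the most recent user turns to the current question. It is a deliberately crude stand-in for context compression, and the function name `build_retrieval_query` and the `max_turns` parameter are assumptions for illustration; production systems often rewrite the question with a model instead.

```python
def build_retrieval_query(history, question, max_turns=2):
    """Prefix the current question with the most recent user turns.

    `history` is a list of earlier user questions, oldest first. Keeping only
    the last `max_turns` entries is a crude form of context compression.
    """
    # Guard against max_turns == 0: history[-0:] would return the whole list.
    recent = history[-max_turns:] if max_turns > 0 else []
    return " ".join(recent + [question])


# Toy conversation invented for this example.
history = [
    "What is the refund policy for annual plans?",
    "Does that apply to team accounts?",
]
query = build_retrieval_query(history, "And how long does it take?")
print(query)
```

The follow-up "And how long does it take?" carries no retrievable content on its own; with the two prior turns attached, the query at least mentions refunds and team accounts. Whether this helps or hurts retrieval depends on the retriever, so it is worth measuring separately on standalone and non-standalone turns.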

Additional Papers This Week

SC-Arena — Benchmark for single-cell biology reasoning. Models achieve 78% on cell type annotation but only 34% on perturbation prediction — confirming a pattern/mechanism split: models learn to recognize, not understand causally.

PATRA — Pattern-aware alignment for time series QA. Current text/image representations miss the structural patterns experts actually use.

pMoE — Prompting diverse MoE experts together for better visual adaptation.

MM-NeuroOnco — Multimodal benchmark for MRI-based brain tumor diagnosis.

SPM-Bench — Benchmarking LLMs for scanning probe microscopy.

SOTAlign — Semi-supervised alignment of vision and language models via optimal transport.

RhythmBERT — Self-supervised language model on ECG waveforms for heart disease detection.

InnerQ — Hardware-aware tuning-free KV cache quantization for LLMs.

Key Takeaways

Scaling doesn't fix data bias: Reporting bias in training data is a structural problem; targeted curation beats more of the same biased data
Fine-tune smarter: Freeze Q/K projections, only update value matrices to preserve few-shot flexibility
Conversational RAG is broken: Single-turn benchmarks hide catastrophic failure rates on realistic multi-turn interactions
Pattern ≠ mechanism: Models recognize patterns well; causal/mechanistic reasoning remains deeply limited across domains (VLMs, biology, time series)