LLM Watch Weekly: When Scale Isn't...

LLM Watch Weekly: When Scale Isn't Enough
Main Thesis
This edition challenges the prevailing assumption that scaling AI models โ more data, bigger models โ automatically solves capability gaps. Across multiple research papers, a clearer picture emerges: targeted interventions beat raw scale, evaluation benchmarks are finally catching up to real-world complexity, and the tension between specialisation and flexibility remains a central open problem.
Key Research Summaries
1. ๐ท Vision-Language Models & Reporting Bias
Core finding: VLMs consistently fail at counting, spatial reasoning, negation, and temporal reasoning โ not because of insufficient scale, but because human captions systematically omit this information (reporting bias). People caption photos to add context, not to describe what's visually obvious.
- Counting info appeared in fewer than 8% of captions across datasets
- Scaling data size, model size, or language diversity did not improve these specific capabilities
- Fix: Curating annotations that explicitly capture tacit information (counts, spatial relationships, temporal details) produced substantial improvement
- Even synthetically generated data inherited reporting bias from human-produced training text
2. ๐ง Fine-Tuning Without Losing In-Context Learning
Core finding: Fine-tuning all attention parameters degrades few-shot/in-context learning ability. But updating only the value matrix (freezing query and key projections) preserves in-context learning while still achieving zero-shot gains.
- Full fine-tuning reduced few-shot accuracy by 23โ31%
- Value-matrix-only fine-tuning preserved 94% of original few-shot performance
- Theoretical explanation: Q/K projections govern where the model attends (in-context learning mechanism); V projections govern what is extracted (task-specific knowledge)
- Easy to implement in standard frameworks โ no architectural changes required
3. ๐ฌ Multi-Turn RAG Failures (MTRAG-UN Benchmark)
Core finding: Real conversational RAG fails badly on realistic edge cases that current single-turn benchmarks ignore.
- Models correctly identified unanswerable questions only 38% of the time (rest hallucinated)
- Retrieval accuracy on non-standalone questions: 41% vs. 73% on standalone questions
- Models asked for clarification on underspecified questions only 11% of the time
- By the 5th conversation turn, retrieval accuracy dropped 18 percentage points
- Benchmark (666 tasks, 2,800+ turns, 6 domains) is publicly available
4. ๐ฌ Single-Cell Biology Reasoning (SC-Arena)
Core finding: LLMs can describe cells reasonably well but fail at mechanistic/causal reasoning.
- Cell type annotation accuracy: ~78% for top models
- Perturbation prediction accuracy: only 34%
- Causal QA accuracy: below 30%
- Knowledge-augmented evaluation (using biological ontologies) achieved 0.89 correlation with expert biologists vs. 0.52 for string-matching
5. ๐ Time Series QA (PATRA)
Core finding: Explicitly extracting trend and seasonality components before language alignment outperforms end-to-end approaches, especially on complex reasoning.
- Trend identification: 91% vs. 84% for next-best
- Multi-step reasoning: 67% vs. 48% for baselines
- Balanced reward mechanism (weighting harder tasks more heavily in RL) prevented easy examples from crowding out complex reasoning development
Three Overarching Themes
| Theme | Insight |
| Limits of scale | Specific capability gaps require targeted data curation and architectural choices, not just more compute |
| Evaluation catching up | New benchmarks stress-test realistic, messy usage rather than clean single-turn averages |
| Specialisation vs. flexibility | Fine-tuning and domain adaptation trade general capability for task performance โ this tension remains unsolved |
Practical Takeaways
- VLM apps requiring counting, spatial reasoning, or negation should expect failures regardless of model size โ invest in targeted data collection
- Fine-tuning pipelines should freeze Q/K attention projections and update only value matrices to preserve few-shot flexibility
- RAG systems must explicitly handle unanswerable, underspecified, and non-standalone queries โ don't assume single-turn benchmark scores reflect real usage
- Time series analytics tools benefit from classical decomposition (trend/seasonality extraction) as a preprocessing step before LLM reasoning
- Mixed-difficulty training should use balanced reward weighting to prevent easy examples from dominating gradient updates








