Models

These analyses were not designed to provide direct comparisons between models, and any figures here should be taken with an entire bucket of salt because they are not like-for-like. That said, the most general observation, that search grounding substantially improved scores on these questions, is likely supportable, as is the observation that even recent models struggled without search grounding.

Please note also that there is a mismatch between the scores shown here (which were computed at analysis time by different models) and the scores on the individual report pages, which were recomputed after the fact using different criteria and a more consistent system of classification. In other words, this is a mess that is good for raising questions but less useful for answering them. Those questions will be answered later by more focused runs.

Model                        Search  Reports  Avg Score  Avg Errors
Gemini 3.0 Flash + Search    Yes        1313       3.68        0.38
Gemini 2.0 Flash + Search    Yes           3       6.33        0.33
Gemini 2.5 Flash + Search    Yes         114       6.96        0.69
Claude 4.5 Haiku + Search    Yes          30       7.73        0.77
Claude 4 Sonnet + Search     Yes          21       8.62        0.67
Gemini 3.0 Flash             No          330       9.5         1.16
Gemini 2.0 Flash             No           87      17.01        1.54
Gemini 2.5 Flash             No           49      22.27        2.41
Claude 4.5 Sonnet + Search   Yes           2      26.0         1.0
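
As a rough illustration of the "search versus no search" observation, the per-model averages in the table can be pooled into one report-weighted average per group. This is only a sketch: the row values are copied from the table, while the names `ROWS` and `weighted_avg_by_search` are made up for this example and are not part of the original analysis. Weighting by report count is an assumption about how to pool the rows, and the result inherits every caveat above about the runs not being like-for-like.

```python
from collections import defaultdict

# Rows copied from the table: (model, search, reports, avg_score).
ROWS = [
    ("Gemini 3.0 Flash + Search", "Yes", 1313, 3.68),
    ("Gemini 2.0 Flash + Search", "Yes", 3, 6.33),
    ("Gemini 2.5 Flash + Search", "Yes", 114, 6.96),
    ("Claude 4.5 Haiku + Search", "Yes", 30, 7.73),
    ("Claude 4 Sonnet + Search", "Yes", 21, 8.62),
    ("Gemini 3.0 Flash", "No", 330, 9.5),
    ("Gemini 2.0 Flash", "No", 87, 17.01),
    ("Gemini 2.5 Flash", "No", 49, 22.27),
    ("Claude 4.5 Sonnet + Search", "Yes", 2, 26.0),
]


def weighted_avg_by_search(rows):
    """Pool per-model average scores into one report-weighted average per group.

    Hypothetical helper for illustration: each model's average score is
    weighted by its number of reports, then divided by the group's total
    report count.
    """
    totals = defaultdict(lambda: [0.0, 0])  # search flag -> [score * reports, reports]
    for _model, search, reports, avg_score in rows:
        totals[search][0] += avg_score * reports
        totals[search][1] += reports
    return {search: total / count for search, (total, count) in totals.items()}


if __name__ == "__main__":
    for search, avg in weighted_avg_by_search(ROWS).items():
        print(f"search={search}: weighted avg score {avg:.2f}")
```

Because the search-grounded group is dominated by the 1313 Gemini 3.0 Flash reports, the pooled figure mostly reflects that one model. Rows with very few reports (for example the 2 reports for Claude 4.5 Sonnet + Search) barely move it.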