AI Models
Models
Analyses were not run to provide direct comparisons between models, and any figures here should be taken with an entire bucket of salt because they are not like-for-like. That said, the most general observation, that search grounding significantly improved scores on these questions, is likely supportable, as is the observation that even recent models struggled without search grounding.
Please also note that there is a mismatch between the scores shown here (which were computed at analysis time by different models) and the scores on the individual report pages, which were recomputed after the fact using different criteria and a more consistent system of classification. In other words, this is a mess that is good for raising questions but not as useful for answering them. Those questions will be answered later by more focused runs.
| Model | Search | Reports | Avg Score | Avg Errors |
|---|---|---|---|---|
| Gemini 3.0 Flash + Search | Yes | 1313 | 3.68 | 0.38 |
| Gemini 2.0 Flash + Search | Yes | 3 | 6.33 | 0.33 |
| Gemini 2.5 Flash + Search | Yes | 114 | 6.96 | 0.69 |
| Claude 4.5 Haiku + Search | Yes | 30 | 7.73 | 0.77 |
| Claude 4 Sonnet + Search | Yes | 21 | 8.62 | 0.67 |
| Gemini 3.0 Flash | No | 330 | 9.5 | 1.16 |
| Gemini 2.0 Flash | No | 87 | 17.01 | 1.54 |
| Gemini 2.5 Flash | No | 49 | 22.27 | 2.41 |
| Claude 4.5 Sonnet + Search | Yes | 2 | 26.0 | 1.0 |
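
As a rough illustration of the search-grounding observation, the rows above can be pooled into a report-weighted average for each group. This is only a sketch: the row tuples are copied by hand from the table, the names `ROWS` and `pooled_mean` are made up for this example, and it assumes a lower score is better (inferred from the text, since the search-grounded rows mostly have the lower scores). The pooled figures inherit every caveat above and are dominated by the 1313 Gemini 3.0 Flash + Search reports, so they are not a like-for-like comparison.

```python
# Rows copied by hand from the table: (model, search_grounded, reports, avg_score)
ROWS = [
    ("Gemini 3.0 Flash + Search", True, 1313, 3.68),
    ("Gemini 2.0 Flash + Search", True, 3, 6.33),
    ("Gemini 2.5 Flash + Search", True, 114, 6.96),
    ("Claude 4.5 Haiku + Search", True, 30, 7.73),
    ("Claude 4 Sonnet + Search", True, 21, 8.62),
    ("Gemini 3.0 Flash", False, 330, 9.5),
    ("Gemini 2.0 Flash", False, 87, 17.01),
    ("Gemini 2.5 Flash", False, 49, 22.27),
    ("Claude 4.5 Sonnet + Search", True, 2, 26.0),
]


def pooled_mean(rows, search):
    """Return (report-weighted mean score, total reports) for one search setting."""
    selected = [(n, score) for _, flag, n, score in rows if flag == search]
    total = sum(n for n, _ in selected)
    weighted = sum(n * score for n, score in selected)
    return weighted / total, total


with_search, n_with = pooled_mean(ROWS, True)
without_search, n_without = pooled_mean(ROWS, False)

print(f"With search:    {with_search:.2f} over {n_with} reports")
print(f"Without search: {without_search:.2f} over {n_without} reports")
```

Weighting by report count keeps the two- and three-report rows from swinging the result, but it also means each pooled number mostly reflects whichever model contributed the most reports.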