Ep. 10: Are benchmarks broken?
In this episode, we’re lucky to be joined by Alexandre Sallinen and Tony O’Halloran from the Laboratory for Intelligent Global Health & Humanitarian Response Technologies (LiGHT) to discuss how large language models are evaluated, including their Massive Open Online Validation & Evaluation (MOOVE) initiative.
0:25 - Technical wrap: what are agents?
13:20 - What are benchmarks?
18:20 - Automated evaluation
20:10 - Benchmarks
37:45 - Human feedback
44:50 - LLM as judge
Read more about the projects we discuss here:
Meditron
How to join the MOOVE
Listen to the LiGHTCAST, including their recent excellent outline of the HealthBench paper
Some resources and papers we discuss:
Moritz, M., Topol, E. & Rajpurkar, P. Coordinated AI agents for advancing healthcare. Nat. Biomed. Eng 9, 432–438 (2025). https://doi.org/10.1038/s41551-025-01363-2
Zou, J. & Topol, E. The rise of agentic AI teammates in medicine. The Lancet 405, 457 (2025). https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(25)00202-8/abstract
Bedi S, Liu Y, Orr-Ewing L, et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA. 2025;333(4):319–328. doi:10.1001/jama.2024.21700 https://jamanetwork.com/journals/jama/fullarticle/2825147
Arora, R.K., Wei, J., Hicks, R.S., Bowman, P., Quinonero-Candela, J., Tsimpourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., Heidecke, J., & Singhal, K. (2025). HealthBench: Evaluating Large Language Models Towards Improved Human Health. https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf
HELM: Stanford’s Holistic Evaluation of Language Models. https://crfm.stanford.edu/helm/
It’s Time to Bench the Medical Exam Benchmark. NEJM AI. https://ai.nejm.org/doi/full/10.1056/AIe2401235