Ep. 10: Are benchmarks broken?
In this episode, we’re lucky to be joined by Alexandre Sallinen and Tony O’Halloran from the Laboratory for Intelligent Global Health & Humanitarian Response Technologies (LiGHT) to discuss how large language models are evaluated, including their Massive Open Online Validation & Evaluation (MOOVE) initiative.
0:25 - Technical wrap: what are agents?
13:20 - What are benchmarks?
18:20 - Automated evaluation
20:10 - Benchmarks
37:45 - Human feedback
44:50 - LLM as judge
Read more about the projects we discuss here:
Meditron
How to join the MOOVE
Listen to the LiGHTCAST, including their recent excellent outline of the HealthBench paper
Some resources and papers we discuss:
Moritz, M., Topol, E. & Rajpurkar, P. Coordinated AI agents for advancing healthcare. Nat. Biomed. Eng 9, 432–438 (2025). https://doi.org/10.1038/s41551-025-01363-2
Zou, J. & Topol, E. The rise of agentic AI teammates in medicine. The Lancet 405, 457 (2025). https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(25)00202-8/abstract
Bedi S, Liu Y, Orr-Ewing L, et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA. 2025;333(4):319–328. doi:10.1001/jama.2024.21700 https://jamanetwork.com/journals/jama/fullarticle/2825147
Arora, R.K., Wei, J., Hicks, R.S., Bowman, P., Quinonero-Candela, J., Tsimpourlas, F., Sharman, M., Shah, M., Vallone, A., Beutel, A., Heidecke, J., & Singhal, K. (2025). HealthBench: Evaluating Large Language Models Towards Improved Human Health. https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf
HELM: Stanford’s Holistic Evaluation of Language Models. https://crfm.stanford.edu/helm/
It’s Time to Bench the Medical Exam Benchmark. NEJM AI. https://ai.nejm.org/doi/full/10.1056/AIe2401235