In-context: June 9, 2025

Here’s a quick wrap-up of three papers we found interesting over the last few weeks, with some take-home points.

00:35 - Superhuman performance of a large language model on the reasoning tasks of a physician

06:20 - MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

11:45 - Identifying and mitigating algorithmic bias in the safety net

Some resources and papers we discuss:

Brodeur, P.G. et al. (2024). Superhuman performance of a large language model on the reasoning tasks of a physician. https://arxiv.org/abs/2412.10849

Bedi, S. et al. (2025). MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks. https://arxiv.org/abs/2505.23802

Mackin, S., Major, V.J., Chunara, R. et al. (2025). Identifying and mitigating algorithmic bias in the safety net. npj Digit. Med. 8, 335. https://doi.org/10.1038/s41746-025-01732-w

Bias Mitigation Playbook

https://medium.com/data-science/reducing-ai-bias-with-rejection-option-based-classification-54fefdb53c2e
