Roche asks:

Which organization's large language model (LLM) will be ranked first as of 30 September 2025, according to MedHELM's medical domain LLM leaderboard?

Started Jun 04, 2025 09:00PM UTC
Closed Sep 30, 2025 07:01AM UTC

MedHELM is "a comprehensive healthcare benchmark to evaluate language models on real-world clinical tasks," created by a collaboration that includes Stanford's Center for Research on Foundation Models (CRFM) (Stanford - MedHELM). The question will be suspended on 29 September 2025, and the outcome determined using the ranks reported by MedHELM at approximately 5:00 p.m. ET on 30 September 2025 (Stanford - MedHELM Leaderboard; see "Mean win rate" on the "Accuracy" tab near the top of the page, which will be used for resolution). As of 2 June 2025, DeepSeek was ranked first, with its "R1" scoring 0.663, followed by OpenAI, with its "o3-mini (2025-01-31)" scoring 0.641. In the event of a tie for first place between LLMs from different organizations, the LLM with the higher "Mean win rate" on the "Efficiency" tab will be considered first. If the named source changes the way it presents the data, further instructions will be provided.
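
In effect, the resolution rule is a sort on the Accuracy mean win rate with the Efficiency mean win rate as a tiebreaker. The sketch below is a minimal Python illustration of that rule; the accuracy figures for DeepSeek and OpenAI are the 2 June 2025 values cited above, while the efficiency figures and the third entry are hypothetical placeholders, not leaderboard data.

```python
# Illustrative sketch of the resolution rule: rank LLMs by "Mean win rate"
# on the Accuracy tab, breaking ties with the "Mean win rate" on the Efficiency tab.
# Only the two accuracy scores cited in the question text are real; the rest are placeholders.

leaderboard = [
    # (organization, model, accuracy_mean_win_rate, efficiency_mean_win_rate)
    ("DeepSeek", "R1", 0.663, 0.50),                  # accuracy as of 2 Jun 2025; efficiency hypothetical
    ("OpenAI", "o3-mini (2025-01-31)", 0.641, 0.55),  # accuracy as of 2 Jun 2025; efficiency hypothetical
    ("Anthropic", "Claude (hypothetical entry)", 0.600, 0.45),
]

# Sort descending by accuracy mean win rate, then by efficiency mean win rate to break ties.
ranked = sorted(leaderboard, key=lambda row: (row[2], row[3]), reverse=True)

first_org, first_model, acc, _ = ranked[0]
print(f"First place: {first_org} ({first_model}), accuracy mean win rate {acc}")
```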



The question closed as "DeepSeek" with a closing date of 30 September 2025.


Possible Answer                    | Correct? | Final Crowd Forecast
Anthropic (e.g., "Claude")         |          | 2%
DeepSeek                           | Yes      | 87%
Google (e.g., "Gemini")            |          | 3%
OpenAI (e.g., "o3-mini," "GPT")    |          | 6%
Another organization               |          | 2%

Crowd Forecast Profile

Participation Level
Number of Forecasters: 108 (average for questions older than 6 months: 160)
Number of Forecasts: 345 (average for questions older than 6 months: 475)
Accuracy (chart): participants in this question vs. the all-forecasters average

Most Accurate (Relative Brier Score)

1. -0.429066
2. -0.401397
3. -0.332324
4. -0.308156
5. -0.286414
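
Good Judgment scores forecasts with a multi-category Brier score: the sum, over all answer options, of the squared differences between the forecast probabilities and the outcomes (0 is perfect, 2 is worst). A negative Relative Brier Score in the list above indicates a forecaster who outscored the comparison benchmark; the platform's exact benchmark is not stated on this page, so the sketch below only illustrates the standard Brier calculation, applied to the final crowd forecast from the table above.

```python
# Brier score for the final crowd forecast, given that "DeepSeek" resolved correct.
# Multi-category Brier score: sum over answers of (probability - outcome)^2.

crowd_forecast = {
    "Anthropic": 0.02,
    "DeepSeek": 0.87,
    "Google": 0.03,
    "OpenAI": 0.06,
    "Another organization": 0.02,
}
outcome = {answer: 1.0 if answer == "DeepSeek" else 0.0 for answer in crowd_forecast}

brier = sum((crowd_forecast[a] - outcome[a]) ** 2 for a in crowd_forecast)
print(f"Crowd Brier score: {brier:.4f}")  # 0.0222; 0 is perfect, 2 is worst
```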

Recent Consensus, Probability Over Time (chart)
