MedS-Bench

Leaderboard for Large Language Models in Medicine

What Is MedS-Bench

MedS-Bench is a comprehensive medical evaluation benchmark designed to assess the capabilities of Large Language Models (LLMs) in the medical field beyond multiple-choice questions. It covers 11 task categories and integrates 39 existing datasets. For each dataset, MedS-Bench transforms the data into a format suitable for LLMs by manually writing clear task definitions, known as instructions, that guide the models in understanding and responding to the medical data. Our leaderboard provides a comparative analysis for evaluating the versatility and depth of LLMs in handling complex, domain-specific challenges in medicine.
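For illustration, an instruction-formatted sample might look like the sketch below. This is only a hypothetical example; the exact schema and field names used in MedS-Bench may differ.

```python
# Hypothetical instruction-formatted sample; the actual MedS-Bench schema
# and field names may differ.
sample = {
    "instruction": "Given a clinical note, extract all drug names mentioned in the text.",
    "input": "The patient was started on metformin and lisinopril after admission.",
    "output": "metformin, lisinopril",
}
```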

For more details about MedS-Bench, please refer to this paper.

Submission

We invite you to submit your model's results to our Leaderboard. The evaluation code is now available for your reference.


You can either submit your model for our team to evaluate, or simply provide a CSV result file.
For detailed instructions, please refer to the submission tutorial.
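If you submit a CSV result file, a minimal sketch of one possible layout is shown below, assuming one row per (task, dataset, metric, score). The column names here are assumptions for illustration only; the submission tutorial defines the authoritative format.

```python
import csv

# Hypothetical result-file layout; follow the submission tutorial for the
# official column names and required fields.
rows = [
    {"task": "MCQA", "dataset": "MedQA", "metric": "Accuracy", "score": 63.6},
    {"task": "Text Summarization", "dataset": "MIMIC-CXR", "metric": "ROUGE", "score": 57.64},
]

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["task", "dataset", "metric", "score"])
    writer.writeheader()
    writer.writerows(rows)
```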


We warmly welcome volunteers who would like to contribute data to our dataset. Please check the following guideline to see how to contribute; we will acknowledge every contributor in writing in the next update of our paper!

Overview of Model Evaluation

The table below shows each model's average performance on every benchmark task. The metrics used are Accuracy or BLEU/ROUGE, depending on the task. Note that the last two tasks (Fact Verification and NLI) do not appear in this table because they use several different metrics, so an average score cannot be computed.

| Method | MCQA | Text Summarization | Information Extraction | Concept Explanation | Rationale | NER | Diagnosis | Treatment Planning | Clinical Outcome Prediction | Text Classification (F1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 75.3 | 24.46/25.66 | 76.92 | 19.37/21.58 | - | 59.52 | 58.13 | 84.73 | 59.80 | 68.06 |
| Claude-3.5 | - | 26.29/27.36 | 79.41 | 12.56/16.75 | 46.26/36.97 | 25.56 | 60.24 | 92.93 | 64.08 | 66.74 |
| MEDITRON | 54.9 | 7.15/15.54 | 67.52 | 8.51/18.90 | 29.01/25.86 | - | 29.53 | 68.27 | 50.14 | 23.70 |
| InternLM 2 | - | 13.87/17.79 | 79.11 | 11.53/17.01 | 35.65/32.04 | 45.69 | 35.20 | 62.33 | 55.58 | 31.09 |
| Mistral | 49.1 | 24.48/24.90 | 70.18 | 13.53/17.37 | 38.14/32.28 | 16.53 | 34.80 | 38.93 | 50.14 | 48.73 |
| Llama 3 | 62.2 | 22.20/23.08 | 72.17 | 13.51/17.92 | 29.09/25.06 | 23.62 | 33.73 | 56.07 | 19.05 | 38.37 |
| MMedIns-Llama 3 | 63.9 | 46.82/48.38 | 83.77 | 34.43/37.47 | 46.90/34.54 | 79.29 | 97.53 | 98.47 | 63.35 | 86.66 |

Multilingual Multiple-choice Question Answering (Accuracy)

This table presents a comprehensive comparison of various models on widely used multiple-choice question benchmarks in the medical domain.

| Method | MedQA | MedMCQA | PubMedQA | MMedBench (ZH) | MMedBench (JA) | MMedBench (FR) | MMedBench (RU) | MMedBench (ES) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5 | 57.7 | 72.7 | 53.8 | 52.3 | 34.6 | 32.5 | 66.4 | 66.1 | 54.5 |
| GPT-4 | 85.8 | 72.3 | 70.0 | 75.1 | 72.9 | 56.6 | 83.6 | 85.7 | 75.3 |
| MEDITRON | 47.9 | 59.2 | 74.4 | 61.9 | 40.2 | 35.1 | 67.6 | 53.3 | 54.9 |
| InternLM 2 | - | - | - | 77.6 | 47.7 | 41.0 | 68.4 | 59.6 | - |
| Mistral | 50.8 | 48.2 | 75.4 | 71.1 | 44.7 | 48.7 | 74.2 | 63.9 | 49.1 |
| Llama 3 | 60.9 | 50.7 | 73.0 | 78.2 | 48.2 | 50.8 | 71.5 | 64.2 | 62.2 |
| MMedIns-Llama 3 | 63.6 | 57.1 | 78.2 | 78.6 | 54.3 | 46.0 | 72.3 | 61.2 | 63.9 |

Text Summarization (BLEU/ROUGE)

This table showcases the performance of various models in text summarization tasks across different datasets, measured by BLEU and ROUGE metrics.

| Method | MedQSum | RCT-Text | MIMIC-CXR | MIMIC-IV (Ultrasound) | MIMIC-IV (CT) | MIMIC-IV (MRI) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 25.06/27.30 | 34.32/31.09 | 27.26/29.71 | 11.17/14.53 | 23.97/29.52 | 25.76/32.06 | 24.46/25.66 |
| Claude-3.5 | 21.14/25.06 | 41.02/36.16 | 27.76/29.93 | 15.24/18.28 | 21.98/26.38 | 26.43/31.05 | 26.29/27.36 |
| MEDITRON | 15.64/23.14 | -/16.44 | -/16.50 | -/6.07 | 16.30/23.93 | 20.11/27.98 | -/15.54 |
| InternLM 2 | 15.69/21.63 | 14.48/15.16 | 11.83/13.41 | 13.48/20.96 | 20.88/27.82 | 23.43/31.40 | 13.87/17.79 |
| Mistral | 23.49/26.03 | 27.24/26.13 | 22.09/24.71 | 25.09/22.72 | 27.60/30.77 | 29.87/31.81 | 24.48/24.90 |
| Llama 3 | 22.45/25.08 | 15.38/14.60 | 32.92/32.64 | 18.06/20.00 | 24.47/29.35 | 24.82/30.50 | 22.20/23.08 |
| MMedIns-Llama 3 | 54.16/56.95 | 57.82/55.60 | 54.91/57.64 | 20.40/23.32 | 42.18/46.46 | 40.53/43.38 | 46.82/48.38 |

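For reference, BLEU and ROUGE scores such as those above are typically computed with standard packages. The snippet below is a minimal sketch using the sacrebleu and rouge-score libraries; the library choice and score scaling here are assumptions, not necessarily the exact evaluation code behind this leaderboard.

```python
# pip install sacrebleu rouge-score
import sacrebleu
from rouge_score import rouge_scorer

predictions = ["no acute cardiopulmonary abnormality"]
references = ["no acute cardiopulmonary process"]

# Corpus-level BLEU (sacrebleu reports scores on a 0-100 scale).
bleu = sacrebleu.corpus_bleu(predictions, [references]).score

# ROUGE-L F-measure averaged over the corpus, rescaled to 0-100.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = 100 * sum(
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
) / len(predictions)

print(f"BLEU: {bleu:.2f}  ROUGE-L: {rouge_l:.2f}")
```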
Information Extraction (Accuracy)

This table showcases the performance of various models on information extraction tasks, using accuracy as the metric. In the table, "Ext." stands for extraction and "Info." for information.

| Method | PICO Participant Ext. | PICO Intervention Ext. | PICO Outcome Ext. | ADE Drug Dose Ext. | PMC-Patient Basic Info. Ext. | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 67.44 | 62.79 | 65.12 | 91.30 | 97.93 | 76.92 |
| Claude-3.5 | 65.12 | 76.74 | 60.47 | 95.65 | 99.07 | 79.41 |
| MEDITRON | 72.09 | 46.51 | 51.16 | 95.65 | 72.20 | 67.52 |
| InternLM 2 | 72.09 | 74.42 | 69.77 | 95.65 | 83.60 | 79.11 |
| Llama 3 | 58.14 | 79.07 | 58.14 | 69.57 | 95.93 | 72.17 |
| Mistral | 60.47 | 65.12 | 48.84 | 91.30 | 85.20 | 70.18 |
| MMedIns-Llama 3 | 83.72 | 79.07 | 62.79 | 95.65 | 97.60 | 83.77 |

Concept Explanation (BLEU/ROUGE)

This table provides the BLEU and ROUGE metrics for various language models on medical concept explanation tasks.

| Method | Health Fact Exp. | Do Entity Exp. | BioLORD Concept Exp. | Avg. |
| --- | --- | --- | --- | --- |
| GPT-4 | 18.63/20.80 | 19.14/21.14 | 20.33/22.80 | 19.37/21.58 |
| Claude-3.5 | 14.96/18.48 | 8.75/13.28 | 13.95/18.49 | 12.56/16.75 |
| MEDITRON | 6.09/8.65 | 7.68/25.39 | 11.76/22.66 | 8.51/18.90 |
| InternLM 2 | 22.36/27.01 | 5.28/10.39 | 6.95/13.62 | 11.53/17.01 |
| Llama 3 | 16.79/20.32 | 14.88/18.84 | 8.87/14.61 | 13.51/17.92 |
| Mistral | 18.11/21.31 | 9.21/14.11 | 13.27/16.68 | 13.53/17.37 |
| MMedIns-Llama 3 | 30.50/28.53 | 34.66/39.99 | 38.12/43.90 | 34.43/37.47 |

Rationale (BLEU/ROUGE)

This table evaluates various language models using BLEU and ROUGE metrics on rationale generation tasks.

| Method | MMedBench (ZH) | MMedBench (EN) | MMedBench (FR) | MMedBench (JA) | MMedBench (RU) | MMedBench (ES) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude-3.5 | 44.64/34.63 | 47.07/38.67 | 48.93/41.23 | 49.22/39.15 | 38.90/28.17 | 48.80/39.99 | 46.26/36.97 |
| MEDITRON | 20.39/21.79 | 38.42/31.24 | 34.43/29.33 | 18.89/24.98 | 24.32/16.77 | 37.64/31.01 | 29.01/25.86 |
| InternLM 2 | 35.23/30.77 | 44.12/37.39 | 36.10/33.65 | 29.13/33.15 | 27.43/20.99 | 41.87/36.30 | 35.65/32.04 |
| Llama 3 | 28.51/23.30 | 44.10/39.26 | 24.92/22.24 | 13.46/15.04 | 31.16/22.85 | 32.37/27.70 | 29.09/25.06 |
| Mistral | 35.53/28.91 | 47.20/37.88 | 39.53/35.64 | 29.16/28.96 | 32.15/23.99 | 45.27/38.33 | 38.14/32.28 |
| MMedIns-Llama 3 | 50.27/34.01 | 49.08/38.19 | 46.93/38.73 | 51.74/35.19 | 35.27/23.81 | 48.15/37.35 | 46.90/34.54 |

Named Entity Recognition (F1)

This table displays the F1-Score metrics for various language models across multiple NER tasks, highlighting their ability to recognize chemical, disease, and organism entities.

| Method | BC4Chem | BC5Chem | BC5Disease | Species800 | Avg. |
| --- | --- | --- | --- | --- | --- |
| GPT-4 | 54.84 | 67.62 | 53.20 | 62.43 | 59.52 |
| Claude-3.5 | 22.98 | 40.77 | 24.05 | 14.45 | 25.56 |
| MEDITRON | - | - | - | - | - |
| InternLM 2 | 41.21 | 41.51 | 37.11 | 62.93 | 45.69 |
| Llama 3 | 19.45 | 37.83 | 25.30 | 11.90 | 23.62 |
| Mistral | 15.56 | 32.09 | 12.17 | 6.31 | 16.53 |
| MMedIns-Llama 3 | 90.78 | 91.25 | 54.26 | 80.87 | 79.29 |

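As background on the metric, entity-level F1 compares the set of predicted entity mentions against the gold set. The function below is a minimal sketch with exact string matching; it is not the official evaluation script and ignores details such as span offsets and normalization.

```python
# Minimal sketch of entity-level F1 with exact matching; the official
# MedS-Bench evaluation may handle spans and normalization differently.
def entity_f1(pred_entities, gold_entities):
    pred, gold = set(pred_entities), set(gold_entities)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(entity_f1({"aspirin", "ibuprofen"}, {"aspirin", "warfarin"}))  # 0.5
```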
Diagnosis (Accuracy)

This table showcases the performance of various models in diagnosis. The DDXPlus dataset provides a predefined list of diseases, from which the model must select one based on the provided patient context. Accuracy is used as the metric.

| Method | DDXPlus |
| --- | --- |
| GPT-4 | 58.13 |
| Claude-3.5 | 60.24 |
| MEDITRON | 29.53 |
| InternLM 2 | 35.20 |
| Llama 3 | 33.73 |
| Mistral | 34.80 |
| MMedIns-Llama 3 | 97.53 |

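To make the task setup concrete, a constrained-choice prompt for this kind of diagnosis task could be assembled as sketched below. The candidate list and instruction wording are illustrative assumptions, not the exact prompt used in MedS-Bench.

```python
# Hypothetical prompt construction for a DDXPlus-style diagnosis query;
# the candidate subset and wording are illustrative only.
candidate_diseases = ["Bronchitis", "Pneumonia", "URTI", "Tuberculosis"]
patient_context = "45-year-old with productive cough, fever, and pleuritic chest pain."

prompt = (
    "Based on the patient information, select the single most likely diagnosis "
    f"from the following list: {', '.join(candidate_diseases)}.\n\n"
    f"Patient: {patient_context}\nAnswer:"
)

# Accuracy is the fraction of cases where the model's choice matches the label.
```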
Treatment Planning (Accuracy)

This table showcases the performance of various models in treatment planning. In the SEER dataset, treatment recommendations are categorized into eight high-level types. Accuracy is used as the metric.

| Method | SEER |
| --- | --- |
| GPT-4 | 84.73 |
| Claude-3.5 | 92.93 |
| MEDITRON | 68.27 |
| InternLM 2 | 62.33 |
| Llama 3 | 56.07 |
| Mistral | 38.93 |
| MMedIns-Llama 3 | 98.47 |

Clinical Outcome Prediction (Accuracy)

This table showcases the performance of various models in clinical outcome prediction across different datasets.

| Method | MIMIC4ED (Hospitalization) | MIMIC4ED (72h ED Revisit) | MIMIC4ED (Critical Triage) |
| --- | --- | --- | --- |
| GPT-4 | 61.20 | 58.07 | 60.13 |
| Claude-3.5 | 65.80 | 57.91 | 68.53 |
| MEDITRON | 56.27 | 48.47 | 45.67 |
| InternLM 2 | 58.80 | 55.13 | 52.80 |
| Llama 3 | 39.07 | 9.27 | 8.80 |
| Mistral | 56.27 | 48.47 | 45.67 |
| MMedIns-Llama 3 | 74.20 | 52.73 | 63.13 |

Text Classification (P, R, F1)

This table showcases the performance of various models in text classification. The dataset used is HoC, which is designed for multi-label classification. The metrics used are precision, recall, and F1 score.

| Method | HoC Precision | HoC Recall | HoC F1 |
| --- | --- | --- | --- |
| GPT-4 | 61.07 | 80.23 | 68.06 |
| Claude-3.5 | 58.43 | 79.84 | 66.74 |
| MEDITRON | 19.61 | 34.61 | 23.70 |
| InternLM 2 | 20.65 | 82.24 | 31.09 |
| Llama 3 | 32.40 | 52.03 | 38.37 |
| Mistral | 40.39 | 64.11 | 48.73 |
| MMedIns-Llama 3 | 89.59 | 85.58 | 86.66 |

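Because HoC is multi-label, precision, recall, and F1 are computed over label sets rather than single labels. The sketch below shows one common example-based averaging scheme; the exact averaging used for this leaderboard may differ.

```python
# Example-based precision/recall/F1 for multi-label classification;
# the averaging scheme used for HoC in MedS-Bench may differ.
def multilabel_prf(pred_labels, gold_labels):
    p_list, r_list, f_list = [], [], []
    for pred, gold in zip(pred_labels, gold_labels):
        pred, gold = set(pred), set(gold)
        tp = len(pred & gold)
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        p_list.append(p); r_list.append(r); f_list.append(f)
    n = len(p_list)
    return sum(p_list) / n, sum(r_list) / n, sum(f_list) / n

print(multilabel_prf(
    [["evading growth suppressors", "sustaining proliferative signaling"]],
    [["evading growth suppressors"]],
))  # (0.5, 1.0, 0.666...)
```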
Fact Verification (Accuracy, BLEU/ROUGE)

This table details the performance of various models on fact verification. For PubMedQA Answer Verification and HealthFact Verification, the LLM is tasked with selecting a single answer from a list of provided candidates, with accuracy employed as the evaluation metric. Conversely, for EBMS Justification Verification, which requires the LLM to generate free-form text, performance is assessed using BLEU and ROUGE metrics.

| Method | PubMedQA Answer Ver. | PUBLICHEALTH Health Fact Ver. | EBMS Justification Ver. |
| --- | --- | --- | --- |
| GPT-4 | 66.15 | 78.60 | 16.28/16.27 |
| Claude-3.5 | 11.54 | 62.04 | 14.77/16.45 |
| MEDITRON | 25.23 | 32.66 | 11.58/15.78 |
| InternLM 2 | 99.23 | 76.94 | 8.75/14.69 |
| Llama 3 | 94.77 | 63.89 | 16.52/16.49 |
| Mistral | 57.38 | 69.78 | 15.98/16.43 |
| MMedIns-Llama 3 | 97.08 | 79.55 | 12.71/14.65 |

Natural Language Inference (Accuracy, BLEU/ROUGE)

The results are measured with accuracy for the discriminative task (selecting the right answer from a list of candidates) and with BLEU/ROUGE for the generative task (generating free-form text answers).

| Method | MedNLI Discriminative Task | MedNLI Generative Task |
| --- | --- | --- |
| GPT-4 | 86.63 | 27.09/23.71 |
| Claude-3.5 | 82.14 | 17.80/20.02 |
| MEDITRON | 60.83 | 4.42/14.08 |
| InternLM 2 | 84.67 | 15.84/19.01 |
| Llama 3 | 63.85 | 21.31/22.75 |
| Mistral | 71.59 | 13.03/15.47 |
| MMedIns-Llama 3 | 86.71 | 23.52/25.17 |