MedS-Bench

Leaderboard for Large Language Models in Medicine

What Is MedS-Bench

MedS-Bench is a comprehensive medical evaluation benchmark designed to assess the capabilities of Large Language Models (LLMs) in the medical field beyond multiple-choice questions. It covers 11 task categories and integrates 39 existing datasets. For each dataset, MedS-Bench transforms the data into a format suitable for LLMs by manually writing clear task definitions, known as instructions, that guide the models in understanding and responding to the medical data. Our leaderboard provides a comparative analysis for evaluating the versatility and depth of LLMs in handling complex, domain-specific challenges in medicine.
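For illustration, an instruction-formatted sample might look like the sketch below. This is only a hypothetical example; the exact schema and field names used in MedS-Bench may differ.

```python
# Hypothetical instruction-formatted sample; the actual MedS-Bench schema
# and field names may differ.
sample = {
    "instruction": "Given a clinical note, extract all drug names mentioned in the text.",
    "input": "The patient was started on metformin and lisinopril after admission.",
    "output": "metformin, lisinopril",
}
```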

For more details about MedS-Bench, please refer to this paper.

Submission

We invite you to submit your model's results to our Leaderboard. The evaluation code is now available for your reference.


You can either submit your model for our team to evaluate, or simply provide a CSV result file.
For detailed instructions, please refer to the submission tutorial.
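If you submit a CSV result file, a minimal sketch of one possible layout is shown below, assuming one row per (task, dataset, metric, score). The column names here are assumptions for illustration only; the submission tutorial defines the authoritative format.

```python
import csv

# Hypothetical result-file layout; follow the submission tutorial for the
# official column names and required fields.
rows = [
    {"task": "MCQA", "dataset": "MedQA", "metric": "Accuracy", "score": 63.6},
    {"task": "Text Summarization", "dataset": "MIMIC-CXR", "metric": "ROUGE", "score": 57.64},
]

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["task", "dataset", "metric", "score"])
    writer.writeheader()
    writer.writerows(rows)
```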


We warmly welcome volunteers who would like to contribute data to our dataset. Please check the following guideline to see how to contribute; we will acknowledge every contributor in writing in the next update of our paper!

Overview of Model Evaluation

The table below shows each model's average performance on every benchmark task. The metrics used are Accuracy or BLEU/ROUGE, depending on the task. Note that the last two tasks (Fact Verification and NLI) do not appear in this table because they use several different metrics, so an average score cannot be computed.

| Method | MCQA | Text Summarization | Information Extraction | Concept Explanation | Rationale | NER | Diagnosis | Treatment Planning | Clinical Outcome Prediction | Text Classification (F1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 75.3 | 24.46/25.66 | 76.92 | 19.37/21.58 | - | 59.52 | 58.13 | 84.73 | 59.80 | 68.06 |
| Claude-3.5 | - | 26.29/27.36 | 79.41 | 12.56/16.75 | 46.26/36.97 | 25.56 | 60.24 | 92.93 | 64.08 | 66.74 |
| MEDITRON | 54.9 | 7.15/15.54 | 67.52 | 8.51/18.90 | 29.01/25.86 | - | 29.53 | 68.27 | 50.14 | 23.70 |
| InternLM 2 | - | 13.87/17.79 | 79.11 | 11.53/17.01 | 35.65/32.04 | 45.69 | 35.20 | 62.33 | 55.58 | 31.09 |
| Mistral | 49.1 | 24.48/24.90 | 70.18 | 13.53/17.37 | 38.14/32.28 | 16.53 | 34.80 | 38.93 | 50.14 | 48.73 |
| Llama 3 | 62.2 | 22.20/23.08 | 72.17 | 13.51/17.92 | 29.09/25.06 | 23.62 | 33.73 | 56.07 | 19.05 | 38.37 |
| MMedIns-Llama 3 | 63.9 | 46.82/48.38 | 83.77 | 34.43/37.47 | 46.90/34.54 | 79.29 | 97.53 | 98.47 | 63.35 | 86.66 |

Multilingual Multiple-choice Question Answering (Accuracy)

This table presents a comprehensive comparison of various models on widely used multiple-choice question benchmarks in the medical domain.

| Method | MedQA | MedMCQA | PubMedQA | MMedBench (ZH) | MMedBench (JA) | MMedBench (FR) | MMedBench (RU) | MMedBench (ES) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3.5 | 57.7 | 72.7 | 53.8 | 52.3 | 34.6 | 32.5 | 66.4 | 66.1 | 54.5 |
| GPT-4 | 85.8 | 72.3 | 70.0 | 75.1 | 72.9 | 56.6 | 83.6 | 85.7 | 75.3 |
| MEDITRON | 47.9 | 59.2 | 74.4 | 61.9 | 40.2 | 35.1 | 67.6 | 53.3 | 54.9 |
| InternLM 2 | - | - | - | 77.6 | 47.7 | 41.0 | 68.4 | 59.6 | - |
| Mistral | 50.8 | 48.2 | 75.4 | 71.1 | 44.7 | 48.7 | 74.2 | 63.9 | 49.1 |
| Llama 3 | 60.9 | 50.7 | 73.0 | 78.2 | 48.2 | 50.8 | 71.5 | 64.2 | 62.2 |
| MMedIns-Llama 3 | 63.6 | 57.1 | 78.2 | 78.6 | 54.3 | 46.0 | 72.3 | 61.2 | 63.9 |

Text Summarization (BLEU/ROUGE)

This table showcases the performance of various models in text summarization tasks across different datasets, measured by BLEU and ROUGE metrics.

| Method | MedQSum | RCT-Text | MIMIC-CXR | MIMIC-IV (Ultrasound) | MIMIC-IV (CT) | MIMIC-IV (MRI) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 25.06/27.30 | 34.32/31.09 | 27.26/29.71 | 11.17/14.53 | 23.97/29.52 | 25.76/32.06 | 24.46/25.66 |
| Claude-3.5 | 21.14/25.06 | 41.02/36.16 | 27.76/29.93 | 15.24/18.28 | 21.98/26.38 | 26.43/31.05 | 26.29/27.36 |
| MEDITRON | 15.64/23.14 | -/16.44 | -/16.50 | -/6.07 | 16.30/23.93 | 20.11/27.98 | -/15.54 |
| InternLM 2 | 15.69/21.63 | 14.48/15.16 | 11.83/13.41 | 13.48/20.96 | 20.88/27.82 | 23.43/31.40 | 13.87/17.79 |
| Mistral | 23.49/26.03 | 27.24/26.13 | 22.09/24.71 | 25.09/22.72 | 27.60/30.77 | 29.87/31.81 | 24.48/24.90 |
| Llama 3 | 22.45/25.08 | 15.38/14.60 | 32.92/32.64 | 18.06/20.00 | 24.47/29.35 | 24.82/30.50 | 22.20/23.08 |
| MMedIns-Llama 3 | 54.16/56.95 | 57.82/55.60 | 54.91/57.64 | 20.40/23.32 | 42.18/46.46 | 40.53/43.38 | 46.82/48.38 |

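For reference, BLEU and ROUGE scores such as those above are typically computed with standard packages. The snippet below is a minimal sketch using the sacrebleu and rouge-score libraries; the library choice and score scaling here are assumptions, not necessarily the exact evaluation code behind this leaderboard.

```python
# pip install sacrebleu rouge-score
import sacrebleu
from rouge_score import rouge_scorer

predictions = ["no acute cardiopulmonary abnormality"]
references = ["no acute cardiopulmonary process"]

# Corpus-level BLEU (sacrebleu reports scores on a 0-100 scale).
bleu = sacrebleu.corpus_bleu(predictions, [references]).score

# ROUGE-L F-measure averaged over the corpus, rescaled to 0-100.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = 100 * sum(
    scorer.score(ref, pred)["rougeL"].fmeasure
    for ref, pred in zip(references, predictions)
) / len(predictions)

print(f"BLEU: {bleu:.2f}  ROUGE-L: {rouge_l:.2f}")
```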
Information Extraction (Accuracy)

This table showcases the performance of various models on information extraction tasks, using accuracy as the metric. In the table, "Ext." stands for extraction and "Info." for information.

| Method | PICO Participant Ext. | PICO Intervention Ext. | PICO Outcome Ext. | ADE Drug Dose Ext. | PMC-Patient Basic Info. Ext. | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 67.44 | 62.79 | 65.12 | 91.30 | 97.93 | 76.92 |
| Claude-3.5 | 65.12 | 76.74 | 60.47 | 95.65 | 99.07 | 79.41 |
| MEDITRON | 72.09 | 46.51 | 51.16 | 95.65 | 72.20 | 67.52 |
| InternLM 2 | 72.09 | 74.42 | 69.77 | 95.65 | 83.60 | 79.11 |
| Llama 3 | 58.14 | 79.07 | 58.14 | 69.57 | 95.93 | 72.17 |
| Mistral | 60.47 | 65.12 | 48.84 | 91.30 | 85.20 | 70.18 |
| MMedIns-Llama 3 | 83.72 | 79.07 | 62.79 | 95.65 | 97.60 | 83.77 |

Concept Explanation (BLEU/ROUGE)

This table provides the BLEU and ROUGE metrics for various language models on medical concept explanation tasks.

| Method | Health Fact Exp. | Do Entity Exp. | BioLORD Concept Exp. | Avg. |
| --- | --- | --- | --- | --- |
| GPT-4 | 18.63/20.80 | 19.14/21.14 | 20.33/22.80 | 19.37/21.58 |
| Claude-3.5 | 14.96/18.48 | 8.75/13.28 | 13.95/18.49 | 12.56/16.75 |
| MEDITRON | 6.09/8.65 | 7.68/25.39 | 11.76/22.66 | 8.51/18.90 |
| InternLM 2 | 22.36/27.01 | 5.28/10.39 | 6.95/13.62 | 11.53/17.01 |
| Llama 3 | 16.79/20.32 | 14.88/18.84 | 8.87/14.61 | 13.51/17.92 |
| Mistral | 18.11/21.31 | 9.21/14.11 | 13.27/16.68 | 13.53/17.37 |
| MMedIns-Llama 3 | 30.50/28.53 | 34.66/39.99 | 38.12/43.90 | 34.43/37.47 |

Rationale (BLEU/ROUGE)

This table evaluates various language models using BLEU and ROUGE metrics on rationale generation tasks.

| Method | MMedBench (ZH) | MMedBench (EN) | MMedBench (FR) | MMedBench (JA) | MMedBench (RU) | MMedBench (ES) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude-3.5 | 44.64/34.63 | 47.07/38.67 | 48.93/41.23 | 49.22/39.15 | 38.90/28.17 | 48.80/39.99 | 46.26/36.97 |
| MEDITRON | 20.39/21.79 | 38.42/31.24 | 34.43/29.33 | 18.89/24.98 | 24.32/16.77 | 37.64/31.01 | 29.01/25.86 |
| InternLM 2 | 35.23/30.77 | 44.12/37.39 | 36.10/33.65 | 29.13/33.15 | 27.43/20.99 | 41.87/36.30 | 35.65/32.04 |
| Llama 3 | 28.51/23.30 | 44.10/39.26 | 24.92/22.24 | 13.46/15.04 | 31.16/22.85 | 32.37/27.70 | 29.09/25.06 |
| Mistral | 35.53/28.91 | 47.20/37.88 | 39.53/35.64 | 29.16/28.96 | 32.15/23.99 | 45.27/38.33 | 38.14/32.28 |
| MMedIns-Llama 3 | 50.27/34.01 | 49.08/38.19 | 46.93/38.73 | 51.74/35.19 | 35.27/23.81 | 48.15/37.35 | 46.90/34.54 |

Named Entity Recognition (F1)

This table displays the F1-Score metrics for various language models across multiple NER tasks, highlighting their ability to recognize chemical, disease, and organism entities.

| Method | BC4Chem | BC5Chem | BC5Disease | Species800 | Avg. |
| --- | --- | --- | --- | --- | --- |
| GPT-4 | 54.84 | 67.62 | 53.20 | 62.43 | 59.52 |
| Claude-3.5 | 22.98 | 40.77 | 24.05 | 14.45 | 25.56 |
| MEDITRON | - | - | - | - | - |
| InternLM 2 | 41.21 | 41.51 | 37.11 | 62.93 | 45.69 |
| Llama 3 | 19.45 | 37.83 | 25.30 | 11.90 | 23.62 |
| Mistral | 15.56 | 32.09 | 12.17 | 6.31 | 16.53 |
| MMedIns-Llama 3 | 90.78 | 91.25 | 54.26 | 80.87 | 79.29 |

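As background on the metric, entity-level F1 compares the set of predicted entity mentions against the gold set. The function below is a minimal sketch with exact string matching; it is not the official evaluation script and ignores details such as span offsets and normalization.

```python
# Minimal sketch of entity-level F1 with exact matching; the official
# MedS-Bench evaluation may handle spans and normalization differently.
def entity_f1(pred_entities, gold_entities):
    pred, gold = set(pred_entities), set(gold_entities)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(entity_f1({"aspirin", "ibuprofen"}, {"aspirin", "warfarin"}))  # 0.5
```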
Diagnosis (Accuracy)

This table showcases the performance of various models in diagnosis. The DDXPlus dataset provides a predefined list of diseases, from which the model must select one based on the provided patient context. Accuracy is used as the metric.

| Method | DDXPlus |
| --- | --- |
| GPT-4 | 58.13 |
| Claude-3.5 | 60.24 |
| MEDITRON | 29.53 |
| InternLM 2 | 35.20 |
| Llama 3 | 33.73 |
| Mistral | 34.80 |
| MMedIns-Llama 3 | 97.53 |

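To make the task setup concrete, a constrained-choice prompt for this kind of diagnosis task could be assembled as sketched below. The candidate list and instruction wording are illustrative assumptions, not the exact prompt used in MedS-Bench.

```python
# Hypothetical prompt construction for a DDXPlus-style diagnosis query;
# the candidate subset and wording are illustrative only.
candidate_diseases = ["Bronchitis", "Pneumonia", "URTI", "Tuberculosis"]
patient_context = "45-year-old with productive cough, fever, and pleuritic chest pain."

prompt = (
    "Based on the patient information, select the single most likely diagnosis "
    f"from the following list: {', '.join(candidate_diseases)}.\n\n"
    f"Patient: {patient_context}\nAnswer:"
)

# Accuracy is the fraction of cases where the model's choice matches the label.
```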
Treatment Planning (Accuracy)

This table showcases the performance of various models in treatment planning. In the SEER dataset, treatment recommendations are categorized into eight high-level types. Accuracy is used as the metric.

| Method | SEER |
| --- | --- |
| GPT-4 | 84.73 |
| Claude-3.5 | 92.93 |
| MEDITRON | 68.27 |
| InternLM 2 | 62.33 |
| Llama 3 | 56.07 |
| Mistral | 38.93 |
| MMedIns-Llama 3 | 98.47 |

Clinical Outcome Prediction (Accuracy)

This table showcases the performance of various models in clinical outcome prediction across different datasets.

| Method | MIMIC4ED (Hospitalization) | MIMIC4ED (72h ED Revisit) | MIMIC4ED (Critical Triage) |
| --- | --- | --- | --- |
| GPT-4 | 61.20 | 58.07 | 60.13 |
| Claude-3.5 | 65.80 | 57.91 | 68.53 |
| MEDITRON | 56.27 | 48.47 | 45.67 |
| InternLM 2 | 58.80 | 55.13 | 52.80 |
| Llama 3 | 39.07 | 9.27 | 8.80 |
| Mistral | 56.27 | 48.47 | 45.67 |
| MMedIns-Llama 3 | 74.20 | 52.73 | 63.13 |

Text Classification (P, R, F1)

This table showcases the performance of various models in text classification. The dataset used is HoC, which is designed for multi-label classification. The metrics used are precision, recall, and F1 score.

| Method | HoC Precision | HoC Recall | HoC F1 |
| --- | --- | --- | --- |
| GPT-4 | 61.07 | 80.23 | 68.06 |
| Claude-3.5 | 58.43 | 79.84 | 66.74 |
| MEDITRON | 19.61 | 34.61 | 23.70 |
| InternLM 2 | 20.65 | 82.24 | 31.09 |
| Llama 3 | 32.40 | 52.03 | 38.37 |
| Mistral | 40.39 | 64.11 | 48.73 |
| MMedIns-Llama 3 | 89.59 | 85.58 | 86.66 |

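Because HoC is multi-label, precision, recall, and F1 are computed over label sets rather than single labels. The sketch below shows one common example-based averaging scheme; the exact averaging used for this leaderboard may differ.

```python
# Example-based precision/recall/F1 for multi-label classification;
# the averaging scheme used for HoC in MedS-Bench may differ.
def multilabel_prf(pred_labels, gold_labels):
    p_list, r_list, f_list = [], [], []
    for pred, gold in zip(pred_labels, gold_labels):
        pred, gold = set(pred), set(gold)
        tp = len(pred & gold)
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        p_list.append(p); r_list.append(r); f_list.append(f)
    n = len(p_list)
    return sum(p_list) / n, sum(r_list) / n, sum(f_list) / n

print(multilabel_prf(
    [["evading growth suppressors", "sustaining proliferative signaling"]],
    [["evading growth suppressors"]],
))  # (0.5, 1.0, 0.666...)
```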
Fact Verification (Accuracy, BLEU/ROUGE)

This table details the performance of various models on fact verification. For PubMedQA Answer Verification and HealthFact Verification, the LLM is tasked with selecting a single answer from a list of provided candidates, with accuracy employed as the evaluation metric. Conversely, for EBMS Justification Verification, which requires the LLM to generate free-form text, performance is assessed using BLEU and ROUGE metrics.

| Method | PubMedQA Answer Ver. | PUBLICHEALTH Health Fact Ver. | EBMS Justification Ver. |
| --- | --- | --- | --- |
| GPT-4 | 66.15 | 78.60 | 16.28/16.27 |
| Claude-3.5 | 11.54 | 62.04 | 14.77/16.45 |
| MEDITRON | 25.23 | 32.66 | 11.58/15.78 |
| InternLM 2 | 99.23 | 76.94 | 8.75/14.69 |
| Llama 3 | 94.77 | 63.89 | 16.52/16.49 |
| Mistral | 57.38 | 69.78 | 15.98/16.43 |
| MMedIns-Llama 3 | 97.08 | 79.55 | 12.71/14.65 |

Natural Language Inference (Accuracy, BLEU/ROUGE)

The results are measured with accuracy for the discriminative task (selecting the right answer from a list of candidates) and with BLEU/ROUGE for the generative task (generating free-form text answers).

| Method | MedNLI Discriminative Task | MedNLI Generative Task |
| --- | --- | --- |
| GPT-4 | 86.63 | 27.09/23.71 |
| Claude-3.5 | 82.14 | 17.80/20.02 |
| MEDITRON | 60.83 | 4.42/14.08 |
| InternLM 2 | 84.67 | 15.84/19.01 |
| Llama 3 | 63.85 | 21.31/22.75 |
| Mistral | 71.59 | 13.03/15.47 |
| MMedIns-Llama 3 | 86.71 | 23.52/25.17 |