MMedBench

A Medical Benchmark for Multilingual Comprehension

About

The task of MMedBench is to answer medical multiple-choice questions in 6 different languages. In addition, each question is paired with a rationale that justifies the correct choice.
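For illustration, a single sample might look like the following Python record. The field names and the French question are hypothetical, shown only to convey the structure (question, options, answer, rationale), not the official schema:

# A hypothetical MMedBench-style sample. Field names are illustrative;
# see the GitHub repository for the actual data format.
sample = {
    "language": "French",
    "question": "Quelle vitamine est synthétisée dans la peau sous l'effet des rayons UV ?",
    "options": {
        "A": "Vitamine A",
        "B": "Vitamine C",
        "C": "Vitamine D",
        "D": "Vitamine K",
    },
    "answer": "C",
    "rationale": "La vitamine D est synthétisée dans la peau à partir du "
                 "7-déhydrocholestérol sous l'action des rayons ultraviolets B.",
}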

For more details about MMedBench, please refer to the paper "Towards Building Multilingual Language Model for Medicine" (arXiv:2402.13963).

Dataset

MMedBench is a medical multiple-choice question-answering dataset covering 6 different languages. It contains 45k samples in the training set and 8,518 samples in the test set. Each question is accompanied by the correct answer and a high-quality rationale.

Please visit our GitHub repository to download the dataset.
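As a minimal sketch of how the data could be loaded, assuming the download ships as one JSONL file per language and split (the file layout and field names below are assumptions, not the documented format; check the repository for the real organization):

import json
from pathlib import Path

def load_split(root, split):
    """Load every per-language JSONL file for one split.

    Assumes a layout like <root>/<split>/<Language>.jsonl, which is an
    assumption for illustration, not the repository's documented layout.
    """
    samples = []
    for path in sorted(Path(root, split).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                record["language"] = path.stem  # e.g. "English", "French"
                samples.append(record)
    return samples

train = load_split("MMedBench", "Train")
test = load_split("MMedBench", "Test")
print(len(train), len(test))  # expected ~45k train / 8,518 test samples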

Submission

To submit your model, please follow the instructions in the GitHub repository.
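The leaderboard below ranks models by answer accuracy and by rationale quality measured with BLEU-1. As a minimal sketch of how those two numbers can be computed locally before submitting, assuming whitespace tokenization and NLTK smoothing (the official scoring script in the repository is authoritative and may tokenize or smooth differently):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def accuracy(pred_answers, gold_answers):
    """Fraction of predicted answer choices that exactly match the gold ones."""
    return sum(p == g for p, g in zip(pred_answers, gold_answers)) / len(gold_answers)

def bleu1(pred_rationales, gold_rationales):
    """Average sentence-level BLEU-1 over whitespace-tokenized rationales."""
    smooth = SmoothingFunction().method1
    scores = [
        sentence_bleu([gold.split()], pred.split(),
                      weights=(1, 0, 0, 0), smoothing_function=smooth)
        for pred, gold in zip(pred_rationales, gold_rationales)
    ]
    return sum(scores) / len(scores)

acc = accuracy(["A", "C", "B"], ["A", "B", "B"])
b1 = bleu1(["Vitamin D is synthesized in the skin."],
           ["Vitamin D is produced in the skin under UV light."])
print(f"Accuracy: {acc:.2%}  BLEU-1: {b1:.4f}")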

Citation

If you use MMedBench in your research, please cite our paper as follows:

@misc{qiu2024building,
  title={Towards Building Multilingual Language Model for Medicine}, 
  author={Pengcheng Qiu and Chaoyi Wu and Xiaoman Zhang and Weixiong Lin and Haicheng Wang and Ya Zhang and Yanfeng Wang and Weidi Xie},
  year={2024},
  eprint={2402.13963},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
            
Leaderboard
Rank  Model           Code    Size  Accuracy (%)  Rationale (BLEU-1)
1     GPT-4           NA      NA    74.27         NA
2     MMed-Llama 3    GitHub  8B    67.75         47.21
3     MMedLM 2        GitHub  7B    67.30         48.81
4     Llama 3         GitHub  8B    62.79         46.76
5     Mistral         GitHub  7B    60.73         45.37
6     InternLM 2      GitHub  7B    58.59         46.52
7     BioMistral      GitHub  7B    57.45         45.93
8     Gemini-1.0 Pro  NA      NA    55.20         7.28
9     MMedLM          GitHub  7B    55.01         45.05
10    MEDITRON        GitHub  7B    52.23         45.08
11    GPT-3.5         NA      NA    51.82         26.01
12    InternLM        GitHub  7B    45.67         42.12
13    BLOOMZ          GitHub  7B    45.10         43.22
14    LLaMA 2         GitHub  7B    42.26         44.24
15    Med-Alpaca      GitHub  7B    41.11         43.49
16    PMC-LLaMA       GitHub  7B    40.04         43.16
17    ChatDoctor      GitHub  7B    39.53         42.21