MMedBench

A Medical Benchmark for Multilingual Comprehension

About

The task in MMedBench is to answer medical multiple-choice questions in 6 different languages. In addition, each question is paired with a rationale explaining the correct choice.

For more details about MMedBench, please refer to this paper:

Dataset

MMedBench is a medical multiple-choice question-answering dataset covering 6 different languages. It contains 45k samples in the training set and 8,518 samples in the test set. Each question is accompanied by a correct answer and a high-quality rationale.
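For illustration, here is a minimal sketch of loading and inspecting samples. The file name and the field names (question, options, answer_idx, rationale) are assumptions made for this example; see the GitHub repository for the actual schema.

import json

# Assumed file name and field names; check the MMedBench repository
# for the real schema before relying on this sketch.
with open("mmedbench_test.json", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples[:3]:
    print(sample["question"])
    # Assumed layout: options as a dict like {"A": "...", "B": "..."}.
    for label, text in sample["options"].items():
        print(f"  {label}. {text}")
    print("Answer:", sample["answer_idx"])
    print("Rationale:", sample["rationale"])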

Please visit our GitHub repository to download the dataset:

Submission

To submit your model, please follow the instructions in the GitHub repository.

Citation

If you use MMedBench in your research, please cite our paper as follows:

@misc{qiu2024building,
  title={Towards Building Multilingual Language Model for Medicine}, 
  author={Pengcheng Qiu and Chaoyi Wu and Xiaoman Zhang and Weixiong Lin and Haicheng Wang and Ya Zhang and Yanfeng Wang and Weidi Xie},
  year={2024},
  eprint={2402.13963},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Leaderboard
Rank  Model           Code    Size  Accuracy (%)  Rationale (BLEU-1)
1     GPT-4           NA      NA    74.27         NA
2     MMedLM 2        GitHub  7B    67.30         48.81
3     Mistral         GitHub  7B    60.73         45.37
4     InternLM 2      GitHub  7B    58.59         46.52
5     Gemini-1.0 pro  NA      NA    55.20         7.28
6     MMedLM          GitHub  7B    55.01         45.05
7     GPT-3.5         NA      NA    51.82         26.01
8     InternLM        GitHub  7B    45.67         42.12
9     BLOOMZ          GitHub  7B    45.10         43.22
10    LLaMA 2         GitHub  7B    42.26         44.24
11    Med-Alpaca      GitHub  7B    41.11         43.49
12    PMC-LLaMA       GitHub  7B    40.04         43.16
13    ChatDoctor      GitHub  7B    39.53         42.21
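The leaderboard reports two metrics: accuracy on the multiple-choice answers and BLEU-1 between generated and reference rationales. The sketch below shows one plausible way to compute them; the sentence-level BLEU-1 here (clipped unigram precision with a brevity penalty) is an assumption about the scoring details, so refer to the paper and repository for the official evaluation script.

from collections import Counter
import math

def bleu1(reference: str, candidate: str) -> float:
    # Sentence-level BLEU-1: clipped unigram precision times a
    # brevity penalty. This is an assumed approximation of the
    # official scoring, not the benchmark's exact script.
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(ref_tokens)
    cand_counts = Counter(cand_tokens)
    # Count candidate unigrams that also appear in the reference,
    # clipped by how often they occur in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    precision = overlap / len(cand_tokens)
    # Penalize candidates shorter than the reference.
    bp = 1.0 if len(cand_tokens) > len(ref_tokens) else \
        math.exp(1 - len(ref_tokens) / len(cand_tokens))
    return bp * precision

def accuracy(golds, preds):
    # Fraction of questions where the predicted option matches the gold one.
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)

# Hypothetical usage:
print(accuracy(["A", "C"], ["A", "B"]))  # 0.5
print(bleu1("the correct answer is A because of X",
            "the answer is A because of X"))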