We developed an approach to tackle this problem and applied it to create the first Vietnamese medical question benchmark, featuring 14,000 multiple-choice questions across 34 medical specialties. Our benchmark was constructed from various verifiable sources, including medical exams and clinical records, carefully curated, crafted, and annotated by medical experts. Our contributions are:
Our benchmark includes 34 distinct medical categories, carefully selected to provide a holistic evaluation of medical knowledge. This extensive categorization serves multiple critical purposes.
We divide the dataset into 4 difficulty levels: easy, moderate, challenging, and hard.
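To make the categorization concrete, the sketch below shows one plausible record layout for a benchmark item, with a basic validity check of the kind a curation pipeline might run. The field names and the `validate` helper are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass

DIFFICULTY_LEVELS = ("easy", "moderate", "challenging", "hard")

@dataclass
class MedicalQuestion:
    # Hypothetical fields; the actual released schema may differ.
    question: str
    options: list        # multiple-choice answer texts
    answer: str          # correct option label, e.g. "C"
    specialty: str       # one of the 34 medical specialties
    difficulty: str      # one of DIFFICULTY_LEVELS

def validate(q: MedicalQuestion) -> bool:
    """Sanity checks a data-validation stage might apply to each item."""
    return (
        bool(q.question.strip())
        and len(q.options) >= 2
        and q.answer in "ABCD"[: len(q.options)]
        and q.difficulty in DIFFICULTY_LEVELS
    )

sample = MedicalQuestion(
    question="Which vitamin deficiency causes scurvy?",
    options=["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
    answer="C",
    specialty="Internal Medicine",
    difficulty="easy",
)
print(validate(sample))  # True
```

Keeping difficulty and specialty as explicit per-item fields is what allows evaluation results to be broken down by category and level.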
We illustrate below the stages of our data curation and validation pipeline, which ensure the robustness and quality of the data.
LLM Performance when Ensembling
@misc{nguyen2025vm14kvietnamesemedicalbenchmark,
title={VM14K: First Vietnamese Medical Benchmark},
author={Thong Nguyen and Duc Nguyen and Minh Dang and Thai Dao and Long Nguyen and Quan H. Nguyen and Dat Nguyen and Kien Tran and Minh Tran},
year={2025},
eprint={2506.01305},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.01305},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
Related Links:
[REACT]
[GLIGEN]
[Computer Vision in the Wild (CVinW)]
[Instruction Tuning with GPT-4]