VM14K: First Vietnamese Medical Benchmark

Author affiliations
1 Vietnam National University 2 Dickinson College 3 Columbia University 4 Venera AI 5 Carnegie Mellon University 6 University of Maryland 7 Foreign Trade University

Abstract

We developed an approach to address the lack of medical benchmarks for underrepresented languages, and applied it to create the first Vietnamese medical question benchmark, featuring 14,000 multiple-choice questions across 34 medical specialties. Our benchmark was constructed from verifiable sources, including medical exams and clinical records, carefully curated, crafted, and annotated by medical experts. Our contributions are:

  1. We developed a scalable framework that can mine local medical resources from various sources.
  2. We defined a simple yet flexible standard for designing medical benchmarks.
  3. We applied our method to create a medical benchmark for Vietnamese.

Benchmark Design

Our benchmark includes 34 distinct medical categories, carefully selected to provide a holistic evaluation of medical knowledge. This extensive categorization serves multiple critical purposes:

  • Complete Coverage of Medical Practice
  • Balanced Representation
  • Integrated Medical Knowledge
  • Public Health and Preventive Focus
  • Comprehensive Evaluation
  • Nuanced Assessment
  • Alignment with Medical Expertise

We divide the dataset into four difficulty levels: easy, moderate, challenging, and hard.

Data curation and validation pipeline

We illustrate below the stages of our data curation and validation pipeline, designed to ensure the robustness and quality of the data:

  • Data extraction
  • Data transformation
  • Data transfer
  • Workflow orchestration
  • Incremental extraction
  • LLM cleaning and validation
  • Expert correctness validation
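The staged pipeline above can be sketched as a sequence of transformations over each source. This is a minimal, hypothetical illustration only: all function and field names are assumptions, and the real system (orchestration, incremental extraction, expert review) is more involved.

```python
def extract(source):
    """Pull raw question records from one source (e.g. an exam or clinical record).
    The 'records' field is an illustrative assumption."""
    return source["records"]

def transform(records):
    """Normalize and drop malformed entries (here: keep only records
    with a non-empty question and answer)."""
    return [r for r in records if r.get("question") and r.get("answer")]

def llm_validate(records):
    """Placeholder for the LLM cleaning-and-validation stage (stubbed as a no-op);
    expert correctness review would follow as a manual stage."""
    return records

def run_pipeline(sources):
    """Run extract -> transform -> LLM validation over all sources."""
    out = []
    for src in sources:
        out.extend(llm_validate(transform(extract(src))))
    return out

sources = [{"records": [{"question": "Q1", "answer": "A"}, {"question": ""}]}]
print(run_pipeline(sources))  # only the well-formed record survives
```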

Baseline performance of LLMs and open-source medical models

LLM Performance on the pass@k metric
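As background, pass@k is commonly computed with the standard unbiased estimator (introduced for code generation by Chen et al., 2021): given n sampled answers per question of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch (we do not claim this is the paper's exact evaluation code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total answers sampled for a question
    c: number of those answers that are correct
    k: the k in pass@k (k <= n)
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct answer.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(2, 1, 1))  # → 0.5
```

Averaging this quantity over all benchmark questions gives the reported pass@k score.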

LLM Performance when ensembling
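For multiple-choice questions, a simple ensembling scheme is majority voting over the answers of several models (or several samples from one model). A hypothetical sketch, not necessarily the ensembling method used in the paper:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer choice among ensemble members.
    Ties are broken in favor of the answer seen first (Counter preserves
    insertion order in Python 3.7+)."""
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(["A", "B", "A", "C"]))  # → "A"
```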

BibTeX


@misc{nguyen2025vm14kvietnamesemedicalbenchmark,
      title={VM14K: First Vietnamese Medical Benchmark}, 
      author={Thong Nguyen and Duc Nguyen and Minh Dang and Thai Dao and Long Nguyen and Quan H. Nguyen and Dat Nguyen and Kien Tran and Minh Tran},
      year={2025},
      eprint={2506.01305},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.01305}, 
}


Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, Vicuna, and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Related Links: [REACT] [GLIGEN] [Computer Vision in the Wild (CVinW)] [Instruction Tuning with GPT-4]