Med-PaLM 2 Statistics 2026

Google’s Med-PaLM 2 achieved 86.5% accuracy on USMLE-style medical examinations in 2023, surpassing the expert threshold and establishing new benchmarks for clinical AI systems. Physicians preferred Med-PaLM 2 responses over human-generated answers across eight of nine evaluation axes, with 92.6% of outputs aligning with scientific consensus. The model demonstrated significantly safer performance compared to general-purpose AI systems while maintaining comparable accuracy to GPT-4.

Med-PaLM 2 Key Statistics

  • Med-PaLM 2 reached 86.5% accuracy on the MedQA benchmark, an 18.9-percentage-point improvement over its predecessor's 67.6%
  • Physicians judged Med-PaLM 2 answers to better reflect medical consensus than physician-written answers 72.9% of the time
  • Raters classified 90.6% of the model's answers to adversarial safety questions as posing a low risk of harm, per the January 2025 peer-reviewed evaluation
  • Med-PaLM M multimodal variants process six data modalities across 14 biomedical tasks, with architectures ranging from 12 billion to 562 billion parameters
  • Development progressed from the first USMLE passing score to expert-level performance in roughly four months, between December 2022 and March 2023

Med-PaLM 2 Benchmark Accuracy Metrics

Med-PaLM 2 established state-of-the-art performance across multiple standardized medical examination datasets. The model achieved 86.5% accuracy on MedQA questions modeled after United States Medical Licensing Examination formats, exceeding the roughly 60% passing threshold by a substantial margin.

On MedMCQA questions derived from Indian medical examinations, Med-PaLM 2 became the first AI system to surpass passing scores with 72.3% accuracy. This demonstrated the model’s ability to generalize across diverse medical education frameworks and regional clinical knowledge requirements.

| Benchmark Dataset | Med-PaLM 2 Accuracy | Previous Version | Improvement |
| --- | --- | --- | --- |
| MedQA (USMLE-style) | 86.5% | 67.6% | +18.9 pp |
| MedMCQA (Indian Exams) | 72.3% | 57.6% | +14.7 pp |
| PubMedQA | 81.8% | 79.0% | +2.8 pp |
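
The improvement column is a simple percentage-point difference between accuracy scores. As a minimal illustration of how such multiple-choice benchmarks are scored (the function and values below are a sketch, not the official evaluation harness):

```python
# Minimal sketch of multiple-choice benchmark scoring; predictions and
# answer keys are option letters such as "A"-"D".

def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of questions where the predicted option matches the key."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# The table's improvement figures are percentage-point deltas, e.g. MedQA:
med_palm_2, predecessor = 0.865, 0.676
print(f"MedQA gain: {100 * (med_palm_2 - predecessor):.1f} pp")  # 18.9 pp
```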

Med-PaLM 2 Physician Evaluation Results

Human evaluation revealed physician preferences for Med-PaLM 2 outputs across 1,066 consumer medical questions. Evaluators compared responses from Med-PaLM 2 against answers provided by practicing physicians across nine clinically relevant assessment dimensions.

Physicians preferred Med-PaLM 2 answers 72.9% of the time on the scientific consensus axis, a statistically significant margin (P < 0.001). Evaluators also rated the model higher for reading comprehension and medical reasoning, and as less likely to cause harm, than physician-generated responses.
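
As a back-of-the-envelope check on that significance claim, treating each of the 1,066 comparisons as an independent binary preference (a simplifying assumption; the published study used its own statistical design), a one-sided binomial test yields a p-value far below 0.001:

```python
# Illustrative significance check for the reported 72.9% preference rate.
from scipy.stats import binomtest

n = 1066
k = round(0.729 * n)  # ~777 questions where Med-PaLM 2 was preferred
result = binomtest(k, n, p=0.5, alternative="greater")
print(f"{k}/{n} preferred, p = {result.pvalue:.3g}")  # vanishingly small
```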

The evaluation involved 15 physicians from the United States, United Kingdom, and India who rated responses against standardized criteria. Med-PaLM 2 outperformed human physicians on eight of the nine evaluation axes; the exception concerned the inclusion of inaccurate or irrelevant information.

Med-PaLM 2 Safety Assessment Data

Safety evaluation forms a critical component of medical AI deployment readiness. Raters judged 90.6% of Med-PaLM 2 answers to adversarial prompts as posing a low risk of harm, an 11.2-percentage-point improvement over the previous version.

The model demonstrated 92.6% alignment with scientific consensus and showed no detectable performance disparities across the demographic subgroups examined. When compared against GPT-4 and GPT-3.5 on a 140-question MultiMedQA subset, Med-PaLM 2 produced significantly safer outputs with lower potential for patient harm.

Med-PaLM 2 Compared to GPT-4 Performance

Direct benchmark comparisons revealed competitive positioning between leading medical AI systems. Med-PaLM 2 achieved marginally higher accuracy than GPT-4-base on the MedQA benchmark with scores of 86.5% versus 86.1% respectively.

The medical domain-specific fine-tuning approach contributed to Med-PaLM 2’s superior safety characteristics in physician evaluations. General-purpose models like GPT-3.5 scored 60.2% on the same benchmark, while Flan-PaLM reached 67.6% accuracy.

| Model | MedQA Accuracy | Training Type |
| --- | --- | --- |
| Med-PaLM 2 | 86.5% | Medical domain fine-tuned |
| GPT-4-base | 86.1% | General purpose |
| Flan-PaLM | 67.6% | Instruction fine-tuned |
| GPT-3.5 | 60.2% | General purpose |

Med-PaLM 2 Clinical Consultation Performance

Real-world pilot studies examined bedside consultation questions submitted by specialist physicians during routine care delivery. Specialist physicians preferred Med-PaLM 2 answers over generalist physician responses 65% of the time.

Generalist evaluators preferred Med-PaLM 2 and generalist physician responses at roughly equal rates, each chosen about 50% of the time. Both specialist and generalist physicians rated Med-PaLM 2 as equally safe as physician-generated answers across all evaluation criteria.

When Med-PaLM 2 was compared against specialist physician responses, both specialist and generalist evaluators preferred the model's answers about 40% of the time. This suggests the AI system approaches specialist-level quality in specific clinical scenarios while maintaining a consistent safety profile.

Med-PaLM M Multimodal Architecture

Med-PaLM M extends the architecture into multimodal capabilities, processing medical images, genomic data, and clinical text within unified model frameworks. Three parameter variants demonstrate scaling characteristics across 14 diverse biomedical tasks.

| Model Variant | Parameters | Supported Modalities | MultiMedBench Tasks |
| --- | --- | --- | --- |
| Med-PaLM M (Small) | 12 billion | 6 | 14 |
| Med-PaLM M (Medium) | 84 billion | 6 | 14 |
| Med-PaLM M (Large) | 562 billion | 6 | 14 |

The 84-billion-parameter variant achieved the best balance between accuracy and error rates in radiology report generation. Clinicians expressed pairwise preference for Med-PaLM M reports over radiologist-produced reports in up to 40.5% of cases across 246 retrospective chest X-ray evaluations.

Med-PaLM M demonstrated zero-shot generalization capabilities, accurately identifying tuberculosis presentations in chest X-ray images despite receiving no prior training on tuberculosis-specific visual data. The MultiMedBench benchmark encompasses tasks including medical question answering, visual question answering, image classification, radiology report generation, and genomic variant calling.

Med-PaLM 2 Technical Architecture

Med-PaLM 2 incorporates multiple architectural and training innovations contributing to performance improvements across medical benchmarks. The base architecture utilizes PaLM 2 with compute-optimal scaling principles.

The Ensemble Refinement prompting strategy improved accuracy across multiple-choice benchmarks by conditioning model outputs on multiple generated explanations before producing final answers. Chain of Retrieval enhanced factuality by grounding claims through external medical information retrieval during the generation process.
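
A minimal sketch of the two-stage Ensemble Refinement idea follows, assuming a complete(prompt, temperature) callable that wraps an LLM API; the sampling counts and prompt wording here are illustrative, not the published recipe:

```python
# Sketch of Ensemble Refinement: sample several chain-of-thought
# explanations, then condition a second pass on all of them and vote.
from collections import Counter
from typing import Callable

def ensemble_refine(question: str,
                    complete: Callable[[str, float], str],
                    n_explanations: int = 11,
                    n_votes: int = 33) -> str:
    # Stage 1: sample diverse reasoning chains at nonzero temperature.
    explanations = [
        complete(f"{question}\nExplain your reasoning step by step.", 0.7)
        for _ in range(n_explanations)
    ]
    # Stage 2: condition on the question plus the sampled explanations,
    # draw several refined answers, and return the plurality vote.
    context = (question
               + "\n\nCandidate explanations:\n"
               + "\n---\n".join(explanations)
               + "\n\nGiven the above, state the single best answer option:")
    votes = [complete(context, 0.7).strip() for _ in range(n_votes)]
    return Counter(votes).most_common(1)[0][0]
```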

Training datasets included MedQA, MedMCQA, HealthSearchQA, LiveQA, and MedicationQA. The evaluation framework encompassed MultiMedQA with seven datasets and 14 assessment criteria applied across 1,066 consumer questions.

Med-PaLM 2 MMLU Clinical Performance

The Massive Multitask Language Understanding benchmark includes specialized clinical topic subsets testing domain-specific medical knowledge. Med-PaLM 2 achieved state-of-the-art performance on three of six MMLU clinical topics.

The model reached state-of-the-art scores in Clinical Knowledge, Medical Genetics, and Anatomy, while GPT-4-based systems achieved higher reported scores on Professional Medicine, College Medicine, and College Biology.

The relatively small test set sizes for individual MMLU topics warrant cautious interpretation of marginal performance differences between competing systems. Both Med-PaLM 2 and GPT-4 demonstrated strong capabilities across clinical knowledge domains.
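
To make that caution concrete, here is a rough confidence-interval calculation, assuming a topic with on the order of 100 test questions (an illustrative size): a 95% interval on a single accuracy score spans several percentage points, so one- or two-point gaps between models fall within noise.

```python
# Wilson score interval for a binomial proportion, illustrating why
# accuracy differences on small MMLU topic test sets are hard to read.
import math

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for correct/n."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_ci(correct=90, n=100)  # e.g., 90% accuracy on 100 questions
print(f"95% CI: {100*lo:.1f}%-{100*hi:.1f}%")  # about 82.6%-94.5%
```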

Med-PaLM 2 Development Timeline

The Med-PaLM development trajectory demonstrated rapid advancement in medical AI capabilities. The progression from initial USMLE passing scores to expert-level performance occurred within approximately four months.

Med-PaLM achieved 67.6% accuracy in December 2022, becoming the first AI to pass USMLE threshold requirements. By March 2023, Med-PaLM 2 reached 86.5% accuracy, marking expert-level performance in clinical question-answering tasks.

The January 2025 issue of Nature Medicine published the most comprehensive peer-reviewed evaluation of Med-PaLM 2 capabilities. This publication provided detailed analysis of safety characteristics, physician preferences, and benchmark performance across multiple medical assessment frameworks.

FAQ

What accuracy did Med-PaLM 2 achieve on medical exams?

Med-PaLM 2 achieved 86.5% accuracy on MedQA (USMLE-style questions) and 72.3% on MedMCQA (Indian medical exams), surpassing expert-level thresholds and previous AI benchmarks by substantial margins.

How does Med-PaLM 2 compare to GPT-4?

Med-PaLM 2 scored 86.5% on MedQA versus GPT-4’s 86.1%, showing marginally higher accuracy. Med-PaLM 2 demonstrated significantly safer outputs and better alignment with scientific consensus in physician evaluations.

What is Med-PaLM M?

Med-PaLM M is the multimodal extension of Med-PaLM 2, processing medical images, genomic data, and clinical text. Three variants range from 12 billion to 562 billion parameters across 14 biomedical tasks.

How safe is Med-PaLM 2 for medical use?

Raters judged 90.6% of Med-PaLM 2 answers to adversarial questions as posing a low risk of harm, and 92.6% of its outputs aligned with scientific consensus. Physicians rated it as equally safe or safer than human-generated medical answers across evaluation criteria.

When was Med-PaLM 2 released?

Med-PaLM 2 was announced in March 2023. Its predecessor Med-PaLM received peer-reviewed validation in Nature in July 2023, and the most comprehensive peer-reviewed evaluation of Med-PaLM 2 itself was published in Nature Medicine in January 2025.