Google’s Med-PaLM 2 achieved 86.5% accuracy on USMLE-style medical examinations in 2023, reaching expert-level performance and setting new benchmarks for clinical AI systems. Physicians preferred Med-PaLM 2 responses over human-generated answers on eight of nine evaluation axes, and 92.6% of its outputs aligned with scientific consensus. The model produced significantly safer outputs than general-purpose AI systems while maintaining accuracy comparable to GPT-4.
Med-PaLM 2 Key Statistics
- Med-PaLM 2 reached 86.5% accuracy on the MedQA benchmark, an 18.9 percentage point improvement over its predecessor’s 67.6%
- Physicians rated Med-PaLM 2 outputs as better aligned with scientific consensus than physician-generated responses in 72.9% of pairwise comparisons, and as less likely to cause harm
- Med-PaLM 2 answers were rated as carrying a low risk of harm in 90.6% of adversarial safety-testing scenarios, per the January 2025 peer-reviewed evaluation
- Med-PaLM M multimodal variants process six data modalities across 14 biomedical tasks, with model sizes ranging from 12 billion to 562 billion parameters
- Development progressed from an initial USMLE-level passing score to expert-level performance in roughly four months, between December 2022 and March 2023
Med-PaLM 2 Benchmark Accuracy Metrics
Med-PaLM 2 established state-of-the-art performance across multiple standardized medical examination datasets. The model achieved 86.5% accuracy on MedQA questions modeled after United States Medical Licensing Examination formats, exceeding the 60% passing threshold by a substantial margin.
On MedMCQA questions derived from Indian medical examinations, Med-PaLM 2 became the first AI system to surpass passing scores with 72.3% accuracy. This demonstrated the model’s ability to generalize across diverse medical education frameworks and regional clinical knowledge requirements.
| Benchmark Dataset | Med-PaLM 2 Accuracy | Previous Version | Improvement (pp) |
|---|---|---|---|
| MedQA (USMLE-style) | 86.5% | 67.6% | +18.9 |
| MedMCQA (Indian Exams) | 72.3% | 57.6% | +14.7 |
| PubMedQA | 81.8% | 79.0% | +2.8 |
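The improvement column reports absolute gains in percentage points rather than relative percentage increases. A minimal sketch of that arithmetic, using the accuracies from the table above:

```python
# Percentage-point gains of Med-PaLM 2 over its predecessor, computed from the
# benchmark accuracies reported in the table above.
scores = {
    # benchmark: (Med-PaLM 2 accuracy, previous version accuracy)
    "MedQA (USMLE-style)": (86.5, 67.6),
    "MedMCQA (Indian exams)": (72.3, 57.6),
    "PubMedQA": (81.8, 79.0),
}

for benchmark, (current, previous) in scores.items():
    delta_pp = current - previous          # absolute difference in percentage points
    relative = 100 * delta_pp / previous   # relative gain, shown for contrast
    print(f"{benchmark}: +{delta_pp:.1f} pp ({relative:.0f}% relative)")
```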
Med-PaLM 2 Physician Evaluation Results
Human evaluation revealed physician preferences for Med-PaLM 2 outputs across 1,066 consumer medical questions. Evaluators compared responses from Med-PaLM 2 against answers provided by practicing physicians across nine clinically relevant assessment dimensions.
Physicians preferred Med-PaLM 2 answers 72.9% of the time on the scientific consensus axis (P < 0.001). The model was also rated higher for reading comprehension and medical reasoning, and lower for likelihood of harm, than physician-generated responses.
The evaluation involved 15 physicians from the United States, United Kingdom, and India who rated responses across standardized criteria. Med-PaLM 2 outperformed human physicians on all evaluation axes except one dimension related to inclusion of inaccurate or irrelevant information.
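To see why a 72.9% preference rate over roughly a thousand pairwise comparisons comfortably clears the P < 0.001 bar, here is a hedged sketch using a two-sided binomial test against a 50% "no preference" null; the study's actual statistical procedure may differ, and the SciPy dependency is an assumption for illustration.

```python
# Hedged sketch: significance of a pairwise preference rate, assuming a
# two-sided binomial test against a 50% "no preference" null hypothesis.
# The published study's actual statistical methodology may differ.
from scipy.stats import binomtest

n_comparisons = 1066       # consumer medical questions in the evaluation
preference_rate = 0.729    # share of comparisons favoring Med-PaLM 2
n_preferred = round(n_comparisons * preference_rate)

result = binomtest(n_preferred, n_comparisons, p=0.5, alternative="two-sided")
print(f"Med-PaLM 2 preferred in {n_preferred}/{n_comparisons} comparisons, "
      f"p = {result.pvalue:.2e}")
# A rate this far above 50% across ~1,000 comparisons yields p far below 0.001.
```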
Med-PaLM 2 Safety Assessment Data
Safety evaluation is a critical component of medical AI deployment readiness. Med-PaLM 2 answers were rated as carrying a low risk of harm in 90.6% of adversarial testing scenarios, an 11.2 percentage point improvement over the previous version.
The model demonstrated 92.6% alignment with scientific consensus and showed no detectable bias across the demographic subgroups examined. When compared against GPT-4 and GPT-3.5 on a 140-question MultiMedQA subset, Med-PaLM 2 produced outputs rated as significantly safer, with lower potential for patient harm.
Med-PaLM 2 Compared to GPT-4 Performance
Direct benchmark comparisons revealed competitive positioning between leading medical AI systems. Med-PaLM 2 achieved marginally higher accuracy than GPT-4-base on the MedQA benchmark (86.5% versus 86.1%).
The medical domain-specific fine-tuning approach contributed to Med-PaLM 2’s superior safety characteristics in physician evaluations. General-purpose models like GPT-3.5 scored 60.2% on the same benchmark, while Flan-PaLM reached 67.6% accuracy.
| Model | MedQA Accuracy | Training Type |
|---|---|---|
| Med-PaLM 2 | 86.5% | Medical Domain Fine-Tuned |
| GPT-4-base | 86.1% | General Purpose |
| Flan-PaLM | 67.6% | Instruction Fine-Tuned |
| GPT-3.5 | 60.2% | General Purpose |
Med-PaLM 2 Clinical Consultation Performance
Real-world pilot studies examined bedside consultation questions submitted by specialist physicians during routine care delivery. Specialist physicians preferred Med-PaLM 2 answers over generalist physician responses 65% of the time.
Generalist evaluators rated Med-PaLM 2 and generalist physician responses as roughly equal, at about 50% preference. Both specialist and generalist physicians rated Med-PaLM 2 answers as comparably safe to physician-generated answers across all evaluation criteria.
When Med-PaLM 2 was compared against specialist physician responses, both specialist and generalist evaluators preferred the model’s answers in roughly 40% of cases. This suggests the system approaches specialist-level expertise in these clinical scenarios while maintaining a consistent safety profile.
Med-PaLM M Multimodal Architecture
Med-PaLM M extends the architecture into multimodal capabilities, processing medical images, genomic data, and clinical text within unified model frameworks. Three parameter variants demonstrate scaling characteristics across 14 diverse biomedical tasks.
| Model Variant | Parameters | Supported Modalities | MultiMedBench Tasks |
|---|---|---|---|
| Med-PaLM M (Small) | 12 billion | 6 modalities | 14 tasks |
| Med-PaLM M (Medium) | 84 billion | 6 modalities | 14 tasks |
| Med-PaLM M (Large) | 562 billion | 6 modalities | 14 tasks |
The 84 billion parameter variant achieved optimal balance between accuracy and error rates in radiology report generation. Clinicians expressed pairwise preference for Med-PaLM M reports over radiologist-produced reports in up to 40.5% of cases across 246 retrospective chest X-ray evaluations.
Med-PaLM M demonstrated zero-shot generalization capabilities, accurately identifying tuberculosis presentations in chest X-ray images despite receiving no prior training on tuberculosis-specific visual data. The MultiMedBench benchmark encompasses tasks including medical question answering, visual question answering, image classification, radiology report generation, and genomic variant calling.
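As a rough illustration of how a unified multimodal benchmark groups tasks by input type, the sketch below encodes the MultiMedBench task categories named above as a simple registry; the groupings and field names are illustrative assumptions, not the benchmark's actual schema.

```python
# Illustrative registry of the MultiMedBench task categories named above.
# Groupings and field names are a sketch, not the benchmark's real schema.
from dataclasses import dataclass

@dataclass
class TaskCategory:
    name: str
    input_modality: str
    output_type: str

MULTIMEDBENCH_SKETCH = [
    TaskCategory("medical question answering", "clinical text", "free text / multiple choice"),
    TaskCategory("visual question answering", "medical image + question text", "free text"),
    TaskCategory("image classification", "medical image", "class label"),
    TaskCategory("radiology report generation", "chest X-ray", "report text"),
    TaskCategory("genomic variant calling", "genomic data", "variant label"),
]

for task in MULTIMEDBENCH_SKETCH:
    print(f"{task.name}: {task.input_modality} -> {task.output_type}")
```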
Similar to machine learning systems transforming content creation, medical AI models leverage deep learning architectures to process complex multimodal data at scale.
Med-PaLM 2 Technical Architecture
Med-PaLM 2 incorporates multiple architectural and training innovations that contribute to its performance gains across medical benchmarks. The model builds on the PaLM 2 base architecture, which was trained using compute-optimal scaling principles.
The Ensemble Refinement prompting strategy improved accuracy on multiple-choice benchmarks by conditioning the model’s final answer on several of its own generated explanations. Chain of Retrieval enhanced factuality by grounding claims in retrieved external medical information during generation.
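A hedged sketch of what an Ensemble Refinement-style loop can look like in practice: sample several explanations, then condition a second round of generation on them and take a plurality vote. The `generate` helper is a hypothetical stand-in for the underlying LLM call, and the sampling counts and prompt format are illustrative rather than the values used for Med-PaLM 2.

```python
# Hedged sketch of an Ensemble Refinement-style prompting loop. `generate` is a
# hypothetical stand-in for an LLM call; sampling counts are illustrative only.
from collections import Counter

def generate(prompt: str, temperature: float) -> str:
    """Placeholder: returns a chain-of-thought explanation ending in an answer line."""
    raise NotImplementedError

def ensemble_refinement(question: str, n_explanations: int = 5, n_refined: int = 5) -> str:
    # Stage 1: stochastically sample several explanations for the question.
    explanations = [generate(question, temperature=0.7) for _ in range(n_explanations)]

    # Stage 2: condition on the concatenated explanations and sample refined
    # answers, so the model can reconcile agreements and disagreements.
    refinement_prompt = (
        question
        + "\n\nCandidate explanations:\n"
        + "\n---\n".join(explanations)
        + "\n\nConsidering the explanations above, give a final answer."
    )
    refined = [generate(refinement_prompt, temperature=0.7) for _ in range(n_refined)]

    # Naive answer extraction: take the last line of each refined response and
    # return the plurality vote across samples.
    votes = Counter(response.splitlines()[-1].strip() for response in refined)
    return votes.most_common(1)[0][0]
```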
Training datasets included MedQA, MedMCQA, HealthSearchQA, LiveQA, and MedicationQA. The evaluation framework encompassed MultiMedQA with seven datasets and 14 assessment criteria applied across 1,066 consumer questions.
Med-PaLM 2 MMLU Clinical Performance
The Massive Multitask Language Understanding benchmark includes specialized clinical topic subsets testing domain-specific medical knowledge. Med-PaLM 2 achieved state-of-the-art performance on three of six MMLU clinical topics.
The model reached state-of-the-art scores in Clinical Knowledge, Medical Genetics, and Anatomy. GPT-4-based systems reported higher scores on Professional Medicine, College Medicine, and College Biology topics.
The relatively small test set sizes for individual MMLU topics warrant cautious interpretation of marginal performance differences between competing systems. Both Med-PaLM 2 and GPT-4 demonstrated strong capabilities across clinical knowledge domains.
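To make the small-sample caveat concrete, the sketch below computes a 95% Wilson score interval for an accuracy measured on a hypothetical 100-question subset (MMLU clinical topics are on this order of magnitude); the interval spans several percentage points, so sub-point gaps between models sit within statistical noise.

```python
# Sketch: sampling uncertainty of an accuracy measured on a small test set,
# using a 95% Wilson score interval. The 100-question size is a hypothetical
# stand-in for an MMLU clinical topic subset.
import math

def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    center = (accuracy + z**2 / (2 * n)) / (1 + z**2 / n)
    margin = (z / (1 + z**2 / n)) * math.sqrt(
        accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)
    )
    return center - margin, center + margin

low, high = wilson_interval(accuracy=0.90, n=100)
print(f"90.0% observed accuracy on 100 questions -> 95% CI ~[{low:.1%}, {high:.1%}]")
# The interval is roughly 12 percentage points wide, so small score differences
# between models on such subsets should not be over-interpreted.
```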
Med-PaLM 2 Development Timeline
The Med-PaLM development trajectory demonstrated rapid advancement in medical AI capabilities. The progression from initial USMLE passing scores to expert-level performance occurred within approximately four months.
Med-PaLM achieved 67.6% accuracy in December 2022, becoming the first AI system to exceed the USMLE-style passing threshold. By March 2023, Med-PaLM 2 reached 86.5% accuracy, marking expert-level performance on clinical question-answering tasks.
In January 2025, Nature Medicine published the most comprehensive peer-reviewed evaluation of Med-PaLM 2 to date. The publication provided detailed analysis of safety characteristics, physician preferences, and benchmark performance across multiple medical assessment frameworks.
As AI integration becomes core to computing platforms, specialized medical AI systems demonstrate the potential for domain-specific applications requiring both accuracy and safety considerations.
FAQ
What accuracy did Med-PaLM 2 achieve on medical exams?
Med-PaLM 2 achieved 86.5% accuracy on MedQA (USMLE-style questions) and 72.3% on MedMCQA (Indian medical exams), surpassing expert-level thresholds and previous AI benchmarks by substantial margins.
How does Med-PaLM 2 compare to GPT-4?
Med-PaLM 2 scored 86.5% on MedQA versus GPT-4’s 86.1%, showing marginally higher accuracy. Med-PaLM 2 demonstrated significantly safer outputs and better alignment with scientific consensus in physician evaluations.
What is Med-PaLM M?
Med-PaLM M is the multimodal extension of Med-PaLM 2, processing medical images, genomic data, and clinical text. Three variants range from 12 billion to 562 billion parameters across 14 biomedical tasks.
How safe is Med-PaLM 2 for medical use?
Med-PaLM 2 answers were rated as carrying a low risk of harm in 90.6% of adversarial scenarios, and 92.6% of outputs aligned with scientific consensus. Physicians rated the model as equally safe or safer than human-generated medical answers across evaluation criteria.
When was Med-PaLM 2 released?
Med-PaLM 2 was announced in March 2023. The original Med-PaLM study was published in Nature in July 2023, and the most comprehensive peer-reviewed evaluation of Med-PaLM 2 appeared in Nature Medicine in January 2025.

