
Phi-4 Multimodal Instruct

Microsoft · Phi · Open Weight · MIT License · Commercial OK

Description

Phi-4-multimodal-instruct is a lightweight (5.57B parameters) open multimodal foundation model that leverages research and datasets from Phi-3.5 and 4.0. It processes text, image, and audio inputs to generate text outputs, supporting a 128K token context length. Enhanced via SFT, DPO, and RLHF for instruction following and safety.
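Since the model accepts mixed text, image, and audio inputs, a prompt must mark where each media item goes. The sketch below shows one way such a prompt might be assembled; the `<|image_N|>` / `<|audio_N|>` placeholder syntax is an assumption based on the Phi model family's convention and should be checked against the official model card before use.

```python
# Minimal sketch of assembling a multimodal chat prompt for a Phi-family
# model. The <|image_N|> / <|audio_N|> placeholder tokens are an ASSUMPTION
# (Phi-family convention), not confirmed by this page.
def build_prompt(question: str, n_images: int = 0, n_audios: int = 0) -> str:
    """Prefix the user question with numbered media placeholder tokens."""
    tags = [f"<|image_{i}|>" for i in range(1, n_images + 1)]
    tags += [f"<|audio_{i}|>" for i in range(1, n_audios + 1)]
    return "".join(tags) + question

prompt = build_prompt("What is shown in the picture?", n_images=1)
# prompt == "<|image_1|>What is shown in the picture?"
```

The numbered placeholders let the processor align each attached image or audio clip with its position in the text.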

Release Date: 2025-02-26
Parameters: 5.6B
Context Length: 128K tokens
Modalities: text, image, audio

Capability Radar

General: 21
Coding: 13
Reasoning: 32
Science (est.): 19
Agents: 0
Multimodal: 85

Science uses a reasoning proxy when dedicated science benchmarks are unavailable.

Rankings

| Domain | Rank | Score | Source |
|---|---|---|---|
| Code | #397 | 14.0 | AA |
| General | #404 | 24.0 | AA |
| Math Reasoning | #228 | 39.0 | AA |
| Multimodal | #25 | 82.0 | LS |
| Science | #431 | 17.0 | AA |

Benchmark Scores (LLM Stats)

3D

BLINK: 61.3% (SR)

General

MMMU: 55.1% (SR)
MMMU-Pro: 38.5% (SR)

Image to Text

DocVQA: 93.2% (SR)
OCRBench: 84.4% (SR)
TextVQA: 75.6% (SR)

Math

MathVista: 62.4% (SR)
Inter-GPS: 48.6% (SR)

Multimodal

ScienceQA Visual: 97.5% (SR)
MMBench: 86.7% (SR)
POPE: 85.6% (SR)
AI2D: 82.3% (SR)
ChartQA: 81.4% (SR)
InfoVQA: 72.7% (SR)
Video-MME: 55.0% (SR)

AA Evaluation Indices

Intelligence Index: 10.0
MATH-500: 0.7
MMLU-Pro: 0.5
GPQA: 0.3
LiveCodeBench: 0.1
SciCode: 0.1
AIME: 0.1
HLE: 0.0

LLM Stats Category Scores

Image to Text: 80
Vision: 70
Multimodal: 70
Reasoning: 70
Spatial Reasoning: 60
3D: 60
Healthcare: 60
Math: 60
General: 50

Pricing

Input Price: Free
Output Price: Free
Blended Price (3:1): Free
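A "blended" price conventionally averages input and output per-token prices at the stated 3:1 input-to-output token ratio, i.e. blended = (3 × input + output) / 4. A minimal sketch, assuming that convention:

```python
def blended_price(input_price: float, output_price: float, ratio: float = 3.0) -> float:
    """Weighted average of per-token prices at a given input:output token ratio.

    Assumes the common 3:1 blending convention; check the source's
    methodology for the exact definition it uses.
    """
    return (ratio * input_price + output_price) / (ratio + 1.0)

blended_price(0.0, 0.0)  # -> 0.0 (this model is listed as free on both sides)
blended_price(1.0, 3.0)  # -> 1.5 per the 3:1 weighting
```

With both prices listed as Free here, the blended figure is trivially zero; the formula matters when comparing paid models.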

Speed

Throughput: 16.7 tokens/s
Time to First Token: 0.50 s
Time to Answer: 0.50 s
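The throughput and time-to-first-token figures above combine into a rough end-to-end latency estimate: total time ≈ TTFT + (generated tokens ÷ tokens per second). A minimal sketch, assuming steady decode speed:

```python
def time_to_answer(ttft_s: float, n_tokens: int, tokens_per_s: float) -> float:
    """Estimated wall-clock time for a full response.

    Model: time to first token, plus decode time at a constant
    generation rate. Real throughput varies with load and batch size.
    """
    return ttft_s + n_tokens / tokens_per_s

# A hypothetical 100-token answer at the listed 16.7 tok/s and 0.50 s TTFT:
round(time_to_answer(0.50, 100, 16.7), 2)  # -> 6.49 (seconds)
```

This back-of-envelope model is useful for sizing interactive applications, where TTFT dominates perceived responsiveness and throughput dominates total completion time.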

Available Providers

No provider data available

External Sources