Llama 3.2 Instruct 11B (Vision)
MetaLlamaOpen WeightLlama 3.2 Community License
Description
Llama 3.2 11B Vision Instruct is an instruction-tuned multimodal large language model optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. It accepts text and images as input and generates text as output.
Release Date
2024-09-25
Parameters
10.6B
Context Length
131K
Modalities
image, text
Capability Radar
19
general
7
coding
13
reasoning
15
scienceest.
0
agents
90
multimodal
Science uses a reasoning proxy when dedicated science benchmarks are unavailable.
Rankings
| Domain | #Rank | Score | Source |
|---|---|---|---|
| Code Ranking | 432 | 8.0 | AA |
| General Ranking | 428 | 21.0 | AA |
| Math Reasoning | 328 | 13.0 | AA |
| Multimodal Ranking | 23 | 84.0 | LS |
| Science | 455 | 14.0 | AA |
Benchmark Scores (LLM Stats)
Biology
GPQA
32.8%SR
Finance
MMLU
73.0%SR
General
MMMU
50.7%SR
MMMU-Pro
33.0%SR
Image To Text
DocVQA
88.4%SR
VQAv2 (test)
75.2%SR
Math
MGSM
68.9%SR
MATH
51.9%SR
MathVista
51.5%SR
Multimodal
AI2D
91.1%SR
ChartQA
83.4%SR
AA Evaluation Indices
Intelligence Index8.7
Coding Index4.3
Math Index1.7
Math 5000.5
Mmlu Pro0.5
Ifbench0.3
Gpqa0.2
Tau20.1
Lcr0.1
Scicode0.1
Livecodebench0.1
Aime0.1
Hle0.1
Aime 250.0
Terminalbench Hard0.0
LLM Stats Category Scores
Image To Text90
Vision70
Finance70
Language70
Legal70
Multimodal70
Healthcare60
Math60
Reasoning60
General50
Biology30
Chemistry30
Physics30
Pricing
Input Price$0.245 / 1M tokens
Output Price$0.245 / 1M tokens
Blended Price (3:1)$0.245 / 1M tokens
Speed
Tokens/sec86.7 tokens/s
Time to First Token0.52s
Time to Answer0.52s
Available Providers
(LS internal units)No provider data available