Saltar al contenido principal

Qwen3 VL 4B Instruct

AlibabaQwenOpen WeightApache 2.0 · Uso Comercial

Descripción

Qwen3-VL is a large multimodal model that unifies vision, language, and reasoning to achieve human-level perception and cognition across text, images, and video. Built on a 235B-parameter architecture, it integrates early joint training of visual and textual modalities for strong language grounding. The model supports up to a 1 million-token context window and excels at visual understanding, spatial reasoning, long video comprehension, and tool-based interaction. It can generate code from images, perform precise 2D/3D object grounding, and operate digital interfaces like a visual agent. The “Instruct” version rivals Gemini 2.5 Pro in perception benchmarks, while the “Thinking” version leads in multimodal reasoning and STEM tasks. With multilingual OCR, creative writing, and fine-grained scene interpretation, Qwen3-VL establishes a new open-source frontier for integrated vision-language intelligence.

Fecha de lanzamiento
2025-10-14
Parámetros
4.0B
Longitud del contexto
Modalidades
image, text

Radar de capacidades

22
general
26
coding
37
reasoning
22
scienceest.
60
agents
90
multimodal

Science usa un proxy de razonamiento cuando los benchmarks científicos dedicados no están disponibles.

Rankings

Dominio#PosiciónPuntuaciónFuente
Capacidad agéntica94
48.0
LS
Ranking de codificación407
16.0
AA
Ranking general405
26.0
AA
Razonamiento matemático237
37.0
AA
Ranking multimodal63
72.0
LS
Razonamiento95
48.0
LS
Ciencia444
20.0
AA

Puntuaciones de benchmarks (LLM Stats)

3d

BLINK65.8%Aut.

Agents

BFCL-v363.3%Aut.
OSWorld26.2%Aut.

Chemistry

SuperGPQA40.3%Aut.

Communication

MM-MT-Bench7.50 / 100Aut.
WritingBench82.5%Aut.

Factuality

SimpleQA48.0%Aut.

Finance

MMLU77.2%Aut.
MMLU-Pro67.1%Aut.
MMLU-ProX59.4%Aut.

General

IFEval82.3%Aut.
MMLU-Redux81.5%Aut.
MLVU-M75.3%Aut.
MMStar69.8%Aut.
MMMU (val)67.4%Aut.
Include61.4%Aut.
LiveBench 2024112560.9%Aut.
MMMU-Pro53.2%Aut.
LiveCodeBench v637.9%Aut.

Grounding

ScreenSpot94.0%Aut.
ScreenSpot Pro59.5%Aut.

Healthcare

VideoMMMU56.2%Aut.

Image To Text

OCRBench88.1%Aut.
OCRBench-V2 (en)63.7%Aut.
OCRBench-V2 (zh)57.6%Aut.

Language

CharadesSTA55.5%Aut.

Long Context

LVBench56.2%Aut.

Math

MathVista-Mini73.7%Aut.
MathVision51.6%Aut.
AIME 202546.6%Aut.
HMMT2530.7%Aut.
PolyMATH28.8%Aut.

Multimodal

DocVQAtest95.3%Aut.
MMBench-V1.185.1%Aut.
AI2D84.1%Aut.
InfoVQAtest80.3%Aut.
CharXiv-D76.2%Aut.
CC-OCR76.2%Aut.
MVBench68.9%Aut.
MuirBench63.8%Aut.
CharXiv-R39.7%Aut.

Reasoning

Hallusion Bench57.6%Aut.
ERQA41.3%Aut.

Spatial Reasoning

RealWorldQA70.9%Aut.

Vision

ODinW48.2%Aut.

Índices de evaluación AA

Math Index
37.0
Intelligence Index
4.1
Mmlu Pro
0.6
Gpqa
0.4
Aime 25
0.4
Ifbench
0.3
Livecodebench
0.3
Tau2
0.2
Scicode
0.1
Lcr
0.1
Hle
0.0
Terminalbench Hard
0.0

Puntuaciones por categoría LLM Stats

Communication
4
Multimodal
90
Instruction Following
80
Grounding
80
Creativity
80
Text-to-image
80
Writing
80
Image To Text
70
Language
70
Legal
70
Structured Output
70
3d
70
Long Context
60
Math
60
Reasoning
60
Spatial Reasoning
60
Finance
60
General
60
Healthcare
60
Tool Calling
60
Video
60
Vision
60
Factuality
50
Physics
40
Agents
40
Chemistry
40
Economics
40

Precios

Precio de entradaGratis
Precio de salidaGratis
Precio mixto (3:1)Gratis

Velocidad

Tokens/seg0.0
Retraso del primer token0.00s
Tiempo hasta la respuesta0.00s

Ranking de Precios por Proveedor

Ranking de Precios por Proveedor

1 proveedores

ProveedorEntradaSalida
1DeepInfra
$0
$0

Comparar precios entre diferentes proveedores de API para este modelo.

Fuentes externas