Qwen3 VL 8B Instruct
Description
Qwen3-VL is a family of large multimodal models that unifies vision, language, and reasoning across text, images, and video. The family's flagship is built on a 235B-parameter architecture, while this entry covers the 8B Instruct variant. Visual and textual modalities are trained jointly from an early stage for strong language grounding. The model supports a context window of up to 1 million tokens and targets visual understanding, spatial reasoning, long video comprehension, and tool-based interaction. It can generate code from images, perform precise 2D/3D object grounding, and operate digital interfaces as a visual agent. At the flagship scale, the “Instruct” version rivals Gemini 2.5 Pro on perception benchmarks, while the “Thinking” version leads in multimodal reasoning and STEM tasks. With multilingual OCR, creative writing, and fine-grained scene interpretation, Qwen3-VL is positioned as a new open-source frontier for integrated vision-language intelligence.
Capability Radar
When no dedicated science benchmark is available, the Science score is estimated using a reasoning proxy.
Rankings
| Domain | Rank | Score | Source |
|---|---|---|---|
| Agents & Tools | 64 | 53.0 | LS |
| Code Ranking | 369 | 17.0 | AA |
| General Ranking | 332 | 34.0 | AA |
| Math Reasoning | 274 | 28.0 | AA |
| Multimodal Ranking | 50 | 74.0 | LS |
| Reasoning | 83 | 52.0 | LS |
| Science | 400 | 24.0 | AA |
Benchmark Scores (LLM Stats)
Categories: 3D, Agents, Chemistry, Communication, Finance, General, Grounding, Healthcare, Image To Text, Language, Long Context, Math, Multimodal, Reasoning, Spatial Reasoning, Vision
AA Evaluation Index
LLM Stats Category Scores
Pricing
Speed
Available Providers
(LS internal units)

| Provider | Input Price | Output Price |
|---|---|---|
| Novita | 80K | 500K |
| DeepInfra | 180K | 690K |