
Anyone running Qwen3 VL embeddings?

Reddit r/LocalLLaMA · 2026/02/11 22:53

So I've been trying to get the Qwen3 VL Embedding 2B model running locally with vLLM following the official instructions, and I'm kinda confused by the VRAM usage. On my 4090 it's eating up 20+ GB even with a small 8K max context length, which seems insane for a 2B model. For comparison, I can run Qwen3 VL 4B through Ollama with a bigger context and it uses way less VRAM. Has anyone actually gotten this model running efficiently? I feel like I'm missing something obvious here.

Also wondering if there's any way to quantize it to Q4 or Q8 right now? I've looked around and can't find any proper quants besides an FP8 one and some GGUFs that didn't really work for me. llm-compressor doesn't seem to have support for it.
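
For reference, here's roughly how I'm loading it. If I understand vLLM right, it pre-allocates a fraction of total VRAM up front (gpu_memory_utilization defaults to 0.9, which is ~21 GB on a 4090), so that might be part of what I'm seeing. The model id and the task flag below are from memory, so treat them as assumptions and check the official model card:

```python
# Rough sketch of my setup -- model id and task flag are assumptions,
# double-check them against the official instructions.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-VL-Embedding-2B",  # assumed HF repo id
    task="embed",                        # pooling mode ("embedding" on older vLLM)
    max_model_len=8192,                  # the "small 8K" context mentioned above
    gpu_memory_utilization=0.5,          # default 0.9 pre-allocates ~21 GB on a 4090
    enforce_eager=True,                  # skip CUDA graphs, saves a bit more VRAM
)

# Quick smoke test: embed one string and print the vector size.
outputs = llm.embed(["a quick test sentence"])
print(len(outputs[0].outputs.embedding))
```

If the 20+ GB is mostly that pre-allocation, dropping gpu_memory_utilization should shrink it, but I'd still like to hear what settings people are actually using.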