[New Model]: jinaai/jina-embeddings-v3
The model to consider.
https://huggingface.co/jinaai/jina-embeddings-v3

jina-embeddings-v3 is a multilingual, multi-task text embedding model designed for a variety of NLP applications. It is based on the Jina-XLM-RoBERTa architecture and uses Rotary Position Embeddings (RoPE) to handle input sequences of up to 8192 tokens. It also ships five LoRA adapters for generating task-specific embeddings efficiently.
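For context, the model card documents task-specific encoding roughly as follows (a minimal sketch based on the model card; the `task` names select among the five LoRA adapters):

```python
# Minimal sketch following the jina-embeddings-v3 model card; the custom
# XLM-RoBERTa variant requires trust_remote_code=True.
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

# `task` selects one of the five LoRA adapters, e.g. "retrieval.query",
# "retrieval.passage", "separation", "classification", "text-matching".
embeddings = model.encode(["What is vLLM?"], task="retrieval.query")
print(embeddings.shape)  # 1024-dim by default; dimensions are truncatable (Matryoshka)
```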
The closest model vllm already supports.
No response
What's your difficulty of supporting the model you want?
Has the project team considered adding support for jinaai/jina-embeddings-v3 in an upcoming release? Attempting to serve the model with vLLM 0.7.3 (launched via gpustack) currently fails with the error below.
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
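For reference, the gpustack launch below corresponds roughly to the following invocation (a reconstruction from the args dump in the log, not the literal command gpustack ran; flags per vLLM 0.7.x):

```bash
vllm serve /var/lib/gpustack/cache/model_scope/jinaai/jina-embeddings-v3 \
    --host 0.0.0.0 --port 40183 \
    --served-model-name jina-embeddings-v3 \
    --trust-remote-code --max-model-len 8192 \
    --model-impl transformers
```

The full startup log follows: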
```
2025-02-23T13:07:02+00:00 - gpustack.worker.backends.vllm - INFO - Starting vllm server
INFO 02-23 13:07:05 __init__.py:207] Automatically detected platform cuda.
INFO 02-23 13:07:05 api_server.py:912] vLLM API server version 0.7.3
INFO 02-23 13:07:05 api_server.py:913] args: Namespace(subparser='serve', model_tag='/var/lib/gpustack/cache/model_scope/jinaai/jina-embeddings-v3', config='', host='0.0.0.0', port=40183, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/var/lib/gpustack/cache/model_scope/jinaai/jina-embeddings-v3', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='transformers', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['jina-embeddings-v3'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, 
override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f230876f0a0>)
INFO 02-23 13:07:05 api_server.py:209] Started engine process with PID 5097
INFO 02-23 13:07:07 __init__.py:207] Automatically detected platform cuda.
INFO 02-23 13:07:08 config.py:422] Found sentence-transformers modules configuration.
INFO 02-23 13:07:08 config.py:442] Found pooling configuration.
INFO 02-23 13:07:08 config.py:549] This model supports multiple tasks: {'classify', 'embed', 'reward', 'score'}. Defaulting to 'embed'.
INFO 02-23 13:07:10 config.py:422] Found sentence-transformers modules configuration.
INFO 02-23 13:07:10 config.py:442] Found pooling configuration.
INFO 02-23 13:07:10 config.py:549] This model supports multiple tasks: {'classify', 'score', 'reward', 'embed'}. Defaulting to 'embed'.
INFO 02-23 13:07:10 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/var/lib/gpustack/cache/model_scope/jinaai/jina-embeddings-v3', speculative_config=None, tokenizer='/var/lib/gpustack/cache/model_scope/jinaai/jina-embeddings-v3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=jina-embeddings-v3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=PoolerConfig(pooling_type='MEAN', normalize=True, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 02-23 13:07:11 cuda.py:229] Using Flash Attention backend.
INFO 02-23 13:07:12 model_runner.py:1110] Starting to load model /var/lib/gpustack/cache/model_scope/jinaai/jina-embeddings-v3...
ERROR 02-23 13:07:13 engine.py:400] The Transformers implementation of XLMRobertaModel is not compatible with vLLM.
ERROR 02-23 13:07:13 engine.py:400] Traceback (most recent call last):
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine
ERROR 02-23 13:07:13 engine.py:400] engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 124, in from_engine_args
ERROR 02-23 13:07:13 engine.py:400] return cls(ipc_path=ipc_path,
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 76, in __init__
ERROR 02-23 13:07:13 engine.py:400] self.engine = LLMEngine(*args, **kwargs)
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 273, in __init__
ERROR 02-23 13:07:13 engine.py:400] self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 02-23 13:07:13 engine.py:400] self._init_executor()
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 02-23 13:07:13 engine.py:400] self.collective_rpc("load_model")
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 02-23 13:07:13 engine.py:400] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/utils.py", line 2196, in run_method
ERROR 02-23 13:07:13 engine.py:400] return func(*args, **kwargs)
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
ERROR 02-23 13:07:13 engine.py:400] self.model_runner.load_model()
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1112, in load_model
ERROR 02-23 13:07:13 engine.py:400] self.model = get_model(vllm_config=self.vllm_config)
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 02-23 13:07:13 engine.py:400] return loader.load_model(vllm_config=vllm_config)
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 406, in load_model
ERROR 02-23 13:07:13 engine.py:400] model = _initialize_model(vllm_config=vllm_config)
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 115, in _initialize_model
ERROR 02-23 13:07:13 engine.py:400] model_class, _ = get_model_architecture(model_config)
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 106, in get_model_architecture
ERROR 02-23 13:07:13 engine.py:400] architectures = resolve_transformers_fallback(model_config,
ERROR 02-23 13:07:13 engine.py:400] File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 69, in resolve_transformers_fallback
ERROR 02-23 13:07:13 engine.py:400] raise ValueError(
ERROR 02-23 13:07:13 engine.py:400] ValueError: The Transformers implementation of XLMRobertaModel is not compatible with vLLM.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 402, in run_mp_engine
raise e
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 124, in from_engine_args
return cls(ipc_path=ipc_path,
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 76, in init
self.engine = LLMEngine(*args, **kwargs)
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 273, in init
self.model_executor = executor_class(vllm_config=vllm_config, )
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in init
self._init_executor()
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
self.collective_rpc("load_model")
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/utils.py", line 2196, in run_method
return func(*args, **kwargs)
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1112, in load_model
self.model = get_model(vllm_config=self.vllm_config)
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/init.py", line 14, in get_model
return loader.load_model(vllm_config=vllm_config)
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 406, in load_model
model = _initialize_model(vllm_config=vllm_config)
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 115, in _initialize_model
model_class, _ = get_model_architecture(model_config)
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 106, in get_model_architecture
architectures = resolve_transformers_fallback(model_config,
File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 69, in resolve_transformers_fallback
raise ValueError(
ValueError: The Transformers implementation of XLMRobertaModel is not compatible with vLLM.
[rank0]:[W223 13:07:13.126509316 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
Traceback (most recent call last):
File "/var/lib/gpustack/bin/vllm_v0.7.3", line 8, in