
[New Model]: jinaai/jina-embeddings-v3

Open TC10127 opened this issue 1 year ago • 2 comments

The model to consider.

https://huggingface.co/jinaai/jina-embeddings-v3

jina-embeddings-v3 is a multilingual multi-task text embedding model designed for a variety of NLP applications. Based on the Jina-XLM-RoBERTa architecture, this model supports Rotary Position Embeddings to handle long input sequences up to 8192 tokens. Additionally, it features 5 LoRA adapters to generate task-specific embeddings efficiently.
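For context, the Hugging Face model card loads the model through its custom remote code and selects one of the LoRA adapters with a `task` argument, roughly as sketched below (based on the card, not on any vLLM API; the adapter names are taken from the card):

```python
from transformers import AutoModel

# Load jina-embeddings-v3; trust_remote_code pulls in the custom
# Jina-XLM-RoBERTa implementation that bundles the LoRA adapters.
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

texts = ["Follow the white rabbit.", "Suis le lapin blanc."]

# `task` selects one of the five task-specific LoRA adapters:
# "retrieval.query", "retrieval.passage", "separation",
# "classification", or "text-matching".
embeddings = model.encode(texts, task="text-matching")
```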

The closest model vllm already supports.

No response

What's your difficulty of supporting the model you want?

Has the project team considered adding support for jinaai/jina-embeddings-v3 in an upcoming release?
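If support were added, the model would presumably be usable like vLLM's other embedding models. A hypothetical sketch with the offline `LLM` API:

```python
from vllm import LLM

# Hypothetical once support lands; as of v0.7.3 loading this model
# fails with "The Transformers implementation of XLMRobertaModel is
# not compatible with vLLM."
llm = LLM(model="jinaai/jina-embeddings-v3", task="embed", trust_remote_code=True)

outputs = llm.embed(["Follow the white rabbit."])
print(outputs[0].outputs.embedding[:8])  # first few dims of the embedding vector
```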

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

TC10127 avatar Jan 17 '25 08:01 TC10127

```
2025-02-23T13:07:02+00:00 - gpustack.worker.backends.vllm - INFO - Starting vllm server
INFO 02-23 13:07:05 __init__.py:207] Automatically detected platform cuda.
INFO 02-23 13:07:05 api_server.py:912] vLLM API server version 0.7.3
INFO 02-23 13:07:05 api_server.py:913] args: Namespace(subparser='serve', model_tag='/var/lib/gpustack/cache/model_scope/jinaai/jina-embeddings-v3', config='', host='0.0.0.0', port=40183, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/var/lib/gpustack/cache/model_scope/jinaai/jina-embeddings-v3', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=8192, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='transformers', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['jina-embeddings-v3'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f230876f0a0>)
INFO 02-23 13:07:05 api_server.py:209] Started engine process with PID 5097
INFO 02-23 13:07:07 __init__.py:207] Automatically detected platform cuda.
INFO 02-23 13:07:08 config.py:422] Found sentence-transformers modules configuration.
INFO 02-23 13:07:08 config.py:442] Found pooling configuration.
INFO 02-23 13:07:08 config.py:549] This model supports multiple tasks: {'classify', 'embed', 'reward', 'score'}. Defaulting to 'embed'.
INFO 02-23 13:07:10 config.py:422] Found sentence-transformers modules configuration.
INFO 02-23 13:07:10 config.py:442] Found pooling configuration.
INFO 02-23 13:07:10 config.py:549] This model supports multiple tasks: {'classify', 'score', 'reward', 'embed'}. Defaulting to 'embed'.
INFO 02-23 13:07:10 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/var/lib/gpustack/cache/model_scope/jinaai/jina-embeddings-v3', speculative_config=None, tokenizer='/var/lib/gpustack/cache/model_scope/jinaai/jina-embeddings-v3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=jina-embeddings-v3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=PoolerConfig(pooling_type='MEAN', normalize=True, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 02-23 13:07:11 cuda.py:229] Using Flash Attention backend.
INFO 02-23 13:07:12 model_runner.py:1110] Starting to load model /var/lib/gpustack/cache/model_scope/jinaai/jina-embeddings-v3...
ERROR 02-23 13:07:13 engine.py:400] The Transformers implementation of XLMRobertaModel is not compatible with vLLM.
ERROR 02-23 13:07:13 engine.py:400] Traceback (most recent call last):
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine
ERROR 02-23 13:07:13 engine.py:400]     engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 124, in from_engine_args
ERROR 02-23 13:07:13 engine.py:400]     return cls(ipc_path=ipc_path,
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 76, in __init__
ERROR 02-23 13:07:13 engine.py:400]     self.engine = LLMEngine(*args, **kwargs)
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 273, in __init__
ERROR 02-23 13:07:13 engine.py:400]     self.model_executor = executor_class(vllm_config=vllm_config, )
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 02-23 13:07:13 engine.py:400]     self._init_executor()
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 02-23 13:07:13 engine.py:400]     self.collective_rpc("load_model")
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 02-23 13:07:13 engine.py:400]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/utils.py", line 2196, in run_method
ERROR 02-23 13:07:13 engine.py:400]     return func(*args, **kwargs)
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
ERROR 02-23 13:07:13 engine.py:400]     self.model_runner.load_model()
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1112, in load_model
ERROR 02-23 13:07:13 engine.py:400]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 02-23 13:07:13 engine.py:400]     return loader.load_model(vllm_config=vllm_config)
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 406, in load_model
ERROR 02-23 13:07:13 engine.py:400]     model = _initialize_model(vllm_config=vllm_config)
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 115, in _initialize_model
ERROR 02-23 13:07:13 engine.py:400]     model_class, _ = get_model_architecture(model_config)
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 106, in get_model_architecture
ERROR 02-23 13:07:13 engine.py:400]     architectures = resolve_transformers_fallback(model_config,
ERROR 02-23 13:07:13 engine.py:400]   File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 69, in resolve_transformers_fallback
ERROR 02-23 13:07:13 engine.py:400]     raise ValueError(
ERROR 02-23 13:07:13 engine.py:400] ValueError: The Transformers implementation of XLMRobertaModel is not compatible with vLLM.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 402, in run_mp_engine
    raise e
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 124, in from_engine_args
    return cls(ipc_path=ipc_path,
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 76, in __init__
    self.engine = LLMEngine(*args, **kwargs)
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 273, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
    self.collective_rpc("load_model")
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/utils.py", line 2196, in run_method
    return func(*args, **kwargs)
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1112, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
    return loader.load_model(vllm_config=vllm_config)
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 406, in load_model
    model = _initialize_model(vllm_config=vllm_config)
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 115, in _initialize_model
    model_class, _ = get_model_architecture(model_config)
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 106, in get_model_architecture
    architectures = resolve_transformers_fallback(model_config,
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 69, in resolve_transformers_fallback
    raise ValueError(
ValueError: The Transformers implementation of XLMRobertaModel is not compatible with vLLM.
[rank0]:[W223 13:07:13.126509316 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
Traceback (most recent call last):
  File "/var/lib/gpustack/bin/vllm_v0.7.3", line 8, in <module>
    sys.exit(main())
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
    args.dispatch_function(args)
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 34, in cmd
    uvloop.run(run_server(args))
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 947, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 139, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/root/.local/share/pipx/venvs/vllm-v0-7-3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 233, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
```

BenjaminX avatar Feb 23 '25 13:02 BenjaminX