Unsupported model type xlm-roberta
System Info
Deployed via Docker.
$ nvidia-smi
Thu Feb 13 23:44:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:4F:00.0 Off | 0 |
| N/A 63C P0 297W / 300W | 41431MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe Off | 00000000:52:00.0 Off | 0 |
| N/A 62C P0 155W / 300W | 41437MiB / 81920MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100 80GB PCIe Off | 00000000:56:00.0 Off | 0 |
| N/A 65C P0 165W / 300W | 41397MiB / 81920MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100 80GB PCIe Off | 00000000:57:00.0 Off | 0 |
| N/A 35C P0 48W / 300W | 14MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA L40S Off | 00000000:CE:00.0 Off | 0 |
| N/A 37C P0 95W / 350W | 2266MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA L40S Off | 00000000:D1:00.0 Off | 0 |
| N/A 36C P0 91W / 350W | 876MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA L40S Off | 00000000:D5:00.0 Off | 0 |
| N/A 38C P0 97W / 350W | 19149MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA L40S Off | 00000000:D6:00.0 Off | 0 |
| N/A 38C P0 96W / 350W | 19187MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
2025-02-13T15:38:51.005264Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-02-13T15:38:52.028582Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 10, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 743, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 198, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 788, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
return callback(**use_params)
File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
server.serve(
File "/usr/src/server/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
> File "/usr/src/server/text_generation_server/server.py", line 268, in serve_inner
model = get_model_with_lora_adapters(
File "/usr/src/server/text_generation_server/models/__init__.py", line 1542, in get_model_with_lora_adapters
model = get_model(
File "/usr/src/server/text_generation_server/models/__init__.py", line 1523, in get_model
raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type xlm-roberta
2025-02-13T15:38:53.307253Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2025-02-13 15:38:42.621 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
/usr/src/server/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd(cast_inputs=torch.float16)
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/src/server/text_generation_server/cli.py:119 in serve │
│ │
│ 116 │ │ raise RuntimeError( │
│ 117 │ │ │ "Only 1 can be set between `dtype` and `quantize`, as they │
│ 118 │ │ ) │
│ ❱ 119 │ server.serve( │
│ 120 │ │ model_id, │
│ 121 │ │ lora_adapters, │
│ 122 │ │ revision, │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ dtype = None │ │
│ │ json_output = True │ │
│ │ kv_cache_dtype = None │ │
│ │ logger_level = 'INFO' │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'BAAI/bge-m3' │ │
│ │ otlp_endpoint = None │ │
│ │ otlp_service_name = 'text-generation-inference.router' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ server = <module 'text_generation_server.server' from │ │
│ │ '/usr/src/server/text_generation_server/server.py'> │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/server.py:315 in serve │
│ │
│ 312 │ │ while signal_handler.KEEP_PROCESSING: │
│ 313 │ │ │ await asyncio.sleep(0.5) │
│ 314 │ │
│ ❱ 315 │ asyncio.run( │
│ 316 │ │ serve_inner( │
│ 317 │ │ │ model_id, │
│ 318 │ │ │ lora_adapters, │
│ │
│ ╭─────────────────────────── locals ───────────────────────────╮ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'BAAI/bge-m3' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ ╰──────────────────────────────────────────────────────────────╯ │
│ │
│ /opt/conda/lib/python3.11/asyncio/runners.py:190 in run │
│ │
│ 187 │ │ │ "asyncio.run() cannot be called from a running event loop" │
│ 188 │ │
│ 189 │ with Runner(debug=debug) as runner: │
│ ❱ 190 │ │ return runner.run(main) │
│ 191 │
│ 192 │
│ 193 def _cancel_all_tasks(loop): │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ debug = None │ │
│ │ main = <coroutine object serve.<locals>.serve_inner at 0x7f8b0b7f1480> │ │
│ │ runner = <asyncio.runners.Runner object at 0x7f8b09e03890> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /opt/conda/lib/python3.11/asyncio/runners.py:118 in run │
│ │
│ 115 │ │ │
│ 116 │ │ self._interrupt_count = 0 │
│ 117 │ │ try: │
│ ❱ 118 │ │ │ return self._loop.run_until_complete(task) │
│ 119 │ │ except exceptions.CancelledError: │
│ 120 │ │ │ if self._interrupt_count > 0: │
│ 121 │ │ │ │ uncancel = getattr(task, "uncancel", None) │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ context = <_contextvars.Context object at 0x7f8b0a25d0c0> │ │
│ │ coro = <coroutine object serve.<locals>.serve_inner at │ │
│ │ 0x7f8b0b7f1480> │ │
│ │ self = <asyncio.runners.Runner object at 0x7f8b09e03890> │ │
│ │ sigint_handler = functools.partial(<bound method Runner._on_sigint of │ │
│ │ <asyncio.runners.Runner object at 0x7f8b09e03890>>, │ │
│ │ main_task=<Task finished name='Task-1' │ │
│ │ coro=<serve.<locals>.serve_inner() done, defined at │ │
│ │ /usr/src/server/text_generation_server/server.py:244> │ │
│ │ exception=ValueError('Unsupported model type │ │
│ │ xlm-roberta')>) │ │
│ │ task = <Task finished name='Task-1' │ │
│ │ coro=<serve.<locals>.serve_inner() done, defined at │ │
│ │ /usr/src/server/text_generation_server/server.py:244> │ │
│ │ exception=ValueError('Unsupported model type │ │
│ │ xlm-roberta')> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /opt/conda/lib/python3.11/asyncio/base_events.py:654 in run_until_complete │
│ │
│ 651 │ │ if not future.done(): │
│ 652 │ │ │ raise RuntimeError('Event loop stopped before Future comp │
│ 653 │ │ │
│ ❱ 654 │ │ return future.result() │
│ 655 │ │
│ 656 │ def stop(self): │
│ 657 │ │ """Stop running the event loop. │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ future = <Task finished name='Task-1' │ │
│ │ coro=<serve.<locals>.serve_inner() done, defined at │ │
│ │ /usr/src/server/text_generation_server/server.py:244> │ │
│ │ exception=ValueError('Unsupported model type xlm-roberta')> │ │
│ │ new_task = False │ │
│ │ self = <_UnixSelectorEventLoop running=False closed=True │ │
│ │ debug=False> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/server.py:268 in serve_inner │
│ │
│ 265 │ │ │ server_urls = [local_url] │
│ 266 │ │ │
│ 267 │ │ try: │
│ ❱ 268 │ │ │ model = get_model_with_lora_adapters( │
│ 269 │ │ │ │ model_id, │
│ 270 │ │ │ │ lora_adapters, │
│ 271 │ │ │ │ revision, │
│ │
│ ╭──────────────────────────── locals ─────────────────────────────╮ │
│ │ adapter_to_index = {} │ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ local_url = 'unix:///tmp/text-generation-server-0' │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'BAAI/bge-m3' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ server_urls = ['unix:///tmp/text-generation-server-0'] │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ │ unix_socket_template = 'unix://{}-{}' │ │
│ ╰─────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/models/__init__.py:1542 in │
│ get_model_with_lora_adapters │
│ │
│ 1539 │ adapter_to_index: Dict[str, int], │
│ 1540 ): │
│ 1541 │ lora_adapter_ids = [adapter.id for adapter in lora_adapters] │
│ ❱ 1542 │ model = get_model( │
│ 1543 │ │ model_id, │
│ 1544 │ │ lora_adapter_ids, │
│ 1545 │ │ revision, │
│ │
│ ╭───────────── locals ──────────────╮ │
│ │ adapter_to_index = {} │ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ lora_adapter_ids = [] │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'BAAI/bge-m3' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ ╰───────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/models/__init__.py:1523 in get_model │
│ │
│ 1520 │ │ │ │ trust_remote_code=trust_remote_code, │
│ 1521 │ │ │ ) │
│ 1522 │ │
│ ❱ 1523 │ raise ValueError(f"Unsupported model type {model_type}") │
│ 1524 │
│ 1525 │
│ 1526 # get_model_with_lora_adapters wraps the internal get_model function │
│ │
│ ╭─────────────────────────────── locals ────────────────────────────────╮ │
│ │ _ = {} │ │
│ │ auto_map = None │ │
│ │ compressed_tensors_config = None │ │
│ │ config_dict = { │ │
│ │ │ '_name_or_path': '', │ │
│ │ │ 'architectures': [ │ │
│ │ │ │ 'XLMRobertaModel' │ │
│ │ │ ], │ │
│ │ │ 'attention_probs_dropout_prob': 0.1, │ │
│ │ │ 'bos_token_id': 0, │ │
│ │ │ 'classifier_dropout': None, │ │
│ │ │ 'eos_token_id': 2, │ │
│ │ │ 'hidden_act': 'gelu', │ │
│ │ │ 'hidden_dropout_prob': 0.1, │ │
│ │ │ 'hidden_size': 1024, │ │
│ │ │ 'initializer_range': 0.02, │ │
│ │ │ ... +15 │ │
│ │ } │ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ kv_cache_scheme = None │ │
│ │ lora_adapter_ids = [] │ │
│ │ max_input_tokens = None │ │
│ │ method = 'n-gram' │ │
│ │ model_id = 'BAAI/bge-m3' │ │
│ │ model_type = 'xlm-roberta' │ │
│ │ needs_sliding_window = False │ │
│ │ quantization_config = None │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ sliding_window = -1 │ │
│ │ speculate = 0 │ │
│ │ speculator = None │ │
│ │ trust_remote_code = False │ │
│ │ use_sliding_window = False │ │
│ ╰───────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: Unsupported model type xlm-roberta rank=0
Expected behavior
It should work: the model should load and the server should start serving requests.
model=BAAI/bge-m3
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus '"device=4"' --shm-size 64g -p 10003:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.1.0 \
    --model-id $model
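
For reference, the rejected model type can be confirmed outside of TGI with a quick config check (a minimal sketch, assuming the transformers library is installed locally); the values match the config_dict shown in the traceback locals above:

from transformers import AutoConfig

# Load only the config of BAAI/bge-m3 and inspect the fields that
# TGI's get_model() in models/__init__.py dispatches on.
config = AutoConfig.from_pretrained("BAAI/bge-m3")
print(config.model_type)     # "xlm-roberta" -> not handled, so get_model() raises ValueError
print(config.architectures)  # ["XLMRobertaModel"], as shown in the traceback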