Unsupported model type xlm-roberta
System Info
Deployed via Docker.
$ nvidia-smi
Thu Feb 13 23:44:10 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:4F:00.0 Off | 0 |
| N/A 63C P0 297W / 300W | 41431MiB / 81920MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe Off | 00000000:52:00.0 Off | 0 |
| N/A 62C P0 155W / 300W | 41437MiB / 81920MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100 80GB PCIe Off | 00000000:56:00.0 Off | 0 |
| N/A 65C P0 165W / 300W | 41397MiB / 81920MiB | 100% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100 80GB PCIe Off | 00000000:57:00.0 Off | 0 |
| N/A 35C P0 48W / 300W | 14MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA L40S Off | 00000000:CE:00.0 Off | 0 |
| N/A 37C P0 95W / 350W | 2266MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA L40S Off | 00000000:D1:00.0 Off | 0 |
| N/A 36C P0 91W / 350W | 876MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA L40S Off | 00000000:D5:00.0 Off | 0 |
| N/A 38C P0 97W / 350W | 19149MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA L40S Off | 00000000:D6:00.0 Off | 0 |
| N/A 38C P0 96W / 350W | 19187MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
2025-02-13T15:38:51.005264Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2025-02-13T15:38:52.028582Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 10, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 743, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 198, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 788, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
return callback(**use_params)
File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
server.serve(
File "/usr/src/server/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
> File "/usr/src/server/text_generation_server/server.py", line 268, in serve_inner
model = get_model_with_lora_adapters(
File "/usr/src/server/text_generation_server/models/__init__.py", line 1542, in get_model_with_lora_adapters
model = get_model(
File "/usr/src/server/text_generation_server/models/__init__.py", line 1523, in get_model
raise ValueError(f"Unsupported model type {model_type}")
ValueError: Unsupported model type xlm-roberta
2025-02-13T15:38:53.307253Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2025-02-13 15:38:42.621 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cuda
/usr/src/server/text_generation_server/layers/gptq/triton.py:242: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd(cast_inputs=torch.float16)
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/opt/conda/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/src/server/text_generation_server/cli.py:119 in serve │
│ │
│ 116 │ │ raise RuntimeError( │
│ 117 │ │ │ "Only 1 can be set between `dtype` and `quantize`, as they │
│ 118 │ │ ) │
│ ❱ 119 │ server.serve( │
│ 120 │ │ model_id, │
│ 121 │ │ lora_adapters, │
│ 122 │ │ revision, │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ dtype = None │ │
│ │ json_output = True │ │
│ │ kv_cache_dtype = None │ │
│ │ logger_level = 'INFO' │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'BAAI/bge-m3' │ │
│ │ otlp_endpoint = None │ │
│ │ otlp_service_name = 'text-generation-inference.router' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ server = <module 'text_generation_server.server' from │ │
│ │ '/usr/src/server/text_generation_server/server.py'> │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/server.py:315 in serve │
│ │
│ 312 │ │ while signal_handler.KEEP_PROCESSING: │
│ 313 │ │ │ await asyncio.sleep(0.5) │
│ 314 │ │
│ ❱ 315 │ asyncio.run( │
│ 316 │ │ serve_inner( │
│ 317 │ │ │ model_id, │
│ 318 │ │ │ lora_adapters, │
│ │
│ ╭─────────────────────────── locals ───────────────────────────╮ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'BAAI/bge-m3' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ ╰──────────────────────────────────────────────────────────────╯ │
│ │
│ /opt/conda/lib/python3.11/asyncio/runners.py:190 in run │
│ │
│ 187 │ │ │ "asyncio.run() cannot be called from a running event loop" │
│ 188 │ │
│ 189 │ with Runner(debug=debug) as runner: │
│ ❱ 190 │ │ return runner.run(main) │
│ 191 │
│ 192 │
│ 193 def _cancel_all_tasks(loop): │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ debug = None │ │
│ │ main = <coroutine object serve.<locals>.serve_inner at 0x7f8b0b7f1480> │ │
│ │ runner = <asyncio.runners.Runner object at 0x7f8b09e03890> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /opt/conda/lib/python3.11/asyncio/runners.py:118 in run │
│ │
│ 115 │ │ │
│ 116 │ │ self._interrupt_count = 0 │
│ 117 │ │ try: │
│ ❱ 118 │ │ │ return self._loop.run_until_complete(task) │
│ 119 │ │ except exceptions.CancelledError: │
│ 120 │ │ │ if self._interrupt_count > 0: │
│ 121 │ │ │ │ uncancel = getattr(task, "uncancel", None) │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ context = <_contextvars.Context object at 0x7f8b0a25d0c0> │ │
│ │ coro = <coroutine object serve.<locals>.serve_inner at │ │
│ │ 0x7f8b0b7f1480> │ │
│ │ self = <asyncio.runners.Runner object at 0x7f8b09e03890> │ │
│ │ sigint_handler = functools.partial(<bound method Runner._on_sigint of │ │
│ │ <asyncio.runners.Runner object at 0x7f8b09e03890>>, │ │
│ │ main_task=<Task finished name='Task-1' │ │
│ │ coro=<serve.<locals>.serve_inner() done, defined at │ │
│ │ /usr/src/server/text_generation_server/server.py:244> │ │
│ │ exception=ValueError('Unsupported model type │ │
│ │ xlm-roberta')>) │ │
│ │ task = <Task finished name='Task-1' │ │
│ │ coro=<serve.<locals>.serve_inner() done, defined at │ │
│ │ /usr/src/server/text_generation_server/server.py:244> │ │
│ │ exception=ValueError('Unsupported model type │ │
│ │ xlm-roberta')> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /opt/conda/lib/python3.11/asyncio/base_events.py:654 in run_until_complete │
│ │
│ 651 │ │ if not future.done(): │
│ 652 │ │ │ raise RuntimeError('Event loop stopped before Future comp │
│ 653 │ │ │
│ ❱ 654 │ │ return future.result() │
│ 655 │ │
│ 656 │ def stop(self): │
│ 657 │ │ """Stop running the event loop. │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ future = <Task finished name='Task-1' │ │
│ │ coro=<serve.<locals>.serve_inner() done, defined at │ │
│ │ /usr/src/server/text_generation_server/server.py:244> │ │
│ │ exception=ValueError('Unsupported model type xlm-roberta')> │ │
│ │ new_task = False │ │
│ │ self = <_UnixSelectorEventLoop running=False closed=True │ │
│ │ debug=False> │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/server.py:268 in serve_inner │
│ │
│ 265 │ │ │ server_urls = [local_url] │
│ 266 │ │ │
│ 267 │ │ try: │
│ ❱ 268 │ │ │ model = get_model_with_lora_adapters( │
│ 269 │ │ │ │ model_id, │
│ 270 │ │ │ │ lora_adapters, │
│ 271 │ │ │ │ revision, │
│ │
│ ╭──────────────────────────── locals ─────────────────────────────╮ │
│ │ adapter_to_index = {} │ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ local_url = 'unix:///tmp/text-generation-server-0' │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'BAAI/bge-m3' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ server_urls = ['unix:///tmp/text-generation-server-0'] │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ │ unix_socket_template = 'unix://{}-{}' │ │
│ ╰─────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/models/__init__.py:1542 in │
│ get_model_with_lora_adapters │
│ │
│ 1539 │ adapter_to_index: Dict[str, int], │
│ 1540 ): │
│ 1541 │ lora_adapter_ids = [adapter.id for adapter in lora_adapters] │
│ ❱ 1542 │ model = get_model( │
│ 1543 │ │ model_id, │
│ 1544 │ │ lora_adapter_ids, │
│ 1545 │ │ revision, │
│ │
│ ╭───────────── locals ──────────────╮ │
│ │ adapter_to_index = {} │ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ lora_adapter_ids = [] │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'BAAI/bge-m3' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ ╰───────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/models/__init__.py:1523 in get_model │
│ │
│ 1520 │ │ │ │ trust_remote_code=trust_remote_code, │
│ 1521 │ │ │ ) │
│ 1522 │ │
│ ❱ 1523 │ raise ValueError(f"Unsupported model type {model_type}") │
│ 1524 │
│ 1525 │
│ 1526 # get_model_with_lora_adapters wraps the internal get_model function │
│ │
│ ╭─────────────────────────────── locals ────────────────────────────────╮ │
│ │ _ = {} │ │
│ │ auto_map = None │ │
│ │ compressed_tensors_config = None │ │
│ │ config_dict = { │ │
│ │ │ '_name_or_path': '', │ │
│ │ │ 'architectures': [ │ │
│ │ │ │ 'XLMRobertaModel' │ │
│ │ │ ], │ │
│ │ │ 'attention_probs_dropout_prob': 0.1, │ │
│ │ │ 'bos_token_id': 0, │ │
│ │ │ 'classifier_dropout': None, │ │
│ │ │ 'eos_token_id': 2, │ │
│ │ │ 'hidden_act': 'gelu', │ │
│ │ │ 'hidden_dropout_prob': 0.1, │ │
│ │ │ 'hidden_size': 1024, │ │
│ │ │ 'initializer_range': 0.02, │ │
│ │ │ ... +15 │ │
│ │ } │ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ kv_cache_scheme = None │ │
│ │ lora_adapter_ids = [] │ │
│ │ max_input_tokens = None │ │
│ │ method = 'n-gram' │ │
│ │ model_id = 'BAAI/bge-m3' │ │
│ │ model_type = 'xlm-roberta' │ │
│ │ needs_sliding_window = False │ │
│ │ quantization_config = None │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ sliding_window = -1 │ │
│ │ speculate = 0 │ │
│ │ speculator = None │ │
│ │ trust_remote_code = False │ │
│ │ use_sliding_window = False │ │
│ ╰───────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: Unsupported model type xlm-roberta rank=0
Expected behavior
It should work: the model should load and the server should start serving requests.
model=BAAI/bge-m3
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus '"device=4"' --shm-size 64g -p 10003:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.1.0 \
    --model-id $model
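
For reference, the rejected model type can be confirmed outside of TGI with a quick config check (a minimal sketch, assuming the transformers library is installed locally); the values match the config_dict shown in the traceback locals above:

from transformers import AutoConfig

# Load only the config of BAAI/bge-m3 and inspect the fields that
# TGI's get_model() in models/__init__.py dispatches on.
config = AutoConfig.from_pretrained("BAAI/bge-m3")
print(config.model_type)     # "xlm-roberta" -> not handled, so get_model() raises ValueError
print(config.architectures)  # ["XLMRobertaModel"], as shown in the traceback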