text-generation-inference
"TypeError: Descriptors cannot not be created directly"
System Info
I'm on an Ubuntu server from https://console.paperspace.com/ with this GPU:
| NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro M4000 Off | 00000000:00:05.0 On | N/A |
| 46% 30C P8 14W / 120W | 1840MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
I followed the tutorial, except that I'm running the model "lmsys/vicuna-13b-delta-v0" and had to add some arguments to work around other bugs. This is the command I ran:
sudo docker run --net=host --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data -e HF_HUB_ENABLE_HF_TRANSFER=1 ghcr.io/huggingface/text-generation-inference:0.8 --model-id lmsys/vicuna-13b-delta-v0 --num-shard 1 --env --disable-custom-kernels
I got 1 warning and 2 errors. The full output is below:
2023-06-13T10:43:25.178238Z WARN shard-manager: text_generation_launcher: Could not import Flash Attention enabled models
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 58, in serve
from text_generation_server import server
File "<frozen importlib._bootstrap>", line 1058, in _handle_fromlist
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 12, in <module>
from text_generation_server.cache import Cache
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cache.py", line 3, in <module>
from text_generation_server.models.types import Batch
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 29, in <module>
raise ImportError(
ImportError: GPU with CUDA capability 5 2 is not supported
rank=0
2023-06-13T10:43:27.220757Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 246, in get_model
return llama_cls(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py", line 469, in __init__
tokenizer = AutoTokenizer.from_pretrained(
File "/usr/src/transformers/src/transformers/models/auto/tokenization_auto.py", line 692, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/usr/src/transformers/src/transformers/tokenization_utils_base.py", line 1812, in from_pretrained
return cls._from_pretrained(
File "/usr/src/transformers/src/transformers/tokenization_utils_base.py", line 1975, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/usr/src/transformers/src/transformers/models/llama/tokenization_llama_fast.py", line 89, in __init__
super().__init__(
File "/usr/src/transformers/src/transformers/tokenization_utils_fast.py", line 114, in __init__
fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
File "/usr/src/transformers/src/transformers/convert_slow_tokenizer.py", line 1303, in convert_slow_tokenizer
return converter_class(transformer_tokenizer).converted()
File "/usr/src/transformers/src/transformers/convert_slow_tokenizer.py", line 445, in __init__
from .utils import sentencepiece_model_pb2 as model_pb2
File "/usr/src/transformers/src/transformers/utils/sentencepiece_model_pb2.py", line 91, in <module>
_descriptor.EnumValueDescriptor(
File "/opt/conda/lib/python3.9/site-packages/google/protobuf/descriptor.py", line 796, in __new__
_message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
rank=0
Error: ShardCannotStart
2023-06-13T10:43:28.049583Z ERROR text_generation_launcher: Shard 0 failed to start:
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 246, in get_model
return llama_cls(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py", line 469, in __init__
tokenizer = AutoTokenizer.from_pretrained(
File "/usr/src/transformers/src/transformers/models/auto/tokenization_auto.py", line 692, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/usr/src/transformers/src/transformers/tokenization_utils_base.py", line 1812, in from_pretrained
return cls._from_pretrained(
File "/usr/src/transformers/src/transformers/tokenization_utils_base.py", line 1975, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/usr/src/transformers/src/transformers/models/llama/tokenization_llama_fast.py", line 89, in __init__
super().__init__(
File "/usr/src/transformers/src/transformers/tokenization_utils_fast.py", line 114, in __init__
fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
File "/usr/src/transformers/src/transformers/convert_slow_tokenizer.py", line 1303, in convert_slow_tokenizer
return converter_class(transformer_tokenizer).converted()
File "/usr/src/transformers/src/transformers/convert_slow_tokenizer.py", line 445, in __init__
from .utils import sentencepiece_model_pb2 as model_pb2
File "/usr/src/transformers/src/transformers/utils/sentencepiece_model_pb2.py", line 91, in <module>
_descriptor.EnumValueDescriptor(
File "/opt/conda/lib/python3.9/site-packages/google/protobuf/descriptor.py", line 796, in __new__
_message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
1. Start a recommended Ubuntu server on https://console.paperspace.com/ with one Quadro M4000 GPU.
2. Clone the repo.
3. Execute:
sudo docker run --net=host --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data -e HF_HUB_ENABLE_HF_TRANSFER=1 ghcr.io/huggingface/text-generation-inference:0.8 --model-id lmsys/vicuna-13b-delta-v0 --num-shard 1 --env --disable-custom-kernels
Expected behavior
The Vicuna model deploys without errors.
The error comes from the protobuf version. The model you linked doesn't ship a fast tokenizer (which text-generation-inference needs for additional checks), so the server tries to convert the slow tokenizer into a fast one, and that conversion fails because sentencepiece is still not protobuf==4.20 enabled.
You can:
- Downgrade protobuf, then run
tokenizer = AutoTokenizer.from_pretrained("..."); tokenizer.save_pretrained("./local")
to get the correct tokenizer, and then everything should work.
- Use a different model that already has the correct tokenizer.
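The error message in the log recommends protobuf 3.20.x or lower. As a minimal sketch (not part of text-generation-inference), the following checks whether the protobuf installed in the current Python environment is newer than that. The function names are hypothetical, and it assumes version strings of the form major.minor[.patch].

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '4.23.2'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def protobuf_needs_downgrade(version: str) -> bool:
    """True if the version is newer than the 3.20.x line named in the error message."""
    return parse_major_minor(version) > (3, 20)


def installed_protobuf_version() -> Optional[str]:
    """Return the installed protobuf version, or None if protobuf is not installed."""
    try:
        return metadata.version("protobuf")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_protobuf_version()
    if found is None:
        print("protobuf is not installed in this environment")
    elif protobuf_needs_downgrade(found):
        print(f"protobuf {found} is newer than 3.20.x; the workarounds from the error apply")
    else:
        print(f"protobuf {found} is 3.20.x or lower")
```

This has to run with the same interpreter that starts the server (inside the container, in the Docker case), since that is the environment whose protobuf matters.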
Note: The GPU you're using doesn't support flash attention, so please bear in mind that you won't benefit from all the features of this repo.
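The log reports this GPU as "CUDA capability 5 2", and the bitsandbytes warning mentions a 7.5 cutoff. A small sketch of comparing a capability against a minimum follows. Using 7.5 as the threshold is an assumption taken from that warning, not a confirmed value for the flash attention check, and the function names are hypothetical.

```python
from typing import Tuple

# Assumption: 7.5 is the cutoff quoted in the bitsandbytes warning from the log;
# the flash attention check may use a different rule.
MIN_CAPABILITY: Tuple[int, int] = (7, 5)


def parse_capability(text: str) -> Tuple[int, int]:
    """Parse a capability written as '5.2' or '5 2' into (5, 2)."""
    major, minor = text.replace(".", " ").split()
    return int(major), int(minor)


def meets_minimum(capability: Tuple[int, int],
                  minimum: Tuple[int, int] = MIN_CAPABILITY) -> bool:
    """Tuple comparison orders by major version first, then minor."""
    return capability >= minimum


if __name__ == "__main__":
    cap = parse_capability("5 2")  # the value printed in the ImportError above
    print(cap, "meets minimum" if meets_minimum(cap) else "is below minimum")
```

On a machine with PyTorch installed, torch.cuda.get_device_capability() returns the same kind of (major, minor) tuple and could be passed to meets_minimum directly.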
Thanks @Narsil for your answer, but I have 3 questions:
- How can I downgrade protobuf? Where do I have to put those lines of code in the repo?
- How can I know in advance whether a model has the correct tokenizer?
Best,
- Run pip install protobuf==3.19
- Check for tokenizer.json in the repo; that's the file used by fast tokenizers. Usually we can create a fast tokenizer from a slow one, but that requires protobuf and is also excruciatingly slow (there's an O(n²) search to recreate state that the original sentencepiece model is lacking).
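For a model that has already been downloaded or saved locally, the tokenizer.json check can be sketched as below. The helper name has_fast_tokenizer is hypothetical, and the demo uses a temporary directory as a stand-in for a real model folder.

```python
import tempfile
from pathlib import Path


def has_fast_tokenizer(model_dir: str) -> bool:
    """Return True if the directory contains a tokenizer.json file."""
    return (Path(model_dir) / "tokenizer.json").is_file()


if __name__ == "__main__":
    # Stand-in for a real model folder: empty at first, then given a tokenizer.json.
    with tempfile.TemporaryDirectory() as model_dir:
        print(has_fast_tokenizer(model_dir))  # no tokenizer.json yet
        (Path(model_dir) / "tokenizer.json").write_text("{}")
        print(has_fast_tokenizer(model_dir))  # tokenizer.json now present
```

This only tells you that the file exists, not that its contents are valid. For a model that is still on the Hub, looking at the repo's file list gives the same answer.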