text-generation-inference
How to import chatglm model
System Info
text-generation-inference: v0.7.0
Python: 3.9
Operating System: Ubuntu 18.04
When loading the chatglm model with the command:
docker run --gpus '"device=3"' --shm-size 1g -p 8083:80 -v /data/llm:/data ghcr.io/huggingface/text-generation-inference:latest --model-id /data/chatglm-6b --num-shard 1 --max-total-tokens 2048 --max-concurrent-requests 5 --trust-remote-code
the server fails to start.
The logs are as below:
2023-05-24T06:00:03.978404Z INFO text_generation_launcher: Args { model_id: "/data/chatglm-6b", revision: None, sharded: None, num_shard: Some(1), quantize: None, trust_remote_code: true, max_concurrent_requests: 5, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 2048, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }
2023-05-24T06:00:03.978503Z INFO text_generation_launcher: Starting download process.
2023-05-24T06:00:05.441414Z INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-05-24T06:00:05.781468Z INFO text_generation_launcher: Successfully downloaded weights.
2023-05-24T06:00:05.781509Z WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model /data/chatglm-6b do not contain malicious code.
2023-05-24T06:00:05.781529Z WARN text_generation_launcher: Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
2023-05-24T06:00:05.782032Z INFO text_generation_launcher: Starting shard 0
2023-05-24T06:00:08.251133Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 126, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 137, in get_model
    config = AutoConfig.from_pretrained(
  File "/usr/src/transformers/src/transformers/models/auto/configuration_auto.py", line 925, in from_pretrained
    raise ValueError(
ValueError: Loading /data/chatglm-6b requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error. rank=0
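For context, this ValueError is raised by transformers' auto classes: a repo that ships its own configuration/modelling code is only loaded when trust_remote_code=True reaches that call. The launcher Args above show the flag set even though the server-side AutoConfig call still refuses, which suggests the flag is not forwarded to that call in v0.7.0. A minimal sketch of the call that has to succeed, assuming the weights sit under /data/chatglm-6b as in the command above:

```python
from transformers import AutoConfig

# chatglm-6b ships custom configuration/modelling code, so transformers
# refuses to load it unless trust_remote_code=True is passed explicitly.
config = AutoConfig.from_pretrained(
    "/data/chatglm-6b",       # local path assumed from the issue
    trust_remote_code=True,   # without this, the ValueError above is raised
)
print(type(config))
```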
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- download the chatglm-6b model weights
- docker run --gpus '"device=3"' --shm-size 1g -p 8083:80 -v /data/llm:/data ghcr.io/huggingface/text-generation-inference:latest --model-id /data/chatglm-6b --num-shard 1 --max-total-tokens 2048 --max-concurrent-requests 5 --trust-remote-code
Expected behavior
The chatglm model loads successfully.
chatglm-6b is not supported at the moment as it requires additional python dependencies.
Can you tell me how to add support for it? @OlivierDehaene
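As a rough illustration (not an official recipe) of why chatglm needs special handling: the repo relies on custom tokenizer/model classes plus extra Python dependencies that the stock TGI image does not ship, so plain transformers only loads it with trust_remote_code=True. The local path below is assumed from the issue:

```python
from transformers import AutoModel, AutoTokenizer

# Both calls need trust_remote_code=True because the tokenizer and model
# classes live in the chatglm repo itself, not in transformers.
tokenizer = AutoTokenizer.from_pretrained("/data/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("/data/chatglm-6b", trust_remote_code=True).half().cuda()

# chat() is a helper defined by the remote chatglm code, not by transformers.
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```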
> chatglm-6b is not supported at the moment as it requires additional python dependencies.
Is there any way to deploy chatglm through TGI?
> chatglm-6b is not supported at the moment as it requires additional python dependencies.
>
> Is there any way to deploy chatglm through TGI?
The 0.9.1 docker image can run chatglm2-6b, with the --trust-remote-code arg.
Closing this issue then! Thanks for sharing @zTaoplus
> chatglm-6b is not supported at the moment as it requires additional python dependencies.
>
> Is there any way to deploy chatglm through TGI?
>
> The 0.9.1 docker image can run chatglm2-6b, with the --trust-remote-code arg.
I can't reproduce it. Can you share more details?
> chatglm-6b is not supported at the moment as it requires additional python dependencies.
>
> Is there any way to deploy chatglm through TGI?
>
> The 0.9.1 docker image can run chatglm2-6b, with the --trust-remote-code arg.
>
> I can't reproduce it. Can you share more details?
I have downloaded the chatglm2-6b model weights to the local /data/chatglm directory, and the TGI running parameters are as follows:
--model-id /data/chatglm --max-input-length 4096 --max-total-tokens 12888 --trust-remote-code
I'm not sure if this is a Docker image issue. I can't pull images from ghcr.io in my cluster, so I actually used registry.cn-hangzhou.aliyuncs.com/zt_gcr/hf-infer:v0.9.1, which is built from here; it's just so that I can pull it.
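Not part of the original report, but a quick way to sanity-check such a deployment is a single request against TGI's /generate endpoint, assuming the container is published on localhost:8083 as in the command at the top of the issue:

```python
import requests

# One-off smoke test against a running text-generation-inference server.
resp = requests.post(
    "http://localhost:8083/generate",   # host/port assumed from the docker command above
    json={"inputs": "你好", "parameters": {"max_new_tokens": 32}},
    timeout=60,
)
print(resp.status_code, resp.json())
```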
Thank you, chatglm2-6b works in AutoModelForCausalLM mode and the batch size should be 1. When batch > 1, I got the error below:
{ "error": "Request failed during generation: Server error: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 24, 24, 128]. Tensor sizes: [24, 2, 128]", "error_type": "generation" }
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 98, in Decode
    batch = self.model.batch_type.concatenate(batches)
  File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py", line 392, in concatenate
    padded_past_keys[
RuntimeError: The expanded size of the tensor (6) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 6, 6, 128]. Tensor sizes: [6, 2, 128]
> Thank you, chatglm2-6b works in AutoModelForCausalLM mode and the batch size should be 1. When batch > 1, I got the error below:
>
> { "error": "Request failed during generation: Server error: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 24, 24, 128]. Tensor sizes: [24, 2, 128]", "error_type": "generation" }
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor return await behavior(request_or_iterator, context) File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 98, in Decode batch = self.model.batch_type.concatenate(batches) File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner return func(*args, **kwds) File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py", line 392, in concatenate padded_past_keys[ RuntimeError: The expanded size of the tensor (6) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 6, 6, 128]. Tensor sizes: [6, 2, 128]
I got a similar error, as below:
infer:send_error: text_generation_router::infer: router/src/infer.rs:554: Request failed during generation: Server error: The expanded size of the tensor (273) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 273, 273, 128]. Tensor sizes: [273, 2, 128]
> Thank you, chatglm2-6b works in AutoModelForCausalLM mode and the batch size should be 1. When batch > 1, I got the error below:
>
> { "error": "Request failed during generation: Server error: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 24, 24, 128]. Tensor sizes: [24, 2, 128]", "error_type": "generation" }
> I got a similar error, as below: infer:send_error: text_generation_router::infer: router/src/infer.rs:554: Request failed during generation: Server error: The expanded size of the tensor (273) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 273, 273, 128]. Tensor sizes: [273, 2, 128]
I also got a similar error. Do you know how to fix it?
> Thank you, chatglm2-6b works in AutoModelForCausalLM mode and the batch size should be 1. When batch > 1, I got the error below:
>
> { "error": "Request failed during generation: Server error: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 24, 24, 128]. Tensor sizes: [24, 2, 128]", "error_type": "generation" }
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor return await behavior(request_or_iterator, context) File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 98, in Decode batch = self.model.batch_type.concatenate(batches) File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner return func(*args, **kwds) File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py", line 392, in concatenate padded_past_keys[ RuntimeError: The expanded size of the tensor (6) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 6, 6, 128]. Tensor sizes: [6, 2, 128]
How do I set the batch size to 1?
--max-concurrent-requests 1
> --max-concurrent-requests 1
Thanks, I will try it.
If that works, it's likely to kill throughput... Batching is how we get throughput.