text-generation-inference
How to import chatglm model
System Info
text-generation-inference: v0.7.0
Python: 3.9
Operating System: Ubuntu 18.04
When loading the chatglm model with the command:
docker run --gpus '"device=3"' --shm-size 1g -p 8083:80 -v /data/llm:/data ghcr.io/huggingface/text-generation-inference:latest --model-id /data/chatglm-6b --num-shard 1 --max-total-tokens 2048 --max-concurrent-requests 5 --trust-remote-code
the server fails to start.
The logs are as below:
2023-05-24T06:00:03.978404Z INFO text_generation_launcher: Args { model_id: "/data/chatglm-6b", revision: None, sharded: None, num_shard: Some(1), quantize: None, trust_remote_code: true, max_concurrent_requests: 5, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 2048, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }
2023-05-24T06:00:03.978503Z INFO text_generation_launcher: Starting download process.
2023-05-24T06:00:05.441414Z INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-05-24T06:00:05.781468Z INFO text_generation_launcher: Successfully downloaded weights.
2023-05-24T06:00:05.781509Z WARN text_generation_launcher: `trust_remote_code` is set. Trusting that model /data/chatglm-6b do not contain malicious code.
2023-05-24T06:00:05.781529Z WARN text_generation_launcher: Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
2023-05-24T06:00:05.782032Z INFO text_generation_launcher: Starting shard 0
2023-05-24T06:00:08.251133Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 126, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 137, in get_model
    config = AutoConfig.from_pretrained(
  File "/usr/src/transformers/src/transformers/models/auto/configuration_auto.py", line 925, in from_pretrained
    raise ValueError(
ValueError: Loading /data/chatglm-6b requires you to execute the configuration file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option `trust_remote_code=True` to remove this error. rank=0
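For context, this ValueError is raised by transformers' auto classes: a repo that ships its own configuration/modelling code is only loaded when trust_remote_code=True reaches that call. The launcher Args above show the flag set even though the server-side AutoConfig call still refuses, which suggests the flag is not forwarded to that call in v0.7.0. A minimal sketch of the call that has to succeed, assuming the weights sit under /data/chatglm-6b as in the command above:

```python
from transformers import AutoConfig

# chatglm-6b ships custom configuration/modelling code, so transformers
# refuses to load it unless trust_remote_code=True is passed explicitly.
config = AutoConfig.from_pretrained(
    "/data/chatglm-6b",       # local path assumed from the issue
    trust_remote_code=True,   # without this, the ValueError above is raised
)
print(type(config))
```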
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- download the chatglm-6b model weights
- docker run --gpus '"device=3"' --shm-size 1g -p 8083:80 -v /data/llm:/data ghcr.io/huggingface/text-generation-inference:latest --model-id /data/chatglm-6b --num-shard 1 --max-total-tokens 2048 --max-concurrent-requests 5 --trust-remote-code
Expected behavior
The chatglm model loads successfully.
chatglm-6b is not supported at the moment as it requires additional python dependencies.
Can you tell me how to add support for it? @OlivierDehaene
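As a rough illustration (not an official recipe) of why chatglm needs special handling: the repo relies on custom tokenizer/model classes plus extra Python dependencies that the stock TGI image does not ship, so plain transformers only loads it with trust_remote_code=True. The local path below is assumed from the issue:

```python
from transformers import AutoModel, AutoTokenizer

# Both calls need trust_remote_code=True because the tokenizer and model
# classes live in the chatglm repo itself, not in transformers.
tokenizer = AutoTokenizer.from_pretrained("/data/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("/data/chatglm-6b", trust_remote_code=True).half().cuda()

# chat() is a helper defined by the remote chatglm code, not by transformers.
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
```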
> chatglm-6b is not supported at the moment as it requires additional python dependencies.
Is there any way to deploy chatglm through TGI?
> chatglm-6b is not supported at the moment as it requires additional python dependencies.
>
> Is there any way to deploy chatglm through TGI?
The 0.9.1 docker image can run chatglm2-6b, with the --trust-remote-code arg.
Closing this issue then! Thanks for sharing @zTaoplus
> chatglm-6b is not supported at the moment as it requires additional python dependencies.
>
> Is there any way to deploy chatglm through TGI?
>
> The 0.9.1 docker image can run chatglm2-6b, with the --trust-remote-code arg.
I can't reproduce it. Can you share more details?
> chatglm-6b is not supported at the moment as it requires additional python dependencies.
>
> Is there any way to deploy chatglm through TGI?
>
> The 0.9.1 docker image can run chatglm2-6b, with the --trust-remote-code arg.
>
> I can't reproduce it. Can you share more details?
I have downloaded the chatglm2-6b model weights to the local /data/chatglm directory, and the TGI running parameters are as follows:
--model-id /data/chatglm --max-input-length 4096 --max-total-tokens 12888 --trust-remote-code
I'm not sure if this is a Docker image issue. I can't pull images from ghcr.io in my cluster, so I actually used registry.cn-hangzhou.aliyuncs.com/zt_gcr/hf-infer:v0.9.1, which is built from here; it's just so that I can pull it.
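Not part of the original report, but a quick way to sanity-check such a deployment is a single request against TGI's /generate endpoint, assuming the container is published on localhost:8083 as in the command at the top of the issue:

```python
import requests

# One-off smoke test against a running text-generation-inference server.
resp = requests.post(
    "http://localhost:8083/generate",   # host/port assumed from the docker command above
    json={"inputs": "你好", "parameters": {"max_new_tokens": 32}},
    timeout=60,
)
print(resp.status_code, resp.json())
```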
Thank you, chatglm2-6b works in AutoModelForCausalLM mode and the batch size should be 1. When batch > 1, I got the error below:
{ "error": "Request failed during generation: Server error: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 24, 24, 128]. Tensor sizes: [24, 2, 128]", "error_type": "generation" }
  File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 98, in Decode
    batch = self.model.batch_type.concatenate(batches)
  File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py", line 392, in concatenate
    padded_past_keys[
RuntimeError: The expanded size of the tensor (6) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 6, 6, 128]. Tensor sizes: [6, 2, 128]
> Thank you, chatglm2-6b works in AutoModelForCausalLM mode and the batch size should be 1. When batch > 1, I got the error below:
>
> { "error": "Request failed during generation: Server error: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 24, 24, 128]. Tensor sizes: [24, 2, 128]", "error_type": "generation" }
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor return await behavior(request_or_iterator, context) File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 98, in Decode batch = self.model.batch_type.concatenate(batches) File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner return func(*args, **kwds) File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py", line 392, in concatenate padded_past_keys[ RuntimeError: The expanded size of the tensor (6) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 6, 6, 128]. Tensor sizes: [6, 2, 128]
I got a similar error, as below:
infer:send_error: text_generation_router::infer: router/src/infer.rs:554: Request failed during generation: Server error: The expanded size of the tensor (273) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 273, 273, 128]. Tensor sizes: [273, 2, 128]
> Thank you, chatglm2-6b works in AutoModelForCausalLM mode and the batch size should be 1. When batch > 1, I got the error below:
>
> { "error": "Request failed during generation: Server error: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 24, 24, 128]. Tensor sizes: [24, 2, 128]", "error_type": "generation" }
> I got a similar error, as below: infer:send_error: text_generation_router::infer: router/src/infer.rs:554: Request failed during generation: Server error: The expanded size of the tensor (273) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 273, 273, 128]. Tensor sizes: [273, 2, 128]
I also got a similar error. Do you know how to fix it?
> Thank you, chatglm2-6b works in AutoModelForCausalLM mode and the batch size should be 1. When batch > 1, I got the error below:
>
> { "error": "Request failed during generation: Server error: The expanded size of the tensor (24) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 24, 24, 128]. Tensor sizes: [24, 2, 128]", "error_type": "generation" }
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor return await behavior(request_or_iterator, context) File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 98, in Decode batch = self.model.batch_type.concatenate(batches) File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner return func(*args, **kwds) File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py", line 392, in concatenate padded_past_keys[ RuntimeError: The expanded size of the tensor (6) must match the existing size (2) at non-singleton dimension 2. Target sizes: [1, 6, 6, 128]. Tensor sizes: [6, 2, 128]
How do I set the batch size to 1?
--max-concurrent-requests 1
> --max-concurrent-requests 1
Thanks, I will try it.
If that works, it's likely to kill throughput... Batching is how we get throughput.