text-generation-inference
Error when deploying inference server with starcoder-gptq
System Info
I tried to quantize starcoder with the script this repo provides, then deploy it with text-generation-launcher. I get this error when warming up the model: Not enough memory to handle 16000 total tokens with 4096 prefill tokens... The traceback ends at line 21, in matmul_248_kernel, with KeyError: ('2-.-0-.-0--dxxxxxxxxxxxxxxxxxxxx', (torch.float16, torch.int32))
Decreasing max-prefill-tokens didn't solve the problem.
Hardware: A800
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
1. Quantize starcoder with the provided script
2. Deploy with text-generation-launcher
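Roughly, the two steps look like this; the paths are placeholders and the exact quantize entry point may differ depending on your TGI version, so treat it as a sketch:
# 1. GPTQ-quantize starcoder (output directory is a placeholder)
text-generation-server quantize bigcode/starcoder /data/starcoder-gptq
# 2. Serve the quantized weights
text-generation-launcher --model-id /data/starcoder-gptq --quantize gptq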
Expected behavior
The server should deploy successfully.
Try disabling flash attention; I don't think the A800 is supported by it. Setting
USE_FLASH_ATTENTION=false
in your environment should do it.
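For example, when running the launcher directly (the model path here is just a placeholder), something like:
export USE_FLASH_ATTENTION=false
text-generation-launcher --model-id /data/starcoder-gptq --quantize gptq
If you're going through the Docker image instead, passing -e USE_FLASH_ATTENTION=false to docker run should have the same effect.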
I think it's supported. It's fine and fast when using the model in float16, but it didn't work when using GPTQ.
Ah, can you maybe try another model to see if it's GPTQ + Triton that doesn't work on the A800? (I don't have access right now to reproduce.)
I am getting the same error with huggingface/falcon-40b-gptq on 2x A100 80GB in GKE. I am able to load the falcon-40b model using bitsandbytes quantization without issue, though.
I tried with these arguments:
--model-id=huggingface/falcon-40b-gptq
--quantize=gptq
--num-shard=1
--max-input-length=10
--max-total-tokens=20
--max-batch-total-tokens=20
--max-batch-prefill-tokens=10
--trust-remote-code
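For reference, the GKE pod runs the same container image; a local docker run equivalent would look roughly like this (image tag, port, and volume mount are placeholders on my side, not the exact deployment spec):
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id=huggingface/falcon-40b-gptq --quantize=gptq --num-shard=1 \
  --max-input-length=10 --max-total-tokens=20 --max-batch-total-tokens=20 \
  --max-batch-prefill-tokens=10 --trust-remote-code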
Error:
2023-07-17T21:49:55.432568Z ERROR shard-manager: text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "<string>", line 21, in matmul_248_kernel
KeyError: ('2-.-0-.-0--d6252949da17ceb5f3a278a70250af13-1af5134066c618146d2cd009138944a0-235b7327308f95b8bc50ee7abd94d0ab-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.int32, torch.float16, torch.float16, torch.int32, torch.int32, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (16, 256, 32, 8), (True, True, True, True, True, True, (False, False), (True, False), (True, False), (False, False), (False, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (True, False), (True, False)))
...
2023-07-17T21:49:55.433185Z ERROR warmup{max_input_length=10 max_prefill_tokens=10 max_total_tokens=20}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 20 total tokens with 10 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`
Error: Warmup(Generation("Not enough memory to handle 20 total tokens with 10 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`"))
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.