text-generation-inference
Error when deploying inference server with starcoder-gptq
System Info
I tried to quantize starcoder with the script this repo provides, then deploy it with text-generation-launcher. I get this error when warming up the model: Not enough memory to handle 16000 total tokens with 4096 prefill tokens... The traceback ends at line 21, in matmul_248_kernel, with KeyError: ('2-.-0-.-0--dxxxxxxxxxxxxxxxxxxxx', (torch.float16, torch.int32))
Decreasing max-prefill-tokens didn't solve the problem.
Hardware: A800
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
1. Quantize starcoder with the provided script
2. Deploy with text-generation-launcher
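Roughly, the two steps look like this; the paths are placeholders and the exact quantize entry point may differ depending on your TGI version, so treat it as a sketch:
# 1. GPTQ-quantize starcoder (output directory is a placeholder)
text-generation-server quantize bigcode/starcoder /data/starcoder-gptq
# 2. Serve the quantized weights
text-generation-launcher --model-id /data/starcoder-gptq --quantize gptq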
Expected behavior
The server should deploy successfully.
Try disabling flash attention; I don't think the A800 is supported by it. Setting
USE_FLASH_ATTENTION=false
in your environment should do it.
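For example, when running the launcher directly (the model path here is just a placeholder), something like:
export USE_FLASH_ATTENTION=false
text-generation-launcher --model-id /data/starcoder-gptq --quantize gptq
If you're going through the Docker image instead, passing -e USE_FLASH_ATTENTION=false to docker run should have the same effect.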
I think it's supported. It's fine and fast when using the model in float16, but it didn't work when using GPTQ.
Ah, can you maybe try another model to see if it's GPTQ + Triton that doesn't work on the A800? (I don't have access right now to reproduce.)
I am getting the same error with huggingface/falcon-40b-gptq on 2x A100 80GB in GKE. I am able to load the falcon-40b model using bitsandbytes quantization without issue, though.
I tried with these arguments:
--model-id=huggingface/falcon-40b-gptq
--quantize=gptq
--num-shard=1
--max-input-length=10
--max-total-tokens=20
--max-batch-total-tokens=20
--max-batch-prefill-tokens=10
--trust-remote-code
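For reference, the GKE pod runs the same container image; a local docker run equivalent would look roughly like this (image tag, port, and volume mount are placeholders on my side, not the exact deployment spec):
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id=huggingface/falcon-40b-gptq --quantize=gptq --num-shard=1 \
  --max-input-length=10 --max-total-tokens=20 --max-batch-total-tokens=20 \
  --max-batch-prefill-tokens=10 --trust-remote-code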
Error:
2023-07-17T21:49:55.432568Z ERROR shard-manager: text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "<string>", line 21, in matmul_248_kernel
KeyError: ('2-.-0-.-0--d6252949da17ceb5f3a278a70250af13-1af5134066c618146d2cd009138944a0-235b7327308f95b8bc50ee7abd94d0ab-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.int32, torch.float16, torch.float16, torch.int32, torch.int32, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (16, 256, 32, 8), (True, True, True, True, True, True, (False, False), (True, False), (True, False), (False, False), (False, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (True, False), (True, False)))
...
2023-07-17T21:49:55.433185Z ERROR warmup{max_input_length=10 max_prefill_tokens=10 max_total_tokens=20}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Not enough memory to handle 20 total tokens with 10 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`
Error: Warmup(Generation("Not enough memory to handle 20 total tokens with 10 prefill tokens. You need to decrease `--max-batch-total-tokens` or `--max-batch-prefill-tokens`"))
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.