text-generation-webui
Triton default settings not applied with CPU offloading
Describe the bug
With Change GPTQ triton default settings · 7438f4f, the GPTQ triton flags were inverted, so those features should now be off by default. However, when using CPU offloading with the --pre_layer flag, the features are still on, and with the no-flags gone they can no longer be disabled.
I noticed because warmup autotune runs on startup, and it also triggers the "Unexpected mma -> mma layout conversion" assertion on my system. Without the no-flags, I can no longer work around that.
I'm not asking for the no-flags to be brought back; rather, I think it's a bug that these features are enabled instead of disabled when using the --pre_layer flag.
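For illustration only, here is a minimal Python sketch (hypothetical names, not the project's actual code) of how this kind of default inversion can miss one code path: the CLI flags become opt-in, but the offloading branch never forwards them, so the loader's own old defaults still apply there.

```python
import argparse

parser = argparse.ArgumentParser()
# After the change, the triton features are opt-in (off by default):
parser.add_argument('--warmup_autotune', action='store_true')
parser.add_argument('--fused_mlp', action='store_true')
parser.add_argument('--pre_layer', type=int, default=0)
args = parser.parse_args()

def load_quant(checkpoint, warmup_autotune=True, fused_mlp=True, pre_layer=0):
    # Hypothetical loader whose *own* defaults still enable the features.
    print(f'loading {checkpoint}: autotune={warmup_autotune}, '
          f'fused_mlp={fused_mlp}, pre_layer={pre_layer}')

checkpoint = 'model.safetensors'
if args.pre_layer:
    # CPU-offloading path: the CLI flags are never forwarded, so the
    # loader's old defaults (True) silently win and the features stay on.
    load_quant(checkpoint, pre_layer=args.pre_layer)
else:
    # Regular path: the new opt-in flags are honored.
    load_quant(checkpoint,
               warmup_autotune=args.warmup_autotune,
               fused_mlp=args.fused_mlp)
```

If the real code follows a similar shape, the fix would be to forward the flag values on the --pre_layer path as well, rather than relying on the loader's defaults.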
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
python server.py --auto-launch --chat --model TheBloke_vicuna-13B-1.1-GPTQ-4bit-128g --no-stream --pre_layer 14
Screenshot
No response
Logs
Gradio HTTP request redirected to localhost :)
bin ~/micromamba/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
Loading TheBloke_vicuna-13B-1.1-GPTQ-4bit-128g...
Found the following quantized model: models/TheBloke_vicuna-13B-1.1-GPTQ-4bit-128g/vicuna-13B-1.1-GPTQ-4bit-128g.safetensors
Loading model ...
The safetensors archive passed at models/TheBloke_vicuna-13B-1.1-GPTQ-4bit-128g/vicuna-13B-1.1-GPTQ-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
Found 3 unique KN Linear values.
Warming up autotune cache ...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:49<00:00, 4.14s/it]
Found 1 unique fused mlp KN values.
Warming up autotune cache ...
0%| | 0/12 [00:00<?, ?it/s]
python: /opt/conda/conda-bld/torchtriton_1677881345124/work/lib/Analysis/Allocation.cpp:42: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::Attribute&): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
[1] 4196 IOT instruction python server.py --auto-launch --chat --model --no-stream --pre_layer 14
System Info
Windows 11, WSL, NVIDIA GeForce RTX 2070 Super