text-generation-webui
Triton default settings not applied with CPU offloading
Describe the bug
With Change GPTQ triton default settings · 7438f4f, the GPTQ triton flags were inverted, so those features should now be off by default. However, when using CPU offloading with the --pre_layer flag, the features are still on, and with the no-flags gone they can no longer be disabled.
I noticed because warmup autotune runs on startup, and it also triggers the "Unexpected mma -> mma layout conversion" assertion on my system. Without the no-flags, I can no longer work around that.
I'm not asking for the no-flags to be brought back; rather, I think it's a bug that these features are enabled instead of disabled when using the --pre_layer flag.
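For illustration only, here is a minimal Python sketch (hypothetical names, not the project's actual code) of how this kind of default inversion can miss one code path: the CLI flags become opt-in, but the offloading branch never forwards them, so the loader's own old defaults still apply there.

```python
import argparse

parser = argparse.ArgumentParser()
# After the change, the triton features are opt-in (off by default):
parser.add_argument('--warmup_autotune', action='store_true')
parser.add_argument('--fused_mlp', action='store_true')
parser.add_argument('--pre_layer', type=int, default=0)
args = parser.parse_args()

def load_quant(checkpoint, warmup_autotune=True, fused_mlp=True, pre_layer=0):
    # Hypothetical loader whose *own* defaults still enable the features.
    print(f'loading {checkpoint}: autotune={warmup_autotune}, '
          f'fused_mlp={fused_mlp}, pre_layer={pre_layer}')

checkpoint = 'model.safetensors'
if args.pre_layer:
    # CPU-offloading path: the CLI flags are never forwarded, so the
    # loader's old defaults (True) silently win and the features stay on.
    load_quant(checkpoint, pre_layer=args.pre_layer)
else:
    # Regular path: the new opt-in flags are honored.
    load_quant(checkpoint,
               warmup_autotune=args.warmup_autotune,
               fused_mlp=args.fused_mlp)
```

If the real code follows a similar shape, the fix would be to forward the flag values on the --pre_layer path as well, rather than relying on the loader's defaults.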
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
python server.py --auto-launch --chat --model TheBloke_vicuna-13B-1.1-GPTQ-4bit-128g --no-stream --pre_layer 14
Screenshot
No response
Logs
Gradio HTTP request redirected to localhost :)
bin ~/micromamba/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
Loading TheBloke_vicuna-13B-1.1-GPTQ-4bit-128g...
Found the following quantized model: models/TheBloke_vicuna-13B-1.1-GPTQ-4bit-128g/vicuna-13B-1.1-GPTQ-4bit-128g.safetensors
Loading model ...
The safetensors archive passed at models/TheBloke_vicuna-13B-1.1-GPTQ-4bit-128g/vicuna-13B-1.1-GPTQ-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
Found 3 unique KN Linear values.
Warming up autotune cache ...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:49<00:00, 4.14s/it]
Found 1 unique fused mlp KN values.
Warming up autotune cache ...
0%| | 0/12 [00:00<?, ?it/s]
python: /opt/conda/conda-bld/torchtriton_1677881345124/work/lib/Analysis/Allocation.cpp:42: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(const mlir::Attribute&, const mlir::Attribute&): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
[1] 4196 IOT instruction python server.py --auto-launch --chat --model --no-stream --pre_layer 14
System Info
Windows 11, WSL, NVIDIA GeForce RTX 2070 Super