AutoGPTQ
Advice for quantizing BLOOMZ 175B
Hi @PanQiWei
I'd be most grateful if you could give me a bit of help.
I have been trying to quantize BLOOMZ 175B but can't currently get it done. BLOOMZ has 70 layers, and is a total of 360GB.
Hardware: 4 x A100 80G with 500GB RAM
Basic setup:
```python
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(bits=4, group_size=-1, desc_act=True)

model = AutoGPTQForCausalLM.from_pretrained(
    model_dir,
    quantize_config=quantize_config,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    trust_remote_code=False,
)
model.quantize(traindataset, use_triton=False, batch_size=1)
```
Attempt 1: no max_memory
result:
- OOM on GPU 1 at around layer 45 of 70
- zero VRAM usage on GPUs 2, 3, 4. Only GPU 1 is used
- GPU 1 starts around ~40GB VRAM used, goes up and down, until eventually OOM
Attempt 2: max_memory={ 0: '20GiB', 1: '20GiB', 2: '20GiB', 3: '20GiB', 'cpu': '450GiB' }
- GPUs 2, 3, 4 start at 20GiB VRAM. GPU 1 starts around 10GiB
- All GPU activity is on GPU 1 - GPUs 2, 3, 4 show 0%
- GPU 1 goes OOM very quickly, before layer 2 is reached.
Attempts 3+: other permutations of max_memory, with more or less on each GPU
- Same result as attempt 2: always OOM on GPU 1 before layer 2 is reached
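To be explicit about how these budgets were passed, here is a rough sketch (with model_dir and quantize_config as in the setup above; note the max_memory keys are 0-based GPU indices plus 'cpu'):
```python
import torch
from auto_gptq import AutoGPTQForCausalLM

# Attempt 2: cap every GPU at 20GiB and let the remainder spill into CPU RAM.
max_memory = {0: "20GiB", 1: "20GiB", 2: "20GiB", 3: "20GiB", "cpu": "450GiB"}

model = AutoGPTQForCausalLM.from_pretrained(
    model_dir,                        # as in the setup snippet above
    quantize_config=quantize_config,  # as in the setup snippet above
    max_memory=max_memory,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
)
```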
The problem seems to be that the quantization context needs more than 80GB on a single GPU, so it always OOMs? And I can't get any GPU with >80GB VRAM.
I could rent a system with even more GPUs, but the problem doesn't seem to be the number of GPUs - it's the memory required on GPU 1.
Unless I am missing some option or technique I should try?
I remember there was a post recently by a guy who did this entirely on CPU, but it took him 4+ days. If it can be done only on CPU, surely there must be a way to do it on GPU, without OOM?
Any advice would be really great. I've already spent quite a bit of $ trying to get this to work so would love to know what else there is to try.
Thanks in advance!
Hi, I think the conversation here might help you, where a 176B BLOOM was eventually quantized successfully.
OK thank you, I understand now. The issue we have is that we can only get enough RAM if we rent 4 x A100s, but then only one A100 is actually used and the cost is huge.
It is a big shame that disk offload is not yet supported, as this would solve the problem I think. Then I could specify max CPU RAM at 250GB and have the rest stored on disk, and this would greatly reduce the cost as I could use a 2 x A100 system instead of 4 x A100. Or I have access to a 1 x H100 system with 200GB RAM.
Can I ask what the issue is with disk offload? Does it need lots of changes to support this?
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32
I use a single A100 card to quantize the bloom176B model, and after adding this environment variable, the OOM issue is occasionally avoided.
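If you are setting it from a Python script rather than the shell, something like this sketch should work (the variable has to be set before torch initialises the CUDA caching allocator):
```python
import os

# Must be set before torch initialises the CUDA caching allocator,
# so set it before importing torch or allocating anything on the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"

import torch  # imported after setting the env var on purpose
```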
Oh thank you @Lihengwannafly !
Can I ask, how long did it take to quantise and pack the model?
Actually, implementing disk offload using accelerate would be as easy as implementing CPU offload. I just didn't realize there was demand for it, but now I do, and I will add it to my plan!
OK thank you! Yes I have never needed it before, but in this case it would be really useful. I think my H100 server with 200GB RAM would be quite fast for this task, even with the RAM and disk offload.
With 256 samples it takes about 10 hours; quantizing takes about 1 hour per 10 layers.
BOOM!
Took me weeks to find the right machine, but I eventually got it done using 1 x H100 80GB on a system provided by Latitude.sh with 750GB RAM. The CPU is an AMD EPYC 9354 32-core processor. The system actually has 4 x H100, but the other 3 weren't used.
And it didn't take anywhere near 55 hours! To be exact, the whole process took 224 minutes (3 hours 44 minutes). Quantising took about 2.5 minutes per layer, so about 175 minutes total. I was really scared about how long packing was going to take, but that was even faster - I think this CPU is really good.
Thanks so much for the suggestions. I used PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32 and cache_examples_on_gpu=False, and this did the trick. I was up at 99% VRAM usage several times, but it never went over.
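Concretely, the final quantize call looked roughly like this (cache_examples_on_gpu is a keyword argument of model.quantize()):
```python
# model and traindataset as in the setup snippet earlier in the thread.
# cache_examples_on_gpu=False keeps the calibration examples in CPU RAM,
# which leaves more VRAM headroom during quantization.
model.quantize(
    traindataset,
    use_triton=False,
    batch_size=1,
    cache_examples_on_gpu=False,
)
```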
As before I couldn't get it using multiple GPUs. At first I tried this:
max_memory={0:0, 1:'78GiB', 2:'78GiB', 3:'78GiB', 'cpu':'500GiB'}
I figured that would allow me to use GPU 0 for quantisation, while storing the weights on GPUs 1, 2 and 3. But it failed immediately with:
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Maybe because I told it to use 0GiB on GPU 0, it didn't even initialise that GPU? But then I couldn't let it use any VRAM on GPU 0, or else it would definitely OOM.
It's a shame I couldn't find a way to make use of the other GPUs, but I'm super happy I got it working.
When in use, the model needs 92GB of VRAM to load (e.g. 46GB on each of two GPUs), but with context that's likely to grow, so I would think 2 x 80GB or 3 x 48GB GPUs will be needed.
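For anyone wanting to load it, a rough sketch using device_map="auto" (the path is a placeholder; check the repo README for the exact filenames):
```python
from auto_gptq import AutoGPTQForCausalLM

# device_map="auto" lets accelerate split the ~92GB of weights across the
# visible GPUs; pass max_memory as well if you want to control the split.
model = AutoGPTQForCausalLM.from_quantized(
    "/path/to/bloomz-176B-GPTQ",  # placeholder path to the downloaded model
    device_map="auto",
    use_safetensors=True,
    use_triton=False,
)
```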
Here it is running on 2 x H100 80GB:
Output generated in 10.83 seconds (4.52 tokens/s, 49 tokens, context 46, seed 1138329768)
I will make the model available to everyone via Hugging Face Hub shortly!
I also quantised BLOOMChat v1.0, which is probably more interesting than BLOOMZ. Here are the models uploaded to HF:
- https://huggingface.co/TheBloke/bloomz-176B-GPTQ
- https://huggingface.co/TheBloke/BLOOMChat-176B-v1-GPTQ
Make sure to read the README - special steps are needed! You need to manually join the 3 x split files into the safetensors file, as AutoGPTQ doesn't yet support sharding and HF won't allow uploading files bigger than 50GB.
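If it helps, joining the parts is just a byte-level concatenation (the equivalent of cat); here is a sketch with placeholder filenames - the real split-file names are listed in the repo README:
```python
import shutil

# Placeholder filenames - substitute the names given in the README.
parts = [
    "model.safetensors.split-a",
    "model.safetensors.split-b",
    "model.safetensors.split-c",
]
with open("model.safetensors", "wb") as out:
    for part in parts:
        with open(part, "rb") as f:
            shutil.copyfileobj(f, out)
```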
@TheBloke @PanQiWei @Qubitium @Lihengwannafly @Sciumo How can we use multiple GPUs to quantize the large model?