AutoGPTQ
Advice for quantizing BLOOMZ 175B
Hi @PanQiWei
I'd be most grateful if you could give me a bit of help.
I have been trying to quantize BLOOMZ 175B but can't currently get it done. BLOOMZ has 70 layers, and is a total of 360GB.
Hardware: 4 x A100 80G with 500GB RAM
Basic setup:
```python
import torch
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(bits=4, group_size=-1, desc_act=True)

model = AutoGPTQForCausalLM.from_pretrained(
    model_dir,
    quantize_config=quantize_config,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    trust_remote_code=False,
)
model.quantize(traindataset, use_triton=False, batch_size=1)
```
Attempt 1: no max_memory
result:
- OOM on GPU 1 at around layer 45 of 70
- zero VRAM usage on GPUs 2, 3, 4. Only GPU 1 is used
- GPU 1 starts around ~40GB VRAM used, goes up and down, until eventually OOM
Attempt 2: max_memory={ 0: '20GiB', 1: '20GiB', 2: '20GiB', 3: '20GiB', 'cpu': '450GiB' }
- GPUs 2, 3, 4 start at 20GiB VRAM. GPU 1 starts around 10GiB
- All GPU activity is on GPU 1 - GPUs 2, 3, 4 show 0%
- GPU 1 goes OOM very quickly, before layer 2 is reached.
Attempts 3+: other permutations of max_memory, with more or less on each GPU
- Same result as attempt 2: always OOM on GPU 1 before layer 2 is reached
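To be explicit about how these budgets were passed, here is a rough sketch (with model_dir and quantize_config as in the setup above; note the max_memory keys are 0-based GPU indices plus 'cpu'):
```python
import torch
from auto_gptq import AutoGPTQForCausalLM

# Attempt 2: cap every GPU at 20GiB and let the remainder spill into CPU RAM.
max_memory = {0: "20GiB", 1: "20GiB", 2: "20GiB", 3: "20GiB", "cpu": "450GiB"}

model = AutoGPTQForCausalLM.from_pretrained(
    model_dir,                        # as in the setup snippet above
    quantize_config=quantize_config,  # as in the setup snippet above
    max_memory=max_memory,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
)
```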
The problem seems to be that the quantization context needs more than 80GB on a single GPU, so it always OOMs? And I can't get any GPU with >80GB VRAM.
I could rent a system with even more GPUs, but the problem doesn't seem to be the number of GPUs - it's the memory required on GPU 1.
Unless I am missing some option or technique I should try?
I remember there was a post recently by a guy who did this entirely on CPU, but it took him 4+ days. If it can be done only on CPU, surely there must be a way to do it on GPU, without OOM?
Any advice would be really great. I've already spent quite a bit of $ trying to get this to work so would love to know what else there is to try.
Thanks in advance!
Hi, I think the conversation here might help you, where a 176B BLOOM was eventually quantized successfully.
OK thank you, I understand now. The issue we have is that we can only get enough RAM if we rent 4 x A100s, but then only one A100 is actually used and the cost is huge.
It is a big shame that disk offload is not yet supported, as this would solve the problem I think. Then I could specify max CPU RAM at 250GB and have the rest stored on disk, and this would greatly reduce the cost as I could use a 2 x A100 system instead of 4 x A100. Or I have access to a 1 x H100 system with 200GB RAM.
Can I ask what the issue is with disk offload? Does it need lots of changes to support this?
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32
I use a single A100 card to quantize the bloom176B model, and after adding this environment variable, the OOM issue is occasionally avoided.
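If you are setting it from a Python script rather than the shell, something like this sketch should work (the variable has to be set before torch initialises the CUDA caching allocator):
```python
import os

# Must be set before torch initialises the CUDA caching allocator,
# so set it before importing torch or allocating anything on the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"

import torch  # imported after setting the env var on purpose
```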
Oh thank you @Lihengwannafly !
Can I ask, how long did it take to quantise and pack the model?
Actually, implementing disk offload using accelerate would be as easy as implementing CPU offload. I just didn't realize there was demand for it, but now I do, and I will add it to my plan!
OK thank you! Yes I have never needed it before, but in this case it would be really useful. I think my H100 server with 200GB RAM would be quite fast for this task, even with the RAM and disk offload.
With 256 samples it takes about 10 hours; quantizing takes about 1 hour per 10 layers.
BOOM!
Took me weeks to find the right machine, but I eventually got it done using 1 x H100 80GB on a system provided by Latitude.sh with 750GB RAM. The CPU is an AMD EPYC 9354 32-core processor. The system actually has 4 x H100, but the other 3 weren't used.
And it didn't take anywhere near 55 hours! To be exact, the whole process took 224 minutes (3 hours 44 minutes). Quantising took about 2.5 minutes per layer, so about 175 minutes total. I was really scared about how long packing was going to take, but that was even faster - I think this CPU is really good.
Thanks so much for the suggestions. I used PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32 and cache_examples_on_gpu=False, and this did the trick. I was up at 99% VRAM usage several times, but it never went over.
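Concretely, the final quantize call looked roughly like this (cache_examples_on_gpu is a keyword argument of model.quantize()):
```python
# model and traindataset as in the setup snippet earlier in the thread.
# cache_examples_on_gpu=False keeps the calibration examples in CPU RAM,
# which leaves more VRAM headroom during quantization.
model.quantize(
    traindataset,
    use_triton=False,
    batch_size=1,
    cache_examples_on_gpu=False,
)
```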
As before I couldn't get it using multiple GPUs. At first I tried this:
max_memory={0:0, 1:'78GiB', 2:'78GiB', 3:'78GiB', 'cpu':'500GiB'}
I figured that would allow me to use GPU 0 for quantisation, while storing the weights on GPUs 1, 2 and 3. But it failed immediately with:
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Maybe because I told it to use 0GiB on GPU 0, it didn't even initialise that GPU? But then I couldn't let it use any VRAM on GPU 0, or else it would definitely OOM.
It's a shame I couldn't find a way to make use of the other GPUs, but I'm super happy I got it working.
When in use, the model needs 92GB of VRAM to load (e.g. 46GB on each of two GPUs), but with context that's likely to grow, so I would think 2 x 80GB or 3 x 48GB GPUs will be needed.
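For anyone wanting to load it, a rough sketch using device_map="auto" (the path is a placeholder; check the repo README for the exact filenames):
```python
from auto_gptq import AutoGPTQForCausalLM

# device_map="auto" lets accelerate split the ~92GB of weights across the
# visible GPUs; pass max_memory as well if you want to control the split.
model = AutoGPTQForCausalLM.from_quantized(
    "/path/to/bloomz-176B-GPTQ",  # placeholder path to the downloaded model
    device_map="auto",
    use_safetensors=True,
    use_triton=False,
)
```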
Here it is running on 2 x H100 80GB:
Output generated in 10.83 seconds (4.52 tokens/s, 49 tokens, context 46, seed 1138329768)
I will make the model available to everyone via Hugging Face Hub shortly!
I also quantised BLOOMChat v1.0, which is probably more interesting than BLOOMZ. Here are the models uploaded to HF:
- https://huggingface.co/TheBloke/bloomz-176B-GPTQ
- https://huggingface.co/TheBloke/BLOOMChat-176B-v1-GPTQ
Make sure to read the README - special steps are needed! You need to manually join the 3 x split files into the safetensors file, as AutoGPTQ doesn't yet support sharding and HF won't allow uploading files bigger than 50GB.
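If it helps, joining the parts is just a byte-level concatenation (the equivalent of cat); here is a sketch with placeholder filenames - the real split-file names are listed in the repo README:
```python
import shutil

# Placeholder filenames - substitute the names given in the README.
parts = [
    "model.safetensors.split-a",
    "model.safetensors.split-b",
    "model.safetensors.split-c",
]
with open("model.safetensors", "wb") as out:
    for part in parts:
        with open(part, "rb") as f:
            shutil.copyfileobj(f, out)
```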
@TheBloke @PanQiWei @Qubitium @Lihengwannafly @Sciumo How can we use multiple GPUs to quantize the large model?