
Multi-GPU QLoRA?

cuichenx opened this issue Apr 23, 2024 · 6 comments

Hi, first of all, thanks for the great tutorials on LoRA and QLoRA! I was able to follow them very easily. I was wondering whether multi-GPU QLoRA is supported. I couldn't find a config file for it in the repo, and when I tried the multi-GPU LoRA recipe with model.quantize_base=True added, I got this error:

ValueError: The module has CPU parameters or buffers when `sync_module_states=True`, which requires them to be on GPU. Please specify the `device_id` argument or move the module to GPU before passing it to FSDP.

Is multi-GPU QLoRA currently supported, or is it on the roadmap? Thanks a lot!
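For reference, here's roughly the command I ran (paraphrased; the config name is just the one I happened to use):

```bash
# Multi-GPU LoRA recipe with base-weight quantization switched on
# (roughly reconstructed; this is what triggered the error above):
tune run --nproc_per_node 2 lora_finetune_distributed \
  --config llama2/7B_lora \
  model.quantize_base=True
```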

cuichenx · Apr 23, 2024

Hey @cuichenx - glad you found the tutorials useful!

Currently, multi-GPU FSDP + QLoRA is not supported in torchtune, but this is something we are actively working on. Turns out it's a non-trivial combination. See this blog post from the folks over at answer.ai for some more information.
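To give a flavor of why: QLoRA keeps the frozen base weights in 4-bit NF4, which is a tensor subclass rather than a plain floating-point parameter, and FSDP's flat-parameter sharding doesn't compose with that out of the box. Here's a toy sketch of the mismatch using the NF4 utilities from torchao (which torchtune builds on); this is not torchtune code:

```python
import torch
from torchao.dtypes.nf4tensor import to_nf4

# QLoRA stores the frozen base weight as an NF4 tensor subclass:
# packed 4-bit data plus per-block quantization scales.
w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_nf4 = to_nf4(w)

# FSDP shards modules by flattening plain tensors into a single
# FlatParameter; a subclass like NF4Tensor can't be flattened and
# all-gathered that way, so sharding quantized base weights needs
# special handling.
print(type(w_nf4).__name__, w_nf4.shape)
```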

cc: @rohan-varma

joecummings · Apr 23, 2024

Thanks for the fast response! Looking forward to it :)

cuichenx · Apr 23, 2024

@cuichenx I'd be curious to learn more about your use case. Are you looking at QLoRA instead of LoRA because of memory constraints, or something else? My impression has been that LoRA gives a higher-quality model, though at slightly higher memory usage. Have you tried LoRA, and did it not work on your setup? Thanks for taking a look at torchtune! :)

kartikayk · Apr 23, 2024

Hi @kartikayk, I'm currently doing some exploratory studies on QLoRA vs. LoRA, so I was looking for a more apples-to-apples comparison: LoRA on a larger model like 34B or 70B needs multiple GPUs, so I'd want to run QLoRA on multiple GPUs too. But for now I can do my studies on the smaller models. Thanks for making this awesome framework!

cuichenx · Apr 23, 2024

@cuichenx sounds awesome! We'll make sure to comment on here as soon as we have this up and running!

kartikayk · Apr 23, 2024

Thanks for trying out QLoRA @cuichenx and glad to hear that the tutorial is helpful!

Re: LoRA vs. QLoRA, as per the tutorial and the enablement PR (https://github.com/pytorch/torchtune/pull/478), in my experience we're actually able to get pretty good convergence with QLoRA and match LoRA on some eval tasks, with roughly 50% memory savings. As mentioned, though, we don't yet have multi-GPU support; we're working on it.
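In the meantime, single-device QLoRA works today with the stock recipe and config, e.g.:

```bash
# Single-device QLoRA fine-tune of Llama2 7B:
tune run lora_finetune_single_device --config llama2/7B_qlora_single_device
```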

rohan-varma · Apr 23, 2024

This was recently added in #909 and is currently available as an experimental feature in our latest stable version. Closing as completed for now; please reopen if you run into any issues using it.
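Usage looks roughly like the following; note that the recipe and config names below are illustrative, since this lives under the experimental dev recipes, so please check the repo for the exact names:

```bash
# Illustrative invocation of the experimental FSDP2 + QLoRA recipe
# from #909 (names may differ; see recipes/dev in the repo):
tune run --nproc_per_node 2 lora_finetune_fsdp2 \
  --config llama2/7B_qlora_fsdp2
```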

RdoubleA · Jul 19, 2024