AlpinDale
Great work, @jaemzfleming. It seems the kernels are too inefficient - it takes 10 minutes to load a 1x16 70b on a 3090, and ~4 minutes for 2x8. Have you...
The code still needs testing before an attempt at implementation is made. I have not tested it yet - I'm not 100% sure I've got the layer names correct. *Theoretically*...
Can confirm that the GPTQ implementation for the GPT-J 6B model (and any model fine-tuned off of it, such as [Pygmalion 6B](https://huggingface.co/PygmalionAI/pygmalion-6b)) seems to be working perfectly.
Needs either a TPU or GPUs (NVIDIA/AMD only), and there have to be 8 devices.
> @SparkJiao Sorry but what do you mean by `zero=0`?
>
> By the way, I just find that removing model.cuda() or model.eval() help me to solve the multiplication error:...
Your LoRA rank might be too high (`r = 128`). I wouldn't recommend going above an effective batch size of `1` either; it seems to negatively affect the train loss...
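
For reference, a minimal sketch of a lower-rank adapter configuration, assuming the comment refers to the `peft` library's `LoraConfig` (the rank, alpha, and target module names below are illustrative, not a recommendation from the original thread):

```python
from peft import LoraConfig, get_peft_model

# Hypothetical lower-rank adapter config; target module names depend on the base model.
lora_config = LoraConfig(
    r=16,                                  # adapter rank, much lower than r=128
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # illustrative attention projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# model = get_peft_model(base_model, lora_config)
```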
> @AlpinDale Is the effective batch size equal to the value of `per_device_train_batch_size`?

Effective batch size is equal to `per_device_train_batch_size` * `gradient_accumulation_steps`.
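
A minimal sketch of how those two arguments combine on a single device, assuming the Hugging Face `transformers` `TrainingArguments` API (the specific values are illustrative):

```python
from transformers import TrainingArguments

# Illustrative values; their product is the effective batch size on one device.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
)

effective_batch_size = args.per_device_train_batch_size * args.gradient_accumulation_steps
print(effective_batch_size)  # 4 * 8 = 32
```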
@Tostino Yes. Keep in mind though that an effective batch size of 1 results in a *very* slow training time.
Currently having this issue as well. The `CUDA_VISIBLE_DEVICES` environment variable has no effect either, and it only loads the models to GPU 0. I'm running on A100s but still get...
> @dcruiz01 @SunixLiu @AlpinDale vLLM is designed to take almost all of your GPU memory. Could you double-check your GPU is not used by other processes when using vLLM? Thanks,...
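
For context, a minimal sketch of capping how much memory vLLM reserves, assuming the `gpu_memory_utilization` argument of `vllm.LLM` (the model name and fraction here are illustrative):

```python
from vllm import LLM

# Ask vLLM to reserve roughly 50% of GPU memory instead of its default ~90%.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```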