Angainor Development
I just updated to git+https://github.com/kooshi/transformers.git@balanced_memory_8bit

> how did you force max_memory?

I edited finetune.py line 78 to use `max_memory={0: "15GB", 1: "15GB"},`. This seems to have no effect,...
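For reference, this is roughly how I'd expect `max_memory` to be wired into the model load (a sketch only; the model id is a placeholder and the exact line in finetune.py may differ):

```python
import torch
from transformers import AutoModelForCausalLM

# Cap each GPU at ~15GB so the device_map balancing can spread the
# 8-bit weights across both cards instead of filling GPU 0 first.
model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",  # placeholder model id
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "15GB", 1: "15GB"},
)
```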
Yeah, but `torch.cuda.device_count()` correctly detects the 2 GPUs. `CUDA_VISIBLE_DEVICES` was not set; I explicitly set it to `CUDA_VISIBLE_DEVICES=0,1`, no change. The second one gets a bit of VRAM used when...
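For anyone debugging the same thing, a quick diagnostic sketch (plain PyTorch, nothing alpaca-lora specific) to confirm what the process actually sees:

```python
import os
import torch

# Show which devices are visible to this process and how much memory each reports.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("device_count =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} {props.name} {props.total_memory / 1024**3:.1f} GiB")
```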
Thanks for the follow-up. Agreed, something could be broken in my setup; I'll start from a clean one next time I try. Thanks!
See this other (same) issue and its answers: https://github.com/tloen/alpaca-lora/issues/8#issuecomment-1477490259 Training on multiple GPUs is possible with torchrun; you'll double the batch size and halve the training time. Take care of...
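The part to take care of is the gradient accumulation bookkeeping under DDP; a sketch of the idea (variable names are illustrative, check finetune.py for the actual code):

```python
import os

batch_size = 128        # global effective batch size to keep constant
micro_batch_size = 4    # per-step, per-GPU batch size

gradient_accumulation_steps = batch_size // micro_batch_size

# torchrun sets WORLD_SIZE; with N processes each rank already contributes
# micro_batch_size samples per step, so divide the accumulation steps
# accordingly to keep the same effective batch size.
world_size = int(os.environ.get("WORLD_SIZE", 1))
if world_size > 1:
    gradient_accumulation_steps //= world_size
```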
Just to clarify, because I read several questions around this and I'd like to understand the rationale behind it: before this commit https://github.com/tloen/alpaca-lora/commit/b12c3b90f808e7d62709aad104d4fac1fbc880eb the prompt was masked in the labels....
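For context, "masking the prompt in the labels" means setting the prompt tokens' label ids to -100 so the loss is only computed on the response; a minimal sketch (tokenizer and prompt/response split are illustrative, not the repo's exact code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("decapoda-research/llama-7b-hf")  # placeholder

prompt = "### Instruction:\nSay hi.\n\n### Response:\n"
response = "Hi!"

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
full_ids = tokenizer(prompt + response, add_special_tokens=False)["input_ids"]

# -100 is ignored by the cross-entropy loss, so only the response tokens
# contribute to the training signal.
labels = [-100] * len(prompt_ids) + full_ids[len(prompt_ids):]
```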
Oh, OK! Thanks!
Great suggestion! I'd like this to be an extra param, or to be auto-computed from the number of cores / number of GPUs. When running with DDP for instance, we don't...
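Something along these lines is what I mean by auto-computing it (a sketch, not a patch; the function name is illustrative):

```python
import os
import torch

def default_num_workers() -> int:
    """Split the available CPU cores across the local GPU processes."""
    cores = os.cpu_count() or 1
    gpus = max(torch.cuda.device_count(), 1)
    # Under DDP each rank spawns its own dataloader workers, so divide
    # the cores between ranks to avoid oversubscribing the machine.
    return max(cores // gpus, 1)
```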
Yep. I'll propose a deeper check in a future PR, as all these params have to match and be consistent with each other and with the training dataset size. For instance, in...
> What do you think of an alternative fix where:

Yep, what I had in mind was a consistency check of the entangled params, with a warning and auto-fix if possible....
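A rough sketch of the kind of consistency check I have in mind (names mirror finetune.py's args, but the exact checks are open for discussion):

```python
import warnings

def check_entangled_params(batch_size, micro_batch_size, world_size, dataset_size):
    # The params are entangled: batch_size should be divisible by
    # micro_batch_size * world_size, and the dataset has to be large
    # enough to fill at least one optimizer step.
    if batch_size % (micro_batch_size * world_size) != 0:
        warnings.warn(
            "batch_size is not divisible by micro_batch_size * world_size; "
            "rounding gradient_accumulation_steps down."
        )
    grad_accum = max(batch_size // (micro_batch_size * world_size), 1)
    if dataset_size < micro_batch_size * world_size * grad_accum:
        warnings.warn("Training dataset is smaller than one effective batch.")
    return grad_accum
```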
Why don't we set adapter_name from now on, to avoid tweaking lib code? The default name in the current peft lib is `adapter_name="default"`; just adding this at training end in `get_peft_model_state_dict(` would...
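i.e. something like this at the end of training (sketch only; `model` is the PEFT-wrapped model):

```python
from peft import get_peft_model_state_dict, set_peft_model_state_dict

# Name the adapter explicitly instead of relying on patched lib code;
# "default" matches the current peft default.
adapter_state = get_peft_model_state_dict(model, adapter_name="default")

# When loading the checkpoint back, use the same adapter name.
set_peft_model_state_dict(model, adapter_state, adapter_name="default")
```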