Sebastian Raschka
Right now, given that there are so many other things to do, this hasn't been high on my priority list. But we'd be happy about contributions if you are interested...
Thanks for the update @janEbert ! This looks good to me. Btw have you done a comparison (re memory usage) before and after by chance?
I see, yeah I think we should do some comparisons to make sure it works as intended. If you want to do them, that'd be nice! I suggest perhaps with...
That'd be awesome. And please let me know in case you need any help!
@janEbert Looks awesome, which model is that? I am also rerunning some of the models in the config hub and will update the numbers accordingly!
I just ran a quick comparison on a 4xA10G machine to see if I can reproduce the config hub performance ``` | falcon-7b/lora.yaml | falcon-7b | 4 | 512 |...
Not sure. I observed it with Phi-2 too: Main branch: ```bash litgpt finetune_lora checkpoints/microsoft/phi-2/ --devices 4 ``` ``` Epoch 1 | iter 1 step 0 | loss train: 2.424, val:...
> Why does the loss train increase (for the code from this PR)? From 2.299 up to 17.512. I am curious if the whole Block was maybe accidentally trainable (instead...
That's a good point, but I think there is a different issue here that I am not understanding yet 😅. When I reran the code I observed basically the same...
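One generic way to sanity-check a suspicion like "the whole Block was accidentally trainable" is to list which parameters actually have `requires_grad=True`. The sketch below is not litgpt's implementation; it uses a hypothetical minimal LoRA-style layer just to illustrate the check (in a real run you would call `trainable_param_names` on the finetuned model instead):

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Minimal LoRA-style layer: frozen base weight plus a trainable
    low-rank update. Illustrative only; initialization details (e.g.
    Gaussian init for lora_a) are omitted for brevity."""

    def __init__(self, in_features: int, out_features: int, r: int = 4):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        # Freeze the pretrained base weights.
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # Low-rank adapters stay trainable (nn.Parameter defaults to
        # requires_grad=True).
        self.lora_a = nn.Parameter(torch.zeros(r, in_features))
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T


def trainable_param_names(model: nn.Module) -> list[str]:
    """Return the names of all parameters the optimizer would update."""
    return [name for name, p in model.named_parameters() if p.requires_grad]


if __name__ == "__main__":
    layer = LoRALinear(8, 8)
    # Only the adapter weights should show up here; if base weights
    # appeared too, something was accidentally left trainable.
    print(trainable_param_names(layer))
```

If the base transformer weights show up in that list after setting up LoRA finetuning, that would explain a diverging training loss like the one above.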
Thanks for looking into this @TensorTemplar . I think that this may not be feasible then, so I am closing the PR for now. But happy to revisit this with...