Richard Sun
I am also confused about why we can calculate all the attention scores for the source sentence using the previous hidden state and current input embedding.
I also ran into this problem with 4 A100s, even with a small batch size.
I set the cpu_offload option to true (https://lightning.ai/docs/pytorch/2.0.0/_modules/lightning/pytorch/strategies/fsdp.html) for the FSDP strategy, and the training process could continue. But I am not sure how long it would take and whether...
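For anyone trying the same thing, here is a rough sketch of how that setting might be wired up. It assumes Lightning 2.0, where `FSDPStrategy` accepts a `cpu_offload` argument; the `FsdpSettings` class and `build_trainer` helper are hypothetical names used only for illustration, and the device count simply mirrors the 4 A100s mentioned above.

```python
from dataclasses import dataclass


@dataclass
class FsdpSettings:
    """Hypothetical container for the options discussed in the comment."""
    devices: int = 4          # the comment mentions 4 A100s
    cpu_offload: bool = True  # offload parameters to CPU to reduce GPU memory use


def build_trainer(settings: FsdpSettings):
    """Create a Lightning Trainer using FSDP with optional CPU offload.

    Assumes Lightning 2.0 is installed and the requested GPUs are available.
    The imports are done lazily so the settings can be inspected without
    Lightning present.
    """
    import lightning.pytorch as pl
    from lightning.pytorch.strategies import FSDPStrategy

    strategy = FSDPStrategy(cpu_offload=settings.cpu_offload)
    return pl.Trainer(
        accelerator="gpu",
        devices=settings.devices,
        strategy=strategy,
    )


settings = FsdpSettings()
print(settings)
```

Calling `build_trainer(settings)` on the multi-GPU machine would return the trainer. CPU offload trades GPU memory for extra host-to-device transfers, so slower steps are to be expected.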
Hi rasbt, thanks very much for sharing this project. I can run llama-lora on my local server without much struggle. Is it possible to fine-tune the 65B model on two...