dmammfl issues


Results 3 issues of dmammfl

Is there a plan to support full fine-tuning of a 70B model?

Process hangs in multi-node training

### Reminder - [X] I have read the README and searched the existing issues. ### Reproduction I am trying to tune the model with the accelerate multi-node training example (examples/full_multi_gpu/multi_node.sh). But when...

pending

Only half of the parameters are saved when PP is applied

I'm currently training a Llama-3-8B model on 2 GPUs with pipeline parallelism only. However, when I save a checkpoint on each rank, only half of that checkpoint is saved. (Layer 1 is...

bug