LLaMA-Factory
Process hangs in multi-node training
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
I am trying to tune the model with the accelerate multi-node training example (examples/full_multi_gpu/multi_node.sh), but when I execute the shell script, each process suddenly hangs.
(No node makes any progress from this point.)
The problem also occurs with Accelerate FSDP, DeepSpeed, and torch.distributed.run, and only LLaMA-Factory is affected (other training code works fine).
How can I solve this? I would appreciate your insights.
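For reference, a minimal connectivity check like the sketch below (not part of LLaMA-Factory; it assumes a torchrun-style launch that sets RANK, LOCAL_RANK, and WORLD_SIZE) hangs at the all_reduce if the NCCL rendezvous between the nodes is broken, and finishes quickly otherwise:

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Minimal multi-node sanity check: if this also stalls, the problem is in the
# network / NCCL rendezvous; if it passes, the hang is likely elsewhere
# (e.g. data preprocessing at training startup).

def main():
    os.environ.setdefault("NCCL_DEBUG", "INFO")  # verbose NCCL logs help locate stalls
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)  # blocks until every rank on every node participates
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} OK, sum={t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run it on each node with something like `torchrun --nnodes 2 --node_rank <rank> --master_addr <master-ip> --master_port 29500 --nproc_per_node <gpus> nccl_check.py` (placeholder values) to rule out the cluster setup before digging into the trainer.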
Expected behavior
No response
System Info
- transformers version: 4.40.2
- Platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.35
- Python version: 3.10.14
- Huggingface_hub version: 0.23.0
- Safetensors version: 0.4.3
- Accelerate version: 0.30.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Others
No response
I also have the same problem but don't know how to solve it.
+1
I resolved this issue by first running the process on a single node, using the "cache_dir" parameter to save the tokenized dataset, and then running the multi-node job so that it loads the cached dataset instead of tokenizing it again.
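For anyone hitting the same startup hang, the idea behind that workaround can be sketched outside LLaMA-Factory with the Hugging Face datasets API. This is only an illustration of the approach, not LLaMA-Factory's actual code path; the model name, data file, and shared cache path below are placeholders, and the parameter name inside LLaMA-Factory may differ from what this sketch shows:

```python
import os

from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer

# Tokenize once on a single node, write the result to a shared path, then have
# every node load the cached copy instead of re-tokenizing at the start of
# distributed training.

CACHE_PATH = "/shared/cache/tokenized_dataset"  # placeholder shared filesystem path

def build_cache():
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model
    raw = load_dataset("json", data_files="data/alpaca_data.json", split="train")  # placeholder data
    tokenized = raw.map(
        lambda ex: tokenizer(ex["instruction"], truncation=True, max_length=1024),
        batched=True,
        remove_columns=raw.column_names,
    )
    tokenized.save_to_disk(CACHE_PATH)

def load_cache():
    # All ranks on all nodes read the same pre-tokenized dataset, so no rank
    # spends time tokenizing while the others block at a collective.
    return load_from_disk(CACHE_PATH)

if __name__ == "__main__":
    if not os.path.isdir(CACHE_PATH):
        build_cache()
    ds = load_cache()
    print(ds)
```

On a shared filesystem this means only one machine pays the tokenization cost up front, and no rank sits waiting at a barrier for another node that is still preprocessing.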