
Process hangs in multi-node training

Open · dmammfl opened this issue 9 months ago

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

I am trying to fine-tune the model with the accelerate multi-node training example (examples/full_multi_gpu/multi_node.sh), but when I execute the shell script, each process suddenly hangs.

[screenshot: training logs; each node shows no further progress past this point]
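For reference, a script like multi_node.sh boils down to running the same launcher command on every node with a different rank. This is only a sketch, not the script's exact contents; the entry point and arguments follow the repository's examples and may differ by version:

```bash
# Run on every node; only NODE_RANK changes (0 on the master node, 1 on the other).
# MASTER_ADDR/MASTER_PORT must point at node 0 and be reachable from all nodes.
torchrun \
    --nnodes 2 \
    --node_rank $NODE_RANK \
    --nproc_per_node 8 \
    --master_addr $MASTER_ADDR \
    --master_port 29500 \
    src/train.py <your training arguments>
```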

The same hang occurs with accelerate FSDP, DeepSpeed, and torch.distributed.run. Only LLaMA-Factory is affected; my other multi-node training code works fine.
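(Not part of the original report, but a useful way to narrow down a hang like this is to enable NCCL debug logging and run a minimal torch.distributed smoke test with the same rendezvous settings as the real job. If the bare all_reduce below also hangs, the problem lies in the cluster/NCCL setup rather than any particular training framework. Hostnames and ports are placeholders.)

```bash
# Minimal NCCL smoke test: write a tiny DDP script, then launch it on every node.
cat > /tmp/ddp_smoke.py <<'EOF'
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # reads RANK/WORLD_SIZE/etc. set by torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # every rank should end up with the world size
print(f"rank {dist.get_rank()}: {x.item()}", flush=True)
dist.destroy_process_group()
EOF

# NCCL_DEBUG=INFO makes each rank log which collective it is stuck in.
NCCL_DEBUG=INFO torchrun --nnodes 2 --node_rank $NODE_RANK \
    --nproc_per_node 1 --master_addr $MASTER_ADDR --master_port 29500 \
    /tmp/ddp_smoke.py
```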

How can I solve this? I would appreciate your insights.

Expected behavior

No response

System Info

  • transformers version: 4.40.2
  • Platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Others

No response

dmammfl avatar May 07 '24 09:05 dmammfl

I also have the same problem but don't know how to solve it.

Kelu007 avatar May 07 '24 11:05 Kelu007

+1

xujunrt avatar May 07 '24 13:05 xujunrt

I resolved this issue by first running the process on a single node, using the "cache_dir" parameter to save the tokenized dataset. After that, I ran the process on multiple nodes.

EnesAltinisik avatar Oct 03 '24 07:10 EnesAltinisik
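For anyone hitting the same hang, a sketch of that two-step workaround (the entry point, flag names, and paths below are illustrative, following the comment above; adapt them to your setup):

```bash
# Step 1: run briefly on a single node so the dataset is tokenized and
# cached under a directory every node can read (e.g. a shared filesystem).
CUDA_VISIBLE_DEVICES=0 python src/train.py \
    --cache_dir /shared/hf_cache \
    --max_steps 1  # plus your usual model/dataset/output arguments

# Step 2: launch the multi-node job pointing at the same cache_dir; each
# node now loads the pre-tokenized dataset instead of building it itself.
bash examples/full_multi_gpu/multi_node.sh
```

If the comment above is right about the cause, this works because every node reads a finished dataset cache at startup instead of racing to tokenize the data concurrently.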