
Process hangs in multi-node training

Open · dmammfl opened this issue 9 months ago

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

I am trying to fine-tune the model with the accelerate multi-node training example (examples/full_multi_gpu/multi_node.sh), but when I execute the shell script, each process suddenly hangs.

[screenshot: training logs; each node shows no further progress past this point]
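For reference, a script like multi_node.sh boils down to running the same launcher command on every node with a different rank. This is only a sketch, not the script's exact contents; the entry point and arguments follow the repository's examples and may differ by version:

```bash
# Run on every node; only NODE_RANK changes (0 on the master node, 1 on the other).
# MASTER_ADDR/MASTER_PORT must point at node 0 and be reachable from all nodes.
torchrun \
    --nnodes 2 \
    --node_rank $NODE_RANK \
    --nproc_per_node 8 \
    --master_addr $MASTER_ADDR \
    --master_port 29500 \
    src/train.py <your training arguments>
```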

The same hang occurs with accelerate FSDP, DeepSpeed, and torch.distributed.run. Only LLaMA-Factory is affected; my other multi-node training code works fine.
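(Not part of the original report, but a useful way to narrow down a hang like this is to enable NCCL debug logging and run a minimal torch.distributed smoke test with the same rendezvous settings as the real job. If the bare all_reduce below also hangs, the problem lies in the cluster/NCCL setup rather than any particular training framework. Hostnames and ports are placeholders.)

```bash
# Minimal NCCL smoke test: write a tiny DDP script, then launch it on every node.
cat > /tmp/ddp_smoke.py <<'EOF'
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # reads RANK/WORLD_SIZE/etc. set by torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # every rank should end up with the world size
print(f"rank {dist.get_rank()}: {x.item()}", flush=True)
dist.destroy_process_group()
EOF

# NCCL_DEBUG=INFO makes each rank log which collective it is stuck in.
NCCL_DEBUG=INFO torchrun --nnodes 2 --node_rank $NODE_RANK \
    --nproc_per_node 1 --master_addr $MASTER_ADDR --master_port 29500 \
    /tmp/ddp_smoke.py
```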

How can I solve this? I would appreciate your insights.

Expected behavior

No response

System Info

  • transformers version: 4.40.2
  • Platform: Linux-5.15.0-60-generic-x86_64-with-glibc2.35
  • Python version: 3.10.14
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Others

No response

dmammfl avatar May 07 '24 09:05 dmammfl

I also have the same problem but don't know how to solve it.

Kelu007 avatar May 07 '24 11:05 Kelu007

+1

xujunrt avatar May 07 '24 13:05 xujunrt

I resolved this issue by first running the process on a single node, using the "cache_dir" parameter to save the tokenized dataset. After that, I ran the process on multiple nodes.

EnesAltinisik avatar Oct 03 '24 07:10 EnesAltinisik
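For anyone hitting the same hang, a sketch of that two-step workaround (the entry point, flag names, and paths below are illustrative, following the comment above; adapt them to your setup):

```bash
# Step 1: run briefly on a single node so the dataset is tokenized and
# cached under a directory every node can read (e.g. a shared filesystem).
CUDA_VISIBLE_DEVICES=0 python src/train.py \
    --cache_dir /shared/hf_cache \
    --max_steps 1  # plus your usual model/dataset/output arguments

# Step 2: launch the multi-node job pointing at the same cache_dir; each
# node now loads the pre-tokenized dataset instead of building it itself.
bash examples/full_multi_gpu/multi_node.sh
```

If the comment above is right about the cause, this works because every node reads a finished dataset cache at startup instead of racing to tokenize the data concurrently.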