Open-Sora

train error with exitcode: -4

Open tiancaitzp opened this issue 1 year ago • 9 comments

The base model and dataset load successfully, but a few seconds after training starts, the run fails with an error. The error is as follows:

15F53272B8CE1AE9365CCCFD4E693355

tiancaitzp avatar Apr 10 '24 07:04 tiancaitzp

May I know your training command? Have you changed any lines in your training code?

JThh avatar Apr 10 '24 13:04 JThh

command: torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x256x256.py --data-path /home/tanzhipeng/open-sora/Open-Sora/datasets/travel.csv

Changes in code: image

tiancaitzp avatar Apr 11 '24 08:04 tiancaitzp

May I know why you set use_reentrant = True? It may not be recommended when there are nested modules (link).

JThh avatar Apr 11 '24 12:04 JThh

When I set use_reentrant = False, I got the same error. image image

tiancaitzp avatar Apr 12 '24 01:04 tiancaitzp

May I see the output of nvidia-smi? An exit code of -4 is often associated with an Out-Of-Memory (OOM) issue.
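For reference, on POSIX systems Python reports a child process killed by signal N with a negative exit code of -N. Assuming torchrun follows that convention (an assumption, not confirmed in this thread), the code can be decoded to a signal name to narrow down the cause: -4 maps to signal 4 (SIGILL on Linux), while a kernel OOM kill usually appears as -9 (SIGKILL). A minimal sketch, where `describe_exit_code` is a hypothetical helper name:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Return a readable description of a child-process exit code.

    Assumes the POSIX/Python convention that a negative value -N means
    the child was terminated by signal N.
    """
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # Signal number not defined on this platform.
            name = "unknown signal"
        return f"terminated by signal {-code} ({name})"
    return f"exited with status {code}"


# The exit code reported in this issue.
print(describe_exit_code(-4))
```

If the decoded signal is not SIGKILL, checking the traceback and the system logs (for example `dmesg`) alongside nvidia-smi may be more informative than looking at memory usage alone.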

JThh avatar Apr 14 '24 14:04 JThh

image image

tiancaitzp avatar Apr 16 '24 04:04 tiancaitzp

Hi, is there any update about this issue?

tiancaitzp avatar May 07 '24 08:05 tiancaitzp

Hi @tiancaitzp, can you try it again with export CUDA_LAUNCH_BLOCKING=1? It might surface a more specific error.
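Setting CUDA_LAUNCH_BLOCKING=1 makes CUDA kernel launches synchronous, so an error is reported at the call that caused it rather than at some later point. A minimal sketch of preparing the same launch from Python with the variable set only for the child process; `build_debug_launch` is a hypothetical helper, and the command string is the one posted earlier in this thread:

```python
import os
import shlex


def build_debug_launch(command, base_env=None):
    """Split a shell-style command and return (argv, env) with
    CUDA_LAUNCH_BLOCKING=1 added to a copy of the environment."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # force synchronous kernel launches
    return shlex.split(command), env


argv, env = build_debug_launch(
    "torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py "
    "configs/opensora/train/16x256x256.py "
    "--data-path /home/tanzhipeng/open-sora/Open-Sora/datasets/travel.csv"
)

# To actually launch the run, pass both values to subprocess:
#   import subprocess
#   subprocess.run(argv, env=env, check=False)
```

Training is noticeably slower with this variable set, so it is best used only while debugging.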

FrankLeeeee avatar May 10 '24 06:05 FrankLeeeee

Your code modification seems correct.

zhengzangw avatar May 10 '24 06:05 zhengzangw

Please try OpenSora 1.2. If the problem still exists, feel free to re-open.

zhengzangw avatar Jun 22 '24 04:06 zhengzangw