Open-Sora
train error with exitcode: -4
The base model and dataset load successfully, but a few seconds after training starts it fails with an error. The error is as follows:
May I know the command you used for training? Have you changed any lines in your training code?
command: torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x256x256.py --data-path /home/tanzhipeng/open-sora/Open-Sora/datasets/travel.csv
Changes in code:
May I know why you set use_reentrant = True? It may not be recommended when there are nested modules (link).
When I set use_reentrant = False, I got the same error.
May I see the output of nvidia-smi? An exit code of -4 is often associated with an Out-Of-Memory (OOM) issue.
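As a side check on the OOM hypothesis: a negative exit code reported for a worker process usually means the process was terminated by the signal with that number, so it can help to look up which signal -4 maps to. A minimal sketch, assuming a Linux host (the helper name describe_exitcode is hypothetical, not part of Open-Sora or PyTorch):

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Map a process exit code to a readable description.

    Assumption: negative values follow the Python/POSIX convention where
    -N means "terminated by signal N".
    """
    if exitcode >= 0:
        return f"exited normally with status {exitcode}"
    try:
        name = signal.Signals(-exitcode).name
    except ValueError:
        # Signal number not known on this platform.
        return f"terminated by unknown signal {-exitcode}"
    return f"terminated by {name} (signal {-exitcode})"


print(describe_exitcode(-4))
print(describe_exitcode(-9))
```

On Linux, signal 4 is SIGILL (illegal instruction) and signal 9 is SIGKILL, which is the signal the kernel OOM killer sends. A SIGILL can point to a binary built for a different CPU instruction set, so it may be worth checking how the installed packages were built, in addition to checking memory usage.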
Hi, is there any update about this issue?
Hi @tiancaitzp, can you try again with export CUDA_LAUNCH_BLOCKING=1 set? It might raise a more specific error.
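If it is more convenient to launch from Python than from the shell, the same setting can be passed through the child process environment. This is only a sketch: the helper names blocking_env and launch are hypothetical, and the command simply mirrors the torchrun invocation posted earlier in this thread.

```python
import os
import subprocess


def blocking_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set."""
    env = dict(os.environ if base is None else base)
    # Makes CUDA kernel launches synchronous, so an error is reported
    # at the call that caused it rather than at a later point.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def launch(data_path):
    """Run the training command from this thread with the variable set."""
    cmd = [
        "torchrun", "--nnodes=1", "--nproc_per_node=1",
        "scripts/train.py", "configs/opensora/train/16x256x256.py",
        "--data-path", data_path,
    ]
    result = subprocess.run(cmd, env=blocking_env())
    return result.returncode
```

Calling launch(...) with the path to your CSV file from the repository root runs the training once with synchronous kernel launches. Training will be slower in this mode, so it is meant for debugging only.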
Your code modification seems correct.
Please try OpenSora 1.2. If the problem still exists, feel free to re-open.