
torch.distributed.elastic.multiprocessing.errors.ChildFailedError Error

Open ayushgupta9198 opened this issue 1 year ago • 0 comments

Hi All,

I have set up everything with the LLaVA-NeXT repo, and I want to run the pretraining script on a vision dataset. However, when I run the script, it runs for a while and then crashes on its own with the ChildFailedError named in the title.

I have added a script to load this dataset from Hugging Face for the LLaVA-NeXT model: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data

I have not been able to figure this out so far. Can anyone help me fix this issue so that I can proceed with pretraining, then fine-tuning, then inference? Please also see the screenshots below for the error I get when running the script.

Please share your thoughts on this.

Screenshot (15) Screenshot (16) Screenshot (17) Screenshot (14)

Thanks.

ayushgupta9198, Sep 25 '24 12:09