MiniCPM-V

Checkpoint shards not loading; process always gets sent SIGTERM

Open • xsMarc opened this issue 1 year ago • 1 comment

Any help is appreciated (:

Loading checkpoint shards: 14%|███████ | 1/7 [00:13<01:22, 13.68s/it]
W0629 00:06:21.246000 140229302286144 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 29734 closing signal SIGTERM
W0629 00:06:21.246000 140229302286144 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 29735 closing signal SIGTERM
W0629 00:06:21.246000 140229302286144 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 29737 closing signal SIGTERM
E0629 00:06:24.298000 140229302286144 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 2 (pid: 29736) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

xsMarc, Jun 29 '24 00:06

Please provide your code.

LDLINGLINGLING, Jul 02 '24 00:07
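
A note on reading the log above: the worker that actually failed is local_rank 2 (pid 29736) with `exitcode: -9`; the SIGTERM lines are the launcher shutting down the other workers afterwards. A negative exit code in Python's process conventions means "killed by signal N", so the sketch below decodes it. The helper name `describe_exit_code` is made up for illustration, and the interpretation is an assumption rather than something confirmed in this thread: on Linux, a SIGKILL during checkpoint loading is often (not always) the kernel out-of-memory killer, for example when several ranks each load the full checkpoint into host RAM.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Turn a subprocess-style exit code into a readable description.

    Hypothetical helper for illustration; not part of torch or MiniCPM-V.
    """
    if exitcode < 0:
        # Negative codes mean the process was terminated by signal -exitcode.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"unknown signal {-exitcode}"
        return f"terminated by {name}"
    return f"exited with status {exitcode}"


# The value reported for local_rank 2 in the log.
print(describe_exit_code(-9))
```

If the result is SIGKILL, checking the kernel log (for example with `dmesg`) and host memory usage while the shards load would be a reasonable next step before looking at the training script itself.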