
leaked shared_memory

Open s5248 opened this issue 2 years ago • 1 comment

Frustrating: after training to about 1654/ba, the run broke and failed to save the checkpoint. I have tried twice. The error is as follows:

[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=39739, OpType=ALLREDUCE, Timeout(ms)=300000) ran for 302714 milliseconds before timing out.
train 4%|▉
/home/anaconda3/envs/control/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
----------End global rank 3 STDERR----------
ERROR:composer.cli.launcher:Global rank 0 (PID 35121) has still not exited; return exit code 1.

s5248 • Jul 04 '23 06:07

Can you please provide the full trace? Happy to help out :)

mvpatel2000 • Jul 13 '23 05:07
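
A note for readers who hit the same warning: the `resource_tracker` message comes from Python's standard library, which prints it when shared-memory blocks are still registered at interpreter shutdown. In the log above it appears right after the NCCL watchdog timeout, so it is likely a side effect of the process being torn down before it could clean up, not the root cause. The sketch below is not Composer's code; it only illustrates, with the standard library and a hypothetical helper named `roundtrip`, the close-then-unlink lifecycle that leaves nothing for the tracker to report.

```python
from multiprocessing import shared_memory


def roundtrip(payload: bytes) -> bytes:
    """Copy payload through a shared-memory block, then release the block.

    Hypothetical helper for illustration only; payload must be non-empty.
    """
    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        # The block may be larger than requested, so always slice explicitly.
        shm.buf[:len(payload)] = payload
        return bytes(shm.buf[:len(payload)])
    finally:
        shm.close()   # drop this process's mapping of the block
        shm.unlink()  # destroy the block; skipping this leaves a "leaked" object


if __name__ == "__main__":
    print(roundtrip(b"checkpoint"))
```

If a process is killed before its `finally` block runs (for example by a launcher after a collective timeout), the unlink never happens and the tracker reports the leftover blocks at shutdown.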