
ConnectionRefusedError: [Errno 111] Connection refused

Open • Glutton-zh opened this issue 4 years ago • 1 comment

training loss at iteration 79735: 5.6166815757751465
focal loss at iteration 79735: 5.0547027587890625
pull loss at iteration 79735: 0.0331345796585083
push loss at iteration 79735: 0.30962249636650085
regr loss at iteration 79735: 0.219222292304039
training loss at iteration 79740: 3.3387136459350586
focal loss at iteration 79740: 2.8270068168640137
pull loss at iteration 79740: 0.02639671042561531
push loss at iteration 79740: 0.2322157919406891
regr loss at iteration 79740: 0.25309425592422485
44%|█████████████▎ | 79741/180000 [36:08:34<45:26:33, 1.63s/it]
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "train.py", line 51, in pin_memory
    data = data_queue.get()
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 256, in rebuild_storage_fd
    fd = df.detach()
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/connection.py", line 492, in Client
    c = SocketClient(address)
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

training loss at iteration 79745: 1.7480967044830322
focal loss at iteration 79745: 1.15070378780365
pull loss at iteration 79745: 0.019453493878245354
push loss at iteration 79745: 0.3843255937099457
regr loss at iteration 79745: 0.19361379742622375
44%|█████████████▎ | 79748/180000 [36:08:45<45:26:22, 1.63s/it]

^CTraceback (most recent call last):
  File "train.py", line 203, in
Process Process-5:
Process Process-2:
Process Process-1:
Process Process-4:

Glutton-zh • Apr 23 '20 10:04

I am using CenterNet to train on VOC2007, but training stops at iteration 79748/180000 (at the 64th epoch). I tried again and it stopped at 68364/180000. My GPU memory usage is 8051MiB/5116MiB. The error is:

training loss at iteration 68355: 5.786685466766357
focal loss at iteration 68355: 5.192009925842285
pull loss at iteration 68355: 0.008522081188857555
push loss at iteration 68355: 0.3189387023448944
regr loss at iteration 68355: 0.2672148048877716
38%|███████████▍ | 68357/180000 [27:26:39<44:49:23, 1.45s/it]
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "train.py", line 51, in pin_memory
    data = data_queue.get()
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 256, in rebuild_storage_fd
    fd = df.detach()
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/connection.py", line 492, in Client
    c = SocketClient(address)
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

training loss at iteration 68360: 6.084456443786621
focal loss at iteration 68360: 5.576683521270752
pull loss at iteration 68360: 0.04028501734137535
push loss at iteration 68360: 0.2413397580385208
regr loss at iteration 68360: 0.22614836692810059
38%|███████████▍ | 68364/180000 [27:26:49<44:49:13, 1.45s/it]

After that, the program stops making progress. Please help me.

Glutton-zh • Apr 25 '20 02:04