CenterNet
ConnectionRefusedError: [Errno 111] Connection refused
training loss at iteration 79735: 5.6166815757751465
focal loss at iteration 79735: 5.0547027587890625
pull loss at iteration 79735: 0.0331345796585083
push loss at iteration 79735: 0.30962249636650085
regr loss at iteration 79735: 0.219222292304039
training loss at iteration 79740: 3.3387136459350586
focal loss at iteration 79740: 2.8270068168640137
pull loss at iteration 79740: 0.02639671042561531
push loss at iteration 79740: 0.2322157919406891
regr loss at iteration 79740: 0.25309425592422485
44%|█████████████▎ | 79741/180000 [36:08:34<45:26:33, 1.63s/it]Exception in thread Thread-3:
Traceback (most recent call last):
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "train.py", line 51, in pin_memory
data = data_queue.get()
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 256, in rebuild_storage_fd
fd = df.detach()
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/connection.py", line 492, in Client
c = SocketClient(address)
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
training loss at iteration 79745: 1.7480967044830322
focal loss at iteration 79745: 1.15070378780365
pull loss at iteration 79745: 0.019453493878245354
push loss at iteration 79745: 0.3843255937099457
regr loss at iteration 79745: 0.19361379742622375
44%|█████████████▎ | 79748/180000 [36:08:45<45:26:22, 1.63s/it]
^CTraceback (most recent call last):
File "train.py", line 203, in
I am using CenterNet to train on VOC2007, but training breaks at iteration 79748/180000 (around the 64th epoch). I tried again and it broke at iteration 68364/180000. My GPU memory usage is reported as 8051MiB/5116MiB. The error from the second run is:
training loss at iteration 68355: 5.786685466766357
focal loss at iteration 68355: 5.192009925842285
pull loss at iteration 68355: 0.008522081188857555
push loss at iteration 68355: 0.3189387023448944
regr loss at iteration 68355: 0.2672148048877716
38%|███████████▍ | 68357/180000 [27:26:39<44:49:23, 1.45s/it]Exception in thread Thread-3:
Traceback (most recent call last):
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "train.py", line 51, in pin_memory
data = data_queue.get()
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 256, in rebuild_storage_fd
fd = df.detach()
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/connection.py", line 492, in Client
c = SocketClient(address)
File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused
training loss at iteration 68360: 6.084456443786621
focal loss at iteration 68360: 5.576683521270752
pull loss at iteration 68360: 0.04028501734137535
push loss at iteration 68360: 0.2413397580385208
regr loss at iteration 68360: 0.22614836692810059
38%|███████████▍ | 68364/180000 [27:26:49<44:49:13, 1.45s/it]
After this error the program stops making progress and never resumes. Please help.