hover_net
Weird error that appears after running several epochs
Hi @vqdang, when I train HoVer-Net on the Kumar dataset, it runs into this error after several epochs. If there were a problem with the training data, why do the other epochs train smoothly? My system information is as follows:
Linux version 3.10.0-1160.24.1.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)) #1 SMP Thu Apr 8 19:51:47 UTC 2021
Python 3.7.4
CUDA compilation tools, release 10.2, V10.2.89
PyTorch 1.8.1+cu102
Traceback (most recent call last):
File "run_train.py", line 309, in
I guess this error was caused by running out of GPU memory, so I added torch.cuda.empty_cache() after each epoch. It works for me now.
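For context, a minimal sketch of where such a call sits in a generic PyTorch epoch loop; the names (model, loader, criterion, optimizer) are placeholders for illustration, not HoVer-Net's actual training engine, which is shown further down in this thread:

import torch

def train_epochs(model, loader, criterion, optimizer, nr_epochs, device="cuda"):
    model.to(device)
    for epoch in range(nr_epochs):
        model.train()
        for imgs, targets in loader:
            imgs, targets = imgs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(imgs), targets)
            loss.backward()
            optimizer.step()
        # release cached, unreferenced GPU blocks back to the driver after each epoch;
        # this mostly reduces allocator fragmentation and does not free live tensors
        torch.cuda.empty_cache()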
I will leave this open; this bug is peculiar. Could you please provide your system information if possible? It may help others later. Thank you @hitxiaoting
Thanks @vqdang, I have updated my system information.
Hi @hitxiaoting, many thanks for raising the issue. I met the same problem when retraining HoVer-Net with our own dataset. To be clear about your proposed solution: did you add "torch.cuda.empty_cache()" at the end of def run() in the run_train.py script?
Hi, I added this line at the end of the run() function in ./run_utils/engine.py; hope it helps you.
pbar.update()
pbar.close() # to flush out the bar before doing end of epoch reporting
self.state.curr_epoch += 1
self.__trigger_events(Events.EPOCH_COMPLETED)
torch.cuda.empty_cache()  # added: free cached GPU memory to work around the CUDA out-of-memory bug
# TODO: [CRITICAL] align the protocol
self.state.run_accumulated_output.append(
    self.state.epoch_accumulated_output
)
return
Many thanks, will have a try.
Some feedback for @vqdang: the error does not disappear with the above proposed solution on my side.
Are there any other solutions for this error?
inst_com = list(measurements.center_of_mass(inst_map))
inst_com = [x if not math.isnan(x) else 0 for x in inst_com]  # added line: replace NaN with 0
inst_com[0] = int(inst_com[0] + 0.5)
The above modification in targets.py worked for me.
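For anyone applying this, a self-contained sketch of the guarded computation (illustrative only, not a verbatim patch of targets.py; scipy.ndimage.center_of_mass is used here in place of the older measurements alias):

import math
import numpy as np
from scipy import ndimage

def instance_center(inst_map):
    # center of mass of a binary instance mask, rounded to integer pixels;
    # scipy returns NaN for an all-zero mask, so fall back to 0 instead of crashing
    inst_com = list(ndimage.center_of_mass(inst_map))
    inst_com = [x if not math.isnan(x) else 0 for x in inst_com]
    return [int(c + 0.5) for c in inst_com]

print(instance_center(np.zeros((2, 140), dtype=np.uint8)))  # [0, 0] rather than a ValueError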
Same, I'm getting:
C:\Python310\lib\site-packages\scipy\ndimage\measurements.py:1406: RuntimeWarning: invalid value encountered in double_scalars
results = [sum(input * grids[dir].astype(float), labels, index) / normalizer
Traceback (most recent call last):
File "E:\ai\hover_net\run_train.py", line 305, in <module>
trainer.run()
File "E:\ai\hover_net\run_train.py", line 288, in run
self.run_once(
File "E:\ai\hover_net\run_train.py", line 265, in run_once
main_runner.run(opt["nr_epochs"])
File "E:\ai\hover_net\run_utils\engine.py", line 173, in run
for data_batch in self.dataloader:
File "C:\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 530, in __next__
data = self._next_data()
File "C:\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 570, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "C:\Python310\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "C:\Python310\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "E:\ai\hover_net\dataloader\train_loader.py", line 119, in __getitem__
target_dict = self.target_gen_func(
File "E:\ai\hover_net\models\hovernet\targets.py", line 103, in gen_targets
hv_map = gen_instance_hv_map(ann, crop_shape)
File "E:\ai\hover_net\models\hovernet\targets.py", line 61, in gen_instance_hv_map
inst_com[0] = int(inst_com[0] + 0.5)
ValueError: cannot convert float NaN to integer
Processing: |###############################################################################################1 | 14/20[23:06<09:54,99.02s/it]Batch = 16.12650|EMA = 85.26806
@sumanthdonapati your solution worked for me! thanks!
@sumanthdonapati @vqdang I have encountered the same error; it actually comes from the padding:
# expand the box by 2px
# Because we first pad the ann at line 207, the bboxes
# will remain valid after expansion
inst_box[0] -= 2
inst_box[2] -= 2
inst_box[1] += 2
inst_box[3] += 2
Is the comment referring to some obsolete code that is no longer used?
In my case, before expanding: inst_box = [0, 254, 118, 256]; after expanding: inst_box = [-2, 256, 116, 258]. This resulted in inst_map having shape (2, 140) and containing the region outside the bounding box, i.e. all zeros. The NaN values come from calculating the center of mass of an all-zero array.
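To illustrate, a minimal standalone reproduction using the box values above (the 256x256 annotation and the instance placement are made up for illustration; this is not the actual targets.py code):

import numpy as np
from scipy import ndimage

ann = np.zeros((256, 256), dtype=np.int32)
ann[0:254, 118:256] = 1                        # one instance touching the image border

inst_box = [0, 254, 118, 256]                  # [rmin, rmax, cmin, cmax] of that instance
inst_box[0] -= 2                               # becomes -2: the expansion leaves the array
inst_box[2] -= 2
inst_box[1] += 2
inst_box[3] += 2

# a negative start index wraps around, so the crop grabs the last two rows instead
inst_map = ann[inst_box[0]:inst_box[1], inst_box[2]:inst_box[3]]
print(inst_map.shape)                          # (2, 140)
print(inst_map.max())                          # 0 -> none of the instance is in the crop
print(ndimage.center_of_mass(inst_map))        # (nan, nan); int(...) then raises ValueError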
This can be avoided by limiting extension with:
inst_box[0] = max(inst_box[0] - 2, 0)
inst_box[2] = max(inst_box[2] - 2, 0)
inst_box[1] = min(inst_box[1] + 2, orig_ann.shape[0])
inst_box[3] = min(inst_box[3] + 2, orig_ann.shape[1])