
Weird error: this error appears after running several epochs

hitxiaoting opened this issue Jul 25 '21 · 12 comments

Hi @vqdang, when I train HoVer-Net on the Kumar dataset, this error appears after several epochs. If there were a problem with the training data, why would the other epochs train smoothly? My system information is as follows:

Linux version 3.10.0-1160.24.1.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)) #1 SMP Thu Apr 8 19:51:47 UTC 2021
Python 3.7.4
CUDA compilation tools, release 10.2, V10.2.89
PyTorch 1.8.1+cu102

Traceback (most recent call last):
  File "run_train.py", line 309, in <module>
    trainer.run()
  File "run_train.py", line 293, in run
    phase_info, engine_opt, save_path, prev_log_dir=prev_save_path
  File "run_train.py", line 268, in run_once
    main_runner.run(opt["nr_epochs"])
  File "/home/tingxiao/code/hover_net/run_utils/engine.py", line 172, in run
    for data_batch in self.dataloader:
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 11.
Original Traceback (most recent call last):
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/tingxiao/code/hover_net/dataloader/train_loader.py", line 105, in __getitem__
    inst_map, self.mask_shape, **self.target_gen_kwargs
  File "/home/tingxiao/code/hover_net/models/hovernet/targets.py", line 102, in gen_targets
    hv_map = gen_instance_hv_map(ann, crop_shape)
  File "/home/tingxiao/code/hover_net/models/hovernet/targets.py", line 60, in gen_instance_hv_map
    inst_com[0] = int(inst_com[0] + 0.5)
ValueError: cannot convert float NaN to integer

hitxiaoting avatar Jul 25 '21 00:07 hitxiaoting

I guess this error was caused by running out of GPU memory, so I added torch.cuda.empty_cache() after each epoch; it works for me now.

hitxiaoting avatar Jul 25 '21 12:07 hitxiaoting

I will leave this open; this bug is peculiar. Could you please provide your system information if possible? It may help others later. Thank you @hitxiaoting

vqdang avatar Jul 26 '21 09:07 vqdang


Thanks @vqdang, I have updated my system information.

hitxiaoting avatar Jul 26 '21 13:07 hitxiaoting

Hi @hitxiaoting, many thanks for raising the issue. I met the same problem when retraining HoVer-Net with our own dataset. To be clear about your proposed solution: did you add torch.cuda.empty_cache() at the end of def run() in the run_train.py script?

cbhindex avatar Nov 22 '21 15:11 cbhindex


Hi, I added this line at the end of the run() function in ./run_utils/engine.py, hope it helps you.

            pbar.update()
        pbar.close()  # to flush out the bar before doing end of epoch reporting
        self.state.curr_epoch += 1
        self.__trigger_events(Events.EPOCH_COMPLETED)
        torch.cuda.empty_cache()  # free cached GPU memory to work around a CUDA out-of-memory bug

        # TODO: [CRITICAL] align the protocol
        self.state.run_accumulated_output.append(
            self.state.epoch_accumulated_output
        )

    return

hitxiaoting avatar Nov 22 '21 19:11 hitxiaoting


Many thanks, will have a try.

cbhindex avatar Nov 22 '21 23:11 cbhindex

Some feedback for @vqdang: the error does not disappear with the above proposed solution on my side.

cbhindex avatar Nov 23 '21 14:11 cbhindex

Any other solutions for this error?

sumanthdonapati avatar Jul 22 '22 14:07 sumanthdonapati

inst_com = list(measurements.center_of_mass(inst_map))
inst_com = [x if not math.isnan(x) else 0 for x in inst_com]  # added new line
inst_com[0] = int(inst_com[0] + 0.5)

The above modification in targets.py worked for me.
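
For anyone hitting this, here is a minimal standalone sketch (toy input, not the actual targets.py code) of what that guard does: scipy returns NaN for the center of mass of an all-zero mask, and the added line replaces the NaNs with 0 before the int() conversion.

import math
import numpy as np
from scipy.ndimage import center_of_mass

# An all-zero instance crop, e.g. a box expansion that missed the nucleus entirely
inst_map = np.zeros((2, 140), dtype=np.uint8)

inst_com = list(center_of_mass(inst_map))                     # [nan, nan] plus a RuntimeWarning
inst_com = [x if not math.isnan(x) else 0 for x in inst_com]  # the guard added above
inst_com[0] = int(inst_com[0] + 0.5)                          # no longer raises ValueError
inst_com[1] = int(inst_com[1] + 0.5)
print(inst_com)                                               # [0, 0]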

sumanthdonapati avatar Jul 22 '22 15:07 sumanthdonapati

Same, I'm getting:

C:\Python310\lib\site-packages\scipy\ndimage\measurements.py:1406: RuntimeWarning: invalid value encountered in double_scalars
  results = [sum(input * grids[dir].astype(float), labels, index) / normalizer
Traceback (most recent call last):
  File "E:\ai\hover_net\run_train.py", line 305, in <module>
    trainer.run()
  File "E:\ai\hover_net\run_train.py", line 288, in run
    self.run_once(
  File "E:\ai\hover_net\run_train.py", line 265, in run_once
    main_runner.run(opt["nr_epochs"])
  File "E:\ai\hover_net\run_utils\engine.py", line 173, in run
    for data_batch in self.dataloader:
  File "C:\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 530, in __next__
    data = self._next_data()
  File "C:\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 570, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Python310\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Python310\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "E:\ai\hover_net\dataloader\train_loader.py", line 119, in __getitem__
    target_dict = self.target_gen_func(
  File "E:\ai\hover_net\models\hovernet\targets.py", line 103, in gen_targets
    hv_map = gen_instance_hv_map(ann, crop_shape)
  File "E:\ai\hover_net\models\hovernet\targets.py", line 61, in gen_instance_hv_map
    inst_com[0] = int(inst_com[0] + 0.5)
ValueError: cannot convert float NaN to integer
Processing: |###############################################################################################1                                        | 14/20[23:06<09:54,99.02s/it]Batch = 16.12650|EMA = 85.26806

jorgegaticav avatar Nov 21 '22 11:11 jorgegaticav

@sumanthdonapati your solution worked for me! thanks!

jorgegaticav avatar Nov 21 '22 13:11 jorgegaticav

@sumanthdonapati @vqdang I have encountered the same error; it actually comes from this bounding-box expansion:

# expand the box by 2px
# Because we first pad the ann at line 207, the bboxes
# will remain valid after expansion
inst_box[0] -= 2
inst_box[2] -= 2
inst_box[1] += 2
inst_box[3] += 2

Is the comment referring to some obsolete code that is not used anymore?

In my case, before expanding: inst_box = [0, 254, 118, 256]; after expanding: inst_box = [-2, 256, 116, 258]. This resulted in inst_map having shape (2, 140) and containing only the region outside the bounding box, i.e. all zeros. The NaN values come from calculating the center of mass of an array containing all zeros.

This can be avoided by clamping the expansion:

inst_box[0] = max(inst_box[0]-2, 0)
inst_box[2] = max(inst_box[2]-2, 0)
inst_box[1] = min(inst_box[1]+2, orig_ann.shape[0])
inst_box[3] = min(inst_box[3]+2, orig_ann.shape[1])
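
For what it's worth, a toy reproduction (the annotation size is inferred from the numbers above, not taken from the repository): a negative start index wraps around in numpy, so the unclamped box grabs the bottom rows of the array instead of the instance near the top border, and the resulting all-zero crop is what produces the NaN center of mass.

import numpy as np

# Toy annotation with one instance touching the top border (256x256 assumed)
orig_ann = np.zeros((256, 256), dtype=np.int32)
orig_ann[0:254, 118:256] = 1

# Unclamped expansion: [rmin, rmax, cmin, cmax] = [0, 254, 118, 256] -> [-2, 256, 116, 258]
bad_crop = orig_ann[-2:256, 116:258]
print(bad_crop.shape, bad_crop.sum())        # (2, 140) 0 -> all zeros -> NaN center of mass

# Clamped expansion keeps the crop inside the array and the instance inside the crop
r0, r1 = max(0 - 2, 0), min(254 + 2, orig_ann.shape[0])
c0, c1 = max(118 - 2, 0), min(256 + 2, orig_ann.shape[1])
good_crop = orig_ann[r0:r1, c0:c1]
print(good_crop.shape, good_crop.sum() > 0)  # (256, 140) True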

Mgryn avatar Mar 04 '23 09:03 Mgryn