
Bug when training Hover_net with pannuke dataset

Open liavilable opened this issue 3 years ago • 2 comments

Hi all, I was training the model with the PanNuke dataset. The following problem occurs and I don't understand whether it is a problem with the data or the code. Here are the specific problems that occur. Thanks very much!

Processing: |          | 0/11[00:00<?,?it/s]
Processing: |9         | 1/11[00:03<00:30, 3.10s/it]
Processing: |#8        | 2/11[00:04<00:16, 1.82s/it]
Processing: |##7       | 3/11[00:05<00:11, 1.45s/it]
Processing: |###6      | 4/11[00:06<00:08, 1.28s/it]
Processing: |####5     | 5/11[00:06<00:06, 1.14s/it]
Processing: |#####4    | 6/11[00:08<00:05, 1.13s/it]
Processing: |######3   | 7/11[00:09<00:04, 1.14s/it]
Processing: |#######2  | 8/11[00:10<00:03, 1.10s/it]
Processing: |########1 | 9/11[00:11<00:02, 1.12s/it]
Processing: |######### | 10/11[00:12<00:01, 1.01s/it]
Processing: |##########| 11/11[00:13<00:00, 1.01it/s]
Processing: |##########| 11/11[00:13<00:00, 1.20s/it]
Traceback (most recent call last):
  File "run_train.py", line 318, in <module>
    trainer.run()
  File "run_train.py", line 300, in run
    phase_info, engine_opt, save_path, prev_log_dir=prev_save_path
  File "run_train.py", line 275, in run_once
    main_runner.run(opt["nr_epochs"])
  File "/home/liable/hover_net-master/run_utils/engine.py", line 197, in run
    self.__trigger_events(Events.EPOCH_COMPLETED)
  File "/home/liable/hover_net-master/run_utils/engine.py", line 123, in __trigger_events
    callback.run(self.state, event)
  File "/home/liable/hover_net-master/run_utils/callbacks/base.py", line 70, in run
    chained=True, nr_epoch=self.nr_epoch, shared_state=state
  File "/home/liable/hover_net-master/run_utils/engine.py", line 197, in run
    self.__trigger_events(Events.EPOCH_COMPLETED)
  File "/home/liable/hover_net-master/run_utils/engine.py", line 123, in __trigger_events
    callback.run(self.state, event)
  File "/home/liable/hover_net-master/run_utils/callbacks/base.py", line 213, in run
    track_dict = self.proc_func(raw_data)
  File "/home/liable/hover_net-master/models/hovernet/opt.py", line 139, in <lambda>
    lambda a: proc_valid_step_output(a, nr_types=nr_type)
  File "/home/liable/hover_net-master/models/hovernet/run_desc.py", line 290, in proc_valid_step_output
    patch_prob_np = prob_np[idx]
IndexError: list index out of range

liavilable • Dec 01 '22 09:12

Not sure of your exact setup, but that error often happens when the batch_size of the last step is 1. Try to ensure that batch_size is always > 1.

https://github.com/vqdang/hover_net/issues/103#issuecomment-798997624
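One way to check the advice above before launching a run is to compute how many samples the final batch would contain. The following is a minimal sketch in plain Python; the function names `last_batch_size` and `safe_batch_sizes` and the sample counts are hypothetical and not part of hover_net.

```python
def last_batch_size(num_samples, batch_size):
    """Return how many samples land in the final batch of an epoch."""
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    remainder = num_samples % batch_size
    # A remainder of 0 means the final batch is a full one.
    return remainder if remainder else batch_size


def safe_batch_sizes(num_samples, candidates):
    """Keep only the batch sizes whose final batch holds more than one sample."""
    return [b for b in candidates if last_batch_size(num_samples, b) > 1]


# Hypothetical example: 33 validation patches.
print(last_batch_size(33, 8))                # 1 -> the problematic case
print(safe_batch_sizes(33, [4, 6, 8, 16]))   # [6]
```

If the data is served through a PyTorch `DataLoader`, passing `drop_last=True` is another way to avoid an undersized final batch, at the cost of skipping those samples. Whether hover_net's configuration exposes that option is an assumption to verify in your copy of the code.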

vqdang • Dec 01 '22 11:12

Thank you vqdang, that really helps. Also, when I switch to a larger dataset, I get the following bug when training the model with the same hyperparameters. The labels of the data have been checked and there are no errors. I would like to know whether this bug is also caused by inappropriate hyperparameters.

----------------EPOCH 1
Processing: |          | 0/332[00:00<?,?it/s]Batch = nan|EMA = nan
Traceback (most recent call last):
  File "run_train.py", line 318, in <module>
    trainer.run()
  File "run_train.py", line 300, in run
    phase_info, engine_opt, save_path, prev_log_dir=prev_save_path
  File "run_train.py", line 275, in run_once
    main_runner.run(opt["nr_epochs"])
  File "/home/jumengwei/hover_net-master/run_utils/engine.py", line 182, in run
    step_output = self.run_step(data_batch, step_run_info)
  File "/home/jumengwei/hover_net-master/models/hovernet/run_desc.py", line 54, in train_step
    true_tp_onehot = F.one_hot(true_tp, num_classes=model.module.nr_types)
RuntimeError: Class values must be smaller than num_classes.
Processing: |          | 0/332[00:01<?,?it/s]Batch = nan|EMA = nan

liavilable • Dec 05 '22 10:12
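The `F.one_hot` error in the traceback above means that at least one value in the type labels is greater than or equal to the configured `nr_types`, so every label must lie in the range 0 to `nr_types - 1`. A minimal sketch for finding the offending values is shown below. The helper name `out_of_range_labels` and the example labels are hypothetical, and the sketch assumes the type map has been flattened to a sequence of integers.

```python
from collections import Counter


def out_of_range_labels(type_values, nr_types):
    """Count label values that fall outside the valid range [0, nr_types)."""
    counts = Counter(int(v) for v in type_values)
    return {v: n for v, n in sorted(counts.items()) if v < 0 or v >= nr_types}


# Hypothetical flattened type map: labels run from 0 to 5,
# but the model is configured with nr_types=5.
example = [0, 1, 1, 5, 3, 5, 2]
print(out_of_range_labels(example, nr_types=5))  # {5: 2}
```

An empty result means every label fits. If the type map uses 0 for background, `nr_types` has to count the background class as well, so a dataset with five nuclear types plus background would need `nr_types=6` under that assumption.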