hover_net
Can't reproduce publication values on the Kumar dataset
Hi, I'm trying to reproduce the segmentation metrics on the Kumar dataset with the PyTorch implementation, but I can't quite reach them. I'm training on a single GPU with less VRAM, so I had to reduce the batch size to 4 for the first 50 epochs and to 2 for the second 50 epochs, but other than that I believe I'm using the parameters from the paper. Did you use a different seed than the one found in the codebase? As the training set I used all 16 images from the train set, and as the validation set I have now switched to the combined Kumar test set as described in https://github.com/vqdang/hover_net/issues/4.

However, on the merged Kumar test set I always get values like:

|  | Dice | AJI | DQ | SQ | PQ |
| --- | --- | --- | --- | --- | --- |
| My run | 0.80574 | 0.56586 | 0.73394 | 0.76586 | 0.56405 |
| Publication | 0.826 | 0.618 | 0.770 | 0.773 | 0.597 |

(The PyTorch implementation is reported to be a bit lower than the publication, but not to that extent.)
Config and opt file: config_opt.zip
Example stats from training (where I tried a different seed than the one in the codebase, without success): https://drive.google.com/file/d/1TCUYg23thqvPMyxF0bSwrFbRnEEVyBn2/view?usp=sharing
Thank you!
Hi, to give you a proper diagnosis you would need to reproduce the setup that we use exactly, i.e. a batch size of 8 for the first 50 epochs, reduced to 4 for the next 50 epochs. Without matching this, we can't dig further into your results.
Hi @simongraham, with the help of Google Colab I was now able to run it with the correct training batch size of 8 for the first 50 epochs and 4 for the last 50 epochs. (The validation batch size was 2 for the first 50 epochs, since that part still ran on my local machine, and 4 for the last 50, but to my understanding this shouldn't affect the training results.) So:
```
# first 50 epochs
"batch_size": {"train": 8, "valid": 2,},
# last 50 epochs
"batch_size": {"train": 4, "valid": 4,},
```
However, sadly I still get results in the same range as before (with compute_stats.py): 0.80314 0.57737 0.72974 0.76446 0.56034 (Dice AJI DQ SQ PQ)
Inference was run with:

```
python /content/hover_norm/run_infer.py --gpu=0 --nr_types=0 --model_path=<logs_dir>/01/net_epoch=50.tar --model_mode=original --batch_size=8 tile --input_dir=<combined_test_diff_same_dir>/Images --output_dir=<output_dir>
```
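As a side note, here is a minimal sketch of how a single image pair can be re-scored outside compute_stats.py, just to rule out an evaluation mix-up. The function names from metrics/stats_utils.py, the "inst_map" key, and the mat/ output subdirectory are assumptions on my part, so please check them against the checkout before relying on the numbers:

```python
# Sketch only: re-score one test image against its ground truth.
# Assumes metrics/stats_utils.py provides get_dice_1, get_fast_aji, get_fast_pq
# and remap_label, and that both .mat files store the labelled instance map
# under the "inst_map" key (verify these against your checkout and data).
import scipy.io as sio
from metrics.stats_utils import get_dice_1, get_fast_aji, get_fast_pq, remap_label

true = remap_label(sio.loadmat("<combined_test_diff_same_dir>/Labels/TCGA-21-5784-01Z-00-DX1.mat")["inst_map"])
pred = remap_label(sio.loadmat("<output_dir>/mat/TCGA-21-5784-01Z-00-DX1.mat")["inst_map"])

print("Dice:", get_dice_1(true, pred))
print("AJI :", get_fast_aji(true, pred))
dq, sq, pq = get_fast_pq(true, pred)[0]  # first return value holds [DQ, SQ, PQ]
print("DQ/SQ/PQ:", dq, sq, pq)
```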
Training file list (downloaded via the links in this GitHub README):
['TCGA-18-5592-01Z-00-DX1.tif', 'TCGA-38-6178-01Z-00-DX1.tif', 'TCGA-49-4488-01Z-00-DX1.tif', 'TCGA-50-5931-01Z-00-DX1.tif', 'TCGA-A7-A13E-01Z-00-DX1.tif', 'TCGA-A7-A13F-01Z-00-DX1.tif', 'TCGA-AR-A1AK-01Z-00-DX1.tif', 'TCGA-AR-A1AS-01Z-00-DX1.tif', 'TCGA-B0-5711-01Z-00-DX1.tif', 'TCGA-G9-6336-01Z-00-DX1.tif', 'TCGA-G9-6348-01Z-00-DX1.tif', 'TCGA-G9-6356-01Z-00-DX1.tif', 'TCGA-G9-6363-01Z-00-DX1.tif', 'TCGA-HE-7128-01Z-00-DX1.tif', 'TCGA-HE-7129-01Z-00-DX1.tif', 'TCGA-HE-7130-01Z-00-DX1.tif']
Test file list (same as the validation set, see the opening post):
['TCGA-21-5784-01Z-00-DX1.tif', 'TCGA-21-5786-01Z-00-DX1.tif', 'TCGA-AY-A8YK-01A-01-TS1.tif', 'TCGA-B0-5698-01Z-00-DX1.tif', 'TCGA-B0-5710-01Z-00-DX1.tif', 'TCGA-CH-5767-01Z-00-DX1.tif', 'TCGA-DK-A2I6-01A-01-TS1.tif', 'TCGA-E2-A14V-01Z-00-DX1.tif', 'TCGA-E2-A1B5-01Z-00-DX1.tif', 'TCGA-G2-A2EK-01A-02-TSB.tif', 'TCGA-G9-6362-01Z-00-DX1.tif', 'TCGA-KB-A93J-01A-01-TS1.tif', 'TCGA-NH-A8F7-01A-01-TS1.tif', 'TCGA-RD-A8N9-01A-01-TS1.tif']
Logs (only the weights from the last epoch of each phase; tell me if you really need all the weights): https://drive.google.com/file/d/1dZ-IjoHcInNmjT_6rJaJeUGVtJF0-PBF/view?usp=sharing
Thank you!
Could you please detail how you performed this? This requires you to alter either https://github.com/vqdang/hover_net/blob/9b21c8620313b2cd2458611b244e4a4ef4ee8865/run_train.py#L274 or https://github.com/vqdang/hover_net/blob/9b21c8620313b2cd2458611b244e4a4ef4ee8865/models/hovernet/opt.py#L23 to manually hook in the pretrained weights from the first phase. If you are not careful, it may even start the training from scratch.
Hi, I simply modified the run function like this:
```python
def run(self):
    self.nr_gpus = torch.cuda.device_count()
    print('Detect #GPUS: %d' % self.nr_gpus)
    # print(str(self.model_config))

    phase_list = self.model_config["phase_list"]
    engine_opt = self.model_config["run_engine"]

    prev_save_path = None
    i = 0
    for phase_idx, phase_info in enumerate(phase_list):
        if len(phase_list) == 1:
            save_path = os.path.join(self.log_dir, "%02d" % (0))
        else:
            save_path = os.path.join(self.log_dir, "%02d" % (phase_idx))
        # Skip the already-trained first phase (i == 0) and only launch the
        # later phase(s); prev_log_dir points run_once at the phase-0 folder
        # so it resumes from the net_epoch=50.tar saved there.
        if i > 0:
            self.run_once(
                phase_info, engine_opt, save_path, prev_log_dir=prev_save_path
            )
        i += 1
        prev_save_path = save_path
```
And then of course I kept the directory structure, i.e. subdirectory 00 in the logs directory containing net_epoch=50.tar from the first 50 epochs.
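For anyone reproducing this, a quick way to confirm that the checkpoint actually contains network weights before phase 2 picks it up via prev_log_dir (sketch only; the internal layout of the saved file is an assumption, so inspect the printed keys on your own checkpoint):

```python
# Sketch only: confirm the phase-1 checkpoint holds network weights before
# resuming phase 2 from it. Whether the state dict sits at the top level or
# under a nested key is an assumption; check the printed keys yourself.
import torch

ckpt = torch.load("<logs_dir>/00/net_epoch=50.tar", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
    # Whichever entry holds the weights should map layer names to tensors
    # whose shapes match the HoVerNet encoder/decoder.
```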
Have you solved it?