Wsi-Caption: test results are very poor

Dear author, thank you for providing such an excellent model. During training I get results similar to yours, but standalone testing fails almost completely. Here are my training and testing results.

Testing: 100%|██████████| 98/98 [17:49<00:00, 10.91s/it]
{'testlen': 14243, 'reflen': 14388, 'guess': [14243, 14145, 14047, 13949], 'correct': [5755, 1976, 718, 266]}
ratio: 0.9899221573532812
epoch : 60
train_loss : 0.058107303353536185
val_BLEU_1 : 0.3779433481411532
val_BLEU_2 : 0.2284521199738517
val_BLEU_3 : 0.13936495211614205
val_BLEU_4 : 0.08774822165710586
val_METEOR : 0.1495248850246772
val_ROUGE_L : 0.24553979627265676
test_BLEU_1 : 0.3999655121193585
test_BLEU_2 : 0.23517579042605666
test_BLEU_3 : 0.14091858033187665
test_BLEU_4 : 0.0852521828737444
test_METEOR : 0.15742257069415153
test_ROUGE_L : 0.24202879420907616
Saving checkpoint: results/BRCA\current_checkpoint.pth ...
(DL01) E:\xxll\Wsi-Caption-master>python main.py --mode 'Test' --image_dir E:/xxll/Wsi-Caption-master/pt_files --ann_path E:/xxll/Wsi-Caption-master/TCGA-BRCA --split_path E:/xxll/Wsi-Caption-master/ocr/dataset_csv/splits_0.csv --checkpoint_dir E:/xxll/Wsi-Caption-master/results/BRCA
The size of train dataset: 804
The size of val dataset: 95
The size of test dataset: 98
use encoder_decoder: default
100%|██████████| 98/98 [11:18<00:00, 6.92s/it]
{'testlen': 58800, 'reflen': 14388, 'guess': [58800, 58702, 58604, 58506], 'correct': [13, 0, 0, 0]}
ratio: 4.086738949123986
Results in test set
test_BLEU_1 : 0.00022108843537414595
test_BLEU_2 : 1.9406917697642934e-12
test_BLEU_3 : 4.005548145143126e-15
test_BLEU_4 : 1.8205238188759316e-16
test_METEOR : 0.0007464332412179531
test_ROUGE_L : 0.0004049096621859695

Part of the generated text report:
{'predict': 'streaming uninvolved streaming uninvolved streaming uninvolved streaming uninvolved streaming uninvolved streaming uninvolved streaming uninvolved streaming uninvolved streaming uninvolved streaming centrally streaming centrally streaming centrally streaming uninvolved streaming centrally streaming centrally streaming centrally streaming centrally noting centrally noting streaming centrally noting streaming centrally streaming centrally streaming centrally streaming centrally streaming centrally streaming centrally streaming centrally noting streaming centrally noting streaming centrally streaming centrally streaming centrally streaming
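
As a sanity check on the two logs, here is a minimal sketch that recomputes BLEU-1 from the printed counts alone. It assumes the scorer uses the standard BLEU definition (unigram precision times a brevity penalty); the function name `bleu1_from_counts` is made up for this example and is not part of the repository. The numbers are copied from the output above.

```python
import math

def bleu1_from_counts(testlen, reflen, guess_1, correct_1):
    """Recompute BLEU-1 from logged counts (hypothetical helper, standard BLEU assumed)."""
    precision = correct_1 / guess_1
    # Brevity penalty only applies when the hypothesis is shorter than the reference.
    bp = 1.0 if testlen > reflen else math.exp(1.0 - reflen / testlen)
    return bp * precision

# Counts printed at the end of training (epoch 60).
train_run = bleu1_from_counts(testlen=14243, reflen=14388, guess_1=14243, correct_1=5755)

# Counts printed by the standalone --mode 'Test' run.
test_run = bleu1_from_counts(testlen=58800, reflen=14388, guess_1=58800, correct_1=13)

# Average number of generated tokens per test report (98 reports in the test split).
avg_len = 58800 / 98

print(f"train-time BLEU-1: {train_run:.7f}")
print(f"standalone BLEU-1: {test_run:.3e}")
print(f"avg tokens per report in standalone run: {avg_len}")
```

Both recomputed values agree with the logged test_BLEU_1 figures, so the metric code itself looks consistent. In the standalone run the average is exactly 600 tokens per report, about four times the reference length. That would be consistent with every report running to a fixed maximum length without an end token, although the log does not state such a cap, so treat that as a guess.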

Feb 17 '25 12:02 jasonlightx

I will check the code. Have you tried with our provided checkpoint? Does it still fail?

Feb 18 '25 02:02 cpystan

First of all, thank you for your answer. Yes, I have tried both your checkpoint and my own. The text generated in the testing phase is the same with either one, and it is almost entirely wrong. I don't know why.

Feb 18 '25 09:02 jasonlightx

So neither your checkpoint nor mine generates correct output?

Feb 18 '25 09:02 cpystan

yes

Feb 18 '25 09:02 jasonlightx

We found that the checkpoint we uploaded is wrong and should not be used for testing. We are very sorry for the mistake. We will upload a new checkpoint later.

Feb 27 '25 09:02 cpystan