
Discrepancy in Model Performance Between MIMIC-50 and MIMIC-Full Datasets with Automatic Mixed Precision

Open · Ahmed-Mortadi opened this issue 2 years ago · 4 comments

I have encountered a performance discrepancy between the MIMIC-50 and MIMIC-Full datasets while training with automatic mixed precision. I used the same configuration settings and training parameters for both datasets, aiming to reproduce the results from the paper. While the results for MIMIC-50 are reasonably close to the expected outcomes, the results for MIMIC-Full show a substantial gap.
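
For context on the automatic mixed precision setup: the sketch below illustrates the standard torch.cuda.amp training-step pattern (autocast plus dynamic loss scaling), which is essentially what AMP amounts to regardless of how it is switched on. The toy model, optimizer, and data are placeholders for illustration only, not the repository's actual training loop:

```python
import torch
from torch import nn

# Placeholder model/optimizer/data so the sketch runs on its own;
# in practice these come from the PLM-ICD training script.
model = nn.Linear(768, 50).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
grad_accum = 8  # mirrors --gradient_accumulation_steps

scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for fp16

for step in range(100):
    x = torch.randn(1, 768, device="cuda")
    y = torch.randint(0, 2, (1, 50), device="cuda").float()

    with torch.cuda.amp.autocast():  # forward pass runs in mixed precision
        logits = model(x)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, y) / grad_accum

    scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow

    if (step + 1) % grad_accum == 0:
        scaler.step(optimizer)  # unscales gradients; skips the step if they are inf/NaN
        scaler.update()
        optimizer.zero_grad()
```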

Details:

  1. MIMIC-50 Configuration:

    • --max_length: 3072
    • --chunk_size: 128
    • --model_name_or_path: RoBERTa-base-PM-M3-Voc/RoBERTa-base-PM-M3-Voc-hf
    • --per_device_train_batch_size: 1
    • --gradient_accumulation_steps: 8
    • --per_device_eval_batch_size: 1
    • --num_train_epochs: 20
    • --num_warmup_steps: 2000
    • --model_type: roberta
    • --model_mode: laat

    Results for MIMIC-50 with Automatic Mixed Precision:

    • For the best threshold (0.45):
      • f1_micro: 66.96
      • prec_micro: 67.02
      • rec_micro: 66.89

    Paper F1 Result: 71.00

  2. MIMIC-Full Configuration:

    • --max_length: 3072
    • --chunk_size: 128
    • --model_name_or_path: RoBERTa-base-PM-M3-Voc/RoBERTa-base-PM-M3-Voc-hf
    • --per_device_train_batch_size: 1
    • --gradient_accumulation_steps: 8
    • --per_device_eval_batch_size: 1
    • --num_train_epochs: 20
    • --num_warmup_steps: 2000
    • --model_type: roberta
    • --model_mode: laat

    Results for MIMIC-Full with Automatic Mixed Precision:

    • For the best threshold (0.2; see the threshold-selection sketch after this list):
      • f1_micro: 13.68
      • prec_micro: 35.35
      • rec_micro: 8.48

    Paper F1 Result: 59.8
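
For reference on how the "best threshold" values above are obtained: they come from sweeping a decision threshold over the sigmoid outputs on the dev set. Below is a minimal sketch of one such selection; it picks the threshold with the highest micro-F1 (an assumption about the criterion) and uses scikit-learn rather than the repository's exact evaluation code:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def sweep_thresholds(probs, labels, thresholds=np.arange(0.05, 0.95, 0.05)):
    """Pick the decision threshold that maximizes micro-F1.

    probs:  (n_samples, n_labels) sigmoid outputs
    labels: (n_samples, n_labels) binary ground truth
    """
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        preds = (probs >= t).astype(int)
        f1 = f1_score(labels, preds, average="micro", zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1

    preds = (probs >= best_t).astype(int)
    return {
        "threshold": best_t,
        "f1_micro": best_f1,
        "prec_micro": precision_score(labels, preds, average="micro", zero_division=0),
        "rec_micro": recall_score(labels, preds, average="micro", zero_division=0),
    }
```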

I kindly request assistance in diagnosing and resolving the performance issue encountered with the MIMIC-Full dataset. The goal is to align the results with the paper's reported metrics as closely as possible.

Thank you for your attention and support in addressing this matter.

Ahmed-Mortadi · Oct 25 '23 07:10

Hello @AHMAD-DOMA, I have also managed to reproduce the results presented in the paper for the MIMIC-III full dataset, although I haven't yet done so for the top-50 codes. I used the same parameters as in the paper and found the optimal threshold to be 0.5. The resulting metrics are as follows:

Best Threshold: 0.5

Performance Metrics:

Macro Accuracy: 0.0589
Macro Precision: 0.0984
Macro Recall: 0.0727
Macro F1 Score: 0.0836   
Micro Accuracy: 0.4059
Micro Precision: 0.7148  <---
Micro Recall: 0.4844     <---
Micro F1 Score: 0.5775   <---
Precision at 8: 0.7644
Recall at 8: 0.4026
F1 Score at 8: 0.5274
Macro AUC: 0.9237
Micro AUC: 0.9892
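
For the precision/recall at 8 figures, the usual definition is top-k based; a rough sketch of how they can be computed (not necessarily identical to the repository's evaluation code):

```python
import numpy as np

def precision_recall_at_k(probs, labels, k=8):
    """Precision@k and recall@k for multi-label ICD prediction.

    probs:  (n_samples, n_labels) predicted scores
    labels: (n_samples, n_labels) binary ground truth
    """
    topk = np.argsort(-probs, axis=1)[:, :k]                   # indices of the k highest-scoring codes
    hits = np.take_along_axis(labels, topk, axis=1).sum(axis=1)
    precision_at_k = (hits / k).mean()
    recall_at_k = (hits / np.maximum(labels.sum(axis=1), 1)).mean()
    return precision_at_k, recall_at_k
```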

FareedKhan-dev · Oct 29 '23 04:10

Thank you, @FareedKhan-dev. If my understanding is correct, you followed the preprocessing steps described in the README and trained for 20 epochs. If so, could you please share the configuration of your experiment?

Ahmed-Mortadi · Oct 29 '23 05:10

I apologize for any misunderstanding. To clarify, I didn't perform any training, only the preprocessing. My results are generated with the pre-trained model the authors provide in the README.

FareedKhan-dev · Oct 29 '23 05:10

Hi @AHMAD-DOMA,

Thank you for your interest in our work! I'd say that the discrepancy in the MIMIC-full configuration is so significant that I suspect there's something wrong with the training process. The following factors might be relevant:

  • The code might not work with newer versions of the transformers library. I used v4.5.1 in my experiments.
  • The way you derive the ALL_CODES.txt file might be different. For reference, I have uploaded the code list I used alongside the pretrained checkpoints (see the quick sanity check sketched below).

I'd suggest using the pretrained checkpoints directly as it's the easiest way to replicate the results.
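
As a quick sanity check, you can compare your ALL_CODES.txt against the one shipped with the checkpoints; if the code set or its ordering differs, the label indices of the classification head may not line up with your data. A rough sketch (the file paths below are placeholders):

```python
def load_codes(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

mine = load_codes("data/mimic3/ALL_CODES.txt")                # placeholder path
theirs = load_codes("checkpoints/mimic3-full/ALL_CODES.txt")  # placeholder path

print("same set:  ", set(mine) == set(theirs))
print("same order:", mine == theirs)
print("only in mine:  ", sorted(set(mine) - set(theirs))[:10])
print("only in theirs:", sorted(set(theirs) - set(mine))[:10])
```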

Best, Chao-Wei

chaoweihuang · Mar 05 '24 04:03