Re-populating MIMIC-IV Note with synthetic PHI
Prerequisites
- [X] Put an X between the brackets on this line if you have done all of the following:
- Checked the online documentation: https://mimic.mit.edu/
- Checked that your issue isn't already addressed: https://github.com/MIT-LCP/mimic-code/issues?utf8=%E2%9C%93&q=
Description
Hello! I am part of a project developing a program to deidentify free-text clinical notes for a specific application, and we are considering the MIMIC datasets as train/test data. For that purpose, we would have to re-populate the notes with synthetic PHI. In MIMIC-III, the category of each redacted PHI span is available, so a surrogate can be easily inserted, as the authors of [1] do, e.g., by using the Faker Python module. In MIMIC-IV, however, every redacted span is replaced with ___ and no information about the PHI category remains, which makes the re-population much more challenging.
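To make the MIMIC-III case concrete, here is a minimal sketch of category-based surrogate insertion. It is not the method from [1]. It assumes placeholders of the form `[**Category ...**]`, and the `SURROGATES` table and `fill_placeholders` function are hypothetical names made up for illustration. The hard-coded pools keep the example dependency-free; in practice they could be swapped for Faker providers.

```python
import random
import re

# Assumed MIMIC-III style placeholder, e.g. [**Hospital1 18**].
PLACEHOLDER = re.compile(r"\[\*\*(.*?)\*\*\]")

# Hypothetical surrogate pools, keyed by a substring of the category label.
# These are illustrative values, not real PHI and not an exhaustive list.
SURROGATES = {
    "first name": ["Alice", "Robert", "Maria"],
    "last name": ["Nguyen", "Smith", "Garcia"],
    "hospital": ["Riverside General Hospital", "Lakeview Medical Center"],
    "telephone": ["555-0100", "555-0199"],
}


def fill_placeholders(text, rng=None):
    """Replace each recognized placeholder with a random surrogate."""
    rng = rng or random.Random()

    def surrogate(match):
        label = match.group(1).strip().lower()
        for key, pool in SURROGATES.items():
            if key in label:
                return rng.choice(pool)
        # Unknown categories (e.g. dates) are left untouched in this sketch.
        return match.group(0)

    return PLACEHOLDER.sub(surrogate, text)


note = "Seen by Dr. [**Last Name (STitle) 12**] at [**Hospital1 18**]."
print(fill_placeholders(note, random.Random(0)))
```

This approach cannot work on MIMIC-IV as is, because a bare ___ gives the lookup nothing to match on.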
Would it be possible for you to provide guidance on how you would refill the notes with meaningful artificial PHI? Any tips would be helpful. Also, having access to the code that was used to deidentify the notes (as long as it does not raise privacy concerns) could be very useful.
Also, out of curiosity, why did the deidentification policy change between MIMIC-III and MIMIC-IV (from replacing PHI with its category tag to replacing it with ___)?
Note: I have read about Annotated MIMIC-IV, which is fantastic, but is only a subset of 100 notes.
Similar issues
#845 also tried to use MIMIC for deidentification research. #173 and #1848 also ask about the release of the deidentification procedure code. So it seems to me that there is a lot of interest in how the deidentification was done :)
References
[1] Pissarra, D., Curioso, I., Alveira, J., Pereira, D., Ribeiro, B., Souper, T., Gomes, V., Carreiro, A.V., & Rolla, V. (2024). Unlocking the Potential of Large Language Models for Clinical Text Anonymization: A Comparative Study. ArXiv, abs/2406.00062.