_process_names in class DicomImageRedactorEngine unconditionally adds all DICOM metadata strings as PHI
Describe the bug _process_names in class DicomImageRedactorEngine unconditionally adds all DICOM metadata strings as PHI
To Reproduce
Code inspection: first line is phi_list = text_metadata.copy() and then it is just extended when is_name is
True:
def _process_names(cls, text_metadata: list, is_name: list) -> list:
"""Process names to have multiple iterations in our PHI list.
:param text_metadata: List of all the instance's element values
(excluding pixel data).
:param is_name: True if the element is specified as being a name.
:return: Metadata text with additional name iterations appended.
"""
phi_list = text_metadata.copy()
for i in range(0, len(text_metadata)):
if is_name[i] is True:
original_text = str(text_metadata[i])
phi_list += cls.augment_word(original_text)
return phi_list
A second issue is that it is converting MultiValue values into strings directly so you end up with an array of strings looking like: ["['1000', '234566']", 'Joe'] instead of the function returning an array of strings: ['1000', '234566', 'Joe'].
Expected behavior _process_names should only add words where is_name is True and should handle MultiValue metadata fields correctly.
Additional context Too many words are being removed when all we should be removing is PHI and multi-value meta strings are being added incorrectly.
@atinm Is your merged PR https://github.com/microsoft/presidio/pull/1723 mitigatig this?
@SharonHart no, I discovered this while doing this PR but didn't want to change the behavior without discussion as it would change behavior and metadata words and characters that are being redacted now might not be redacted anymore as they don't fall under name or patient (correctly) and might surprise users.