Wrong decoding from ISO 2022 IR 100 to ISO IR 192 (UTF-8)?
Intro
Hello,
I was working on a feature that I wanted to contribute to OHIF. The feature is a component for rendering encapsulated reports. During my work, I noticed that reports were showing up with garbled characters, which suggests the text is being decoded with the wrong character set.
I looked at an internal sample DICOM SR, and the report switches its encoding to ISO 2022 IR 100 in the Findings node of the SR module.
Based on issues #84, #373, and #451, I am assuming that dcmjs sets the specific character set early on and performs conversions based on that.
If that is the case, perhaps it is not switching to a new decoder to properly decode the input encapsulated report to UTF-8.
I am new to dcmjs itself, and it looks like this will be an involved issue. I am hoping someone with more experience can point me in the right direction before I find the time to investigate further.
I will provide more information as I come across it.
Examples
HTML Rendering of Report
Expected HTML Input (Different Sample at Hand in Text Editor)
Per DCMTK
ASCII (ISO_IR 6) => (none)
UTF-8 "ISO_IR 192" => "UTF-8"
ISO Latin 1 "ISO_IR 100" => "ISO-8859-1"
ISO Latin 2 "ISO_IR 101" => "ISO-8859-2"
ISO Latin 3 "ISO_IR 109" => "ISO-8859-3"
ISO Latin 4 "ISO_IR 110" => "ISO-8859-4"
ISO Latin 5 "ISO_IR 148" => "ISO-8859-9"
ISO Latin 9 "ISO_IR 203" => "ISO-8859-15"
Cyrillic "ISO_IR 144" => "ISO-8859-5"
Arabic "ISO_IR 127" => "ISO-8859-6"
Greek "ISO_IR 126" => "ISO-8859-7"
Hebrew "ISO_IR 138" => "ISO-8859-8"
Thai "ISO_IR 166" => "TIS-620"
Japanese "ISO 2022 IR 13\ISO 2022 IR 87" => "ISO-2022-JP"
Korean "ISO 2022 IR 6\ISO 2022 IR 149" => "ISO-2022-KR"
Chinese "ISO 2022 IR 6\ISO 2022 IR 58" => "ISO-2022-CN"
Chinese "GB18030" => "GB18030"
Chinese "GBK" => "GBK"
https://support.dcmtk.org/docs/dsr2html.html
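For convenience, the mapping above can be transcribed into a lookup object (the key spellings follow the DCMTK docs; this is not necessarily dcmjs's internal representation):

```javascript
// DCMTK's defined-term -> IANA charset name mapping, from the dsr2html docs.
// Multi-valued SpecificCharacterSet terms are joined with a backslash, as in DICOM.
const dcmtkCharsetToIana = {
  "ISO_IR 192": "UTF-8",
  "ISO_IR 100": "ISO-8859-1",
  "ISO_IR 101": "ISO-8859-2",
  "ISO_IR 109": "ISO-8859-3",
  "ISO_IR 110": "ISO-8859-4",
  "ISO_IR 148": "ISO-8859-9",
  "ISO_IR 203": "ISO-8859-15",
  "ISO_IR 144": "ISO-8859-5",
  "ISO_IR 127": "ISO-8859-6",
  "ISO_IR 126": "ISO-8859-7",
  "ISO_IR 138": "ISO-8859-8",
  "ISO_IR 166": "TIS-620",
  "ISO 2022 IR 13\\ISO 2022 IR 87": "ISO-2022-JP",
  "ISO 2022 IR 6\\ISO 2022 IR 149": "ISO-2022-KR",
  "ISO 2022 IR 6\\ISO 2022 IR 58": "ISO-2022-CN",
  "GB18030": "GB18030",
  "GBK": "GBK",
};
```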
In the Codebase
The default decoder is latin1.
The decoder is set in DicomMessage.
The encoding scheme is mapped correctly.
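A minimal sketch of the flow just described, assuming a latin1 default that is only replaced when SpecificCharacterSet (0008,0005) names something else (illustrative only, not the actual dcmjs source):

```javascript
// Hypothetical decoder selection: fall back to latin1 when no (or an unknown)
// SpecificCharacterSet is present, otherwise map the defined term to a decoder.
const DEFAULT_ENCODING = "latin1";

function resolveEncoding(specificCharacterSet) {
  // A few representative entries; a real implementation would carry the full table.
  const mapping = {
    "ISO_IR 192": "utf-8",
    "ISO_IR 100": "latin1",
    "ISO 2022 IR 100": "latin1", // code-extension (ISO 2022) variant of Latin-1
  };
  return mapping[specificCharacterSet] || DEFAULT_ENCODING;
}
```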
Looking Back at a Sample SR
I cannot see an initial specific character set being specified, so I am going to assume everything starts as latin1 and then gets "switched" to latin1 again (since ISO 2022 IR 100 is the code-extension version of Latin-1)?
Conclusion
Now, I am not sure if the error is in the library or elsewhere. Everything looks as it should upon code review, but that does not explain why the DICOM read in OHIF resulted in a decoding defect.
Thanks for raising and working on this. I haven't got much experience with multiple encodings, but from what I understand the fix should be quite localized to the encoding/decoding of a specific list of VRs. Basically, escape characters in the byte array indicate switches between the character encodings listed in the SpecificCharacterSet element. Each of these sub-regions would then be converted to UTF-8 for JavaScript manipulation.
Probably writing out UTF-8 is all that would need to be supported.
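The idea above could be sketched roughly like this: split a value's bytes into runs at each ESC (0x1B) sequence, decode each run with the charset the escape selects, and concatenate the results into one JS string. The escape table here covers only the single sequence needed for the example; a real implementation needs the full set from the DICOM standard.

```javascript
// Hedged sketch of ISO 2022 code-extension decoding. Only the escape sequence
// ESC 2D 41 (select ISO-8859-1 into G1, i.e. ISO 2022 IR 100) is handled.
function splitIso2022Runs(bytes) {
  const escapes = { "\x2d\x41": "iso-8859-1" };
  const runs = [];
  let charset = "ascii"; // default repertoire until an escape is seen
  let start = 0;
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] === 0x1b && i + 2 < bytes.length) {
      runs.push({ charset, bytes: bytes.slice(start, i) });
      const tail = String.fromCharCode(bytes[i + 1], bytes[i + 2]);
      charset = escapes[tail] || charset; // unknown escapes are ignored here
      i += 2;
      start = i + 1;
    }
  }
  runs.push({ charset, bytes: bytes.slice(start) });
  return runs;
}

function decodeIso2022(bytes) {
  return splitIso2022Runs(bytes)
    .map((r) => new TextDecoder(r.charset).decode(Uint8Array.from(r.bytes)))
    .join("");
}
```

For example, the bytes `41 1B 2D 41 E9` would decode to "Aé": "A" in the default run, then "é" (0xE9) after the escape selects Latin-1.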
Looking at the implementation in pydicom seems helpful.
https://github.com/pydicom/pydicom/blob/3ec634e4bd2b7d87859b6b906defc3b0bb099c2a/src/pydicom/charset.py#L251
I apologize!
Please, disregard.
I assumed that the input dataset would get a pass through dcmjs and eventually get decoded. However, I think in my case the data is not getting converted at all, so this might be an OHIF issue I need to provide a patch for.
Before closing the ticket, I am going to take a moment to see if there are any ready-made exported functions I can use from the OHIF side to perform the same decoding as done in dcmjs. I think OHIF should always enforce decoding through this library if possible, since it looks like it is currently capable of handling this situation.
Again, I apologize. I jumped the gun.
Running a test snippet like the one below gave the correct decoding.
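For illustration (a reconstruction in the same spirit, not the exact snippet I ran): decoding ISO-8859-1 bytes with the matching decoder yields the expected text, while decoding UTF-8 bytes with an ISO-8859-1 decoder produces the classic mojibake.

```javascript
const latin1Bytes = Uint8Array.of(0xe9);     // "é" encoded as ISO-8859-1
const utf8Bytes = Uint8Array.of(0xc3, 0xa9); // "é" encoded as UTF-8

// Correct: bytes and decoder agree.
const correct = new TextDecoder("iso-8859-1").decode(latin1Bytes); // "é"

// Wrong: UTF-8 bytes run through a latin1 decoder become "Ã©".
const mojibake = new TextDecoder("iso-8859-1").decode(utf8Bytes);
```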
That's okay - I think the issue you pointed to (#373) is definitely a limitation in dcmjs and it would be great for someone to work on it someday.
Glad if it's not the source of the problem for you though - good luck 👍
@pieper I do not see an interface function or class for quick conversion of an arbitrary DICOM buffer of known initial encoding to UTF-8. I do see the components in dcmjs. Do you think I could go ahead and propose a PR exposing encodingMapping for consumption elsewhere (OHIF in my case)?
@luissantosHCIT yes, I think it would make sense for those mappings to be exposed. Probably also encapsulatedSyntaxes and singleVRs. Looking now from that perspective they might belong in the meta dictionary but exposing them from here is easier and probably just fine.
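A rough sketch of how the exposed mapping might be consumed on the OHIF side (the stand-in mapping contents and the helper name are hypothetical; in a real PR, encodingMapping would be imported from dcmjs rather than defined locally):

```javascript
// Stand-in for the real export, e.g. `import { encodingMapping } from "dcmjs";`.
// Only two illustrative entries are shown.
const encodingMapping = {
  "ISO_IR 100": "latin1",
  "ISO_IR 192": "utf-8",
};

// Hypothetical helper: decode a DICOM text buffer of known initial encoding
// into a JS string, falling back to latin1 like dcmjs's default.
function decodeDicomText(bytes, specificCharacterSet) {
  const label = encodingMapping[specificCharacterSet] || "latin1";
  return new TextDecoder(label).decode(bytes);
}
```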