dicom Parsing OB/OW element of undefined length fails and temporarily allocates 4GB of memory

This is along the same lines as #220 in that it manifests as trying to read a value as a 4GB value and getting unexpected EOF.

For a DICOM file with an explicit transfer syntax set, a value that is of a byte or string type (anything that falls to readBytes or readString - this includes e.g. OB, OW, and LT) and has length Undefined Length (i.e. 0xffffffff), the library will allocate a 4GB []byte and try and read to it, producing an error like:

readElement: error when reading value for element (0049,1001): unexpected EOF

I've included a trivial file that can trigger this case - it consists only of the DICOM header, a transfer syntax, patient name (JOHN^DOE) and a tag 0049,1001 (a tag we encountered in the wild that triggered this error) with value type OB and undefined length, with an actual length of 18 characters (I've had to put it in a zip archive, as github doesn't allow uploading of .dcm files): exemplar.dcm.zip

Getting an error in this case is expected. DCMTK will similarly reject this file:

❯ dcm2json exemplar.dcm
E: DcmItem: Parse error in (0049,1001): Illegal element with OB or OW Value Representation and undefined length encountered
E: dcm2json: error (Illegal element with OB or OW Value Representation and undefined length encountered) reading file: exemplar.dcm

The issue is that the library is allocating the 4GB []byte, even though it's discarded pretty quickly (and theoretically garbage collected). With our workflow, we can be reading and parsing several files concurrently, and normally we can use the size of the file to keep memory usage under control. However, because this can trigger even on small files, allocating (and subsequently attempting to write to, so it can't just be handwaved away to virtual memory) the large []byte can pretty quickly cause out-of-memory issues as the allocations & reads can all happen before the memory is freed in garbage collection.

I confirmed that the allocation is happening by adding a log line to readBytes right before the allocation here:

https://github.com/suyashkumar/dicom/blob/0fbaef53037ef4c533a9e6004dfc5b7b224ecc39/read.go#L643-L648

I'm not entirely sure what the correct behaviour here is per the spec. From 7.1.1 Data Element Fields:

Undefined Lengths may be used for Data Elements having the Value Representation (VR) Sequence of Items (SQ) and Unknown (UN). For Data Elements with Value Representation OW or OB Undefined Length may be used depending on the negotiated Transfer Syntax

My interpretation of this is that Undefined Length should not be permitted for OB or OW, unless the transfer syntax specifies an exception (e.g. pixel data may be encoded as OB with Undefined Length).

Given that exceptions like pixel data should already be handled, I think a pretty safe fix would be to add a check to both readBytes and readString to preemptively reject the read if the vl is tag.VLUndefinedLength. I'll make a PR along those lines, but would appreciate your input on my interpretation of the spec in this instance.

May 20 '25 15:05 patrickwhite256

Actually, on further reading of the spec, this is covered under PS3.5 7.1.2:

for VRs of OB, OD, OF, OL, OV, OW, SQ and UN, if the Value Field has an Explicit Length, then the Value Length Field shall contain a value equal to the length (in bytes) of the Value Field, otherwise, the Value Field has an Undefined Length and a Sequence Delimitation Item marks the end of the Value Field.

So it sounds like in this case (explicit OB/OW field with undefined length) readElement should fall to readSequence, similar to the existing handling for UN elements. I'll update my PR.

May 21 '25 14:05 patrickwhite256

@patrickwhite256 thanks for reporting this and for your detailed issue! I fully agree it's worth addressing and threading the needle here.

Overall I agree with your interpretation that we should have logic in readElement (or perhaps, in readValue instead if it slots in there nicely, since that's where the UN fallback is) fall through to readSequence for the appropriate VRs when they have undefined length VLs. This is a good find and will make our library a bit more robust here--and perhaps the DICOMs aren't malformed after all, and have some data buried in there (though up to the user on what they want to do with it).

There's a bit of a hack at the moment (which I don't love) which should protect the PixelData implementation in the readValue switch. That is, we assign a special "VR" internal enum "VRKind" that differentiates PixelData from other tags with OW/OB here so PixelData tags should always be routed to readPixelData anyway.

Feel free to update the PR when you have a chance, and thank you for your contribution!

May 21 '25 23:05 suyashkumar

I've been digging in on this a bit. I believe the way that sequences are parsed for UN values is slightly incorrect, and so doesn't work in this case.

Unfortunately I'm not at liberty to share the full file I'm working with (even anonymized), but I can share the problematic segment.

The transfer syntax for this file is JPEG Lossless, Non-Hierarchical, First-Order Prediction (Process 14 [Selection Value 1]) -notably, an explicit transfer syntax.

000013d0: 424f 0000 0000 0000 0049 0010 4f4c 0014   OB......I...LO..
000013e0: 4547 534d 435f 5f54 4143 4452 4149 5f43   GEMS_CT_CARDIAC_
000013f0: 3030 2031 0049 1001 424f 0000 ffff ffff   001 I...OB......
00001400: fffe e000 008a 0000 0049 0010 0014 0000   ........I.......
00001410: 4547 534d 435f 5f54 4143 4452 4149 5f43   GEMS_CT_CARDIAC_
00001420: 3030 2031 0049 1002 0002 0000 3336 0049   001 I.......63I.

There's a private creator tag (0049,0010) with type LO with length 0014, and a value GEMS_CT_CARDIAC_001
followed by the private tag (0049,1001) which identifies itself as OB with length ffff ffff (undefined)
Within that, there's an item delimiter tag fffe e000 with length 008a.
Then, there is another private creator tag (0049,0010), similar to the one in the non-nested dataset. However, this is not followed by a VR, it jumps straight to the length (0014) and the value. This has dropped into ILE representation, despite the main dataset being explicit.

Trying to switch the library to read a sequence in this case results in an error as it expects the VR present, so misreads the sequence:

2025/05/22 09:15:45 readElement: tag: (fffe,e000)
2025/05/22 09:15:45 readElement: vr: NA
2025/05/22 09:15:45 readElement: vl: 138
2025/05/22 09:15:45 readElement: tag: (0049,0010)
2025/05/22 09:15:45 readElement: vr:
2025/05/22 09:15:45 readElement: vl: 0
2025/05/22 09:15:45 readElement: tag: (4547,534d)
2025/05/22 09:15:45 readElement: vr: _C
2025/05/22 09:15:45 readElement: vl: 24404
readElement: error when reading value for element (0049,1001): readSequence: error reading subitem in a sequence: readElement: error when reading value for element (fffe,e000): readSequenceItem: error reading subitem in a sequence item: readElement: error when reading value for element (4547,534d): error reading string element ((4547,534d)) value: unexpected EOF

This behaviour is as described in CP 246.

CP 246 only covers cases where the VR is UN, so doesn't apply directly to this situation, but it seems worth noting since this could be a separate bug, since by my reading the current handling of UN always uses the underlying reader's "explicitness" and doesn't do this switch to implicit.

I've checked the behaviour of how a few other libraries handle my case.

pydicom falls to the case here.

This first tries (and fails) to read encapsulated pixel data, then essentially scans for the sequence delimiter tag and puts the entire result in the element's data value. This seems to match the intended behaviour as described in PS3.5 7.1.2.

We could do something similar using bufio.Reader's ReadBytes, but it's tricky to have that respect the dicomio.Reader's limit and updating bytesRead appropriately.

fo-dicom falls to the case here.

This alternates reading a sequence item tag, and then reading the item's length into a "fragment" (that ends up being the element's data value).

These two approaches end up with different values (pydicom's implementation ends up with a byte array of length 146, while fo-dicom's ends up with a byte array of length 138, since it doesn't include the item delimiter tag). My interpretation is that pydicom's is more close to what the spec calls for, but would be harder to implement because of the semantics of dicomio.Reader.

As whatever decision is made is yours to maintain, I'd like your opinion before implementing this.

May 22 '25 19:05 patrickwhite256

@patrickwhite256 thank you for the deep dive into this, really helpful!

My initial take is to not follow pydicom's behavior. To me it makes more sense to hold the raw data without also including the dicom item tags, since that's the useful "payload" of this element. (Let me know if I'm missing something). It feels to me that the dicom item tags are just "packaging" and not actual content. I don't feel that strongly, though.

In terms of implementation, can we make use of readRawItem?

We'd need to detect this circumstance in readValue and dispatch to a reader for these kinds of raw items, if I understand correctly.

Question for you:

Within that, there's an item delimiter tag fffe e000 with length 008a.

Is this a item delimiter tag or is this an item tag?

Does your hex dump show where the actual item delimitation tag is? Or is that later (or non-existent)? Trying to understand if we need to read until the item delimitation tag OR if we just read one raw item with the VL it includes. (I guess that we should read to the item delimitation tag and that we may have multiple items right)?

Sorry if I'm missing something--tried to do a quick pass on this.

May 23 '25 20:05 suyashkumar

We might be able to do something like the below. Though I wonder if we should be concatenating allData into a single byte slice if there are multiple items. OB's output is supposed to be a giant byte slice, so keeping it that way would keep things consistent. Do you see multiple items in any of your examples?

Thanks again!

func (r *reader) readOBUndefinedLength(t tag.Tag, vr string) (Value, error) {
    var allData []byte
    
    // Keep reading items until sequence delimitation
    for !r.rawReader.IsLimitExhausted() {
        itemData, endOfItems, err := r.readRawItem(false)
        if err != nil {
            return nil, err
        }
        
        if endOfItems {  // Found fffe e0dd
            break
        }
        
        allData = append(allData, itemData...)  // Concatenate all items
    }
    
    return &bytesValue{value: allData}, nil
}

May 23 '25 21:05 suyashkumar

Is this a item delimiter tag or is this an item tag?

Sorry, you're right - I meant "item tag".

Does your hex dump show where the actual item delimitation tag is? Or is that later (or non-existent)

The sequence delimitation is included later. I cut the snippet for brevity. Here's the later part containing the item delimitation tag at the expected spot, 138 bytes later (I've omitted some data, marked by -):

00001490: ---- fffe e0dd 0000 0000 0053 0010 4f4c   --........S...LO

I haven't seen any examples with multiple items, but I agree that concatenating them would be a sensible thing to do. I'm not sure it makes sense for there to be multiple items based on the wording of PS3.5 7.1.2, but I don't think it would hurt to account for multiple items.

I think your suggestion is a good solution for this issue.

May 23 '25 21:05 patrickwhite256

Fantastic, sounds great! Feel free to put up / update the PR if you have time! (If not, let me know, and happy to help).

May 23 '25 21:05 suyashkumar

I've updated the PR! I think I'll file a separate bug for the other issue (incorrect "explicit"-ness in UN elements).

May 26 '25 16:05 patrickwhite256