dicom-rs icon indicating copy to clipboard operation
dicom-rs copied to clipboard

Lazy loading of Multi-frame files

Open ibaryshnikov opened this issue 6 years ago • 16 comments

Hi @Enet4! Thank you for creating such an awesome library!

I have a specific use case: loading big .dcm files (over 500Mb) in browser. At the moment most of the browsers have a limit for webassembly memory of about 2Gb, and when I run my program, it usually crashes with out of memory error.

Is there a way we can collaborate to support lazy loading? In particular, I need to iterate over the PixelData items in Multi-frame file

ibaryshnikov avatar Jul 08 '19 18:07 ibaryshnikov

Hello! Thank you very much for your interest in this project!

While it's needless to say that this library isn't quite ready for use yet, I do intend to thrust development further into a first release in the near future. This one should have enough features to read most DICOM files in-memory. Lazy loading has been planned from the beginning in some way, but it might not be available in the first release. This should totally be supported eventually! After 0.1.0, I wish to conceive a roadmap with more concrete goals, that will allow me to lay down independent tasks and attract contributors in the process. Then, I would be more than glad to hear more from other experts and enthusiasts to drive this project forward.

Enet4 avatar Jul 08 '19 19:07 Enet4

I want to begin implementing this before completing #15.

There is a LazyDataSetReader currently implemented, but not being used. @Enet4, can you elaborate more on your intentions here? It looks like lazy loading is intended to be an "all or nothing" approach?

My thoughts:

  • Lazy loading should be built in to the standard RootDicomObject, regardless of the DicomObject being used (InMemDicomObject, or the presumed to-be-implemented LazyDicomObject). So, there should be a single DicomObject implementation.
  • The pivot to lazy load should be based on the input stream. Maybe create a custom trait the implements both Read and Seek, but with additional properties indicating if seek is supported, and the ability to build a factory token that can be used to rebuild the trait. We would provide helper functions to open a file stream, or a stream from an in-memory buffer, etc.
  • Another pivot point would be a parameter to indicate the maximum size value (in bytes) to load into memory, all larger elements being lazy loaded. A default value should be sane. This value would determine if an element should be lazy loaded, including OB, OW, and UT, etc. The default value would mostly catch Pixel Data values, in practice, but would also catch really large strings.
  • We should add a new PrimitiveValue::Lazy type that contains a pub enum LazyToken type. This token can be used to build a Read trait. This way, the values can stay unloaded, but can be partially buffered and written to another Write stream when writing to another file or network.
  • We should add a method to the DicomObject to load all lazy values into memory. This will modify any PrimitiveValue::Lazy to it's associated PrimitiveValue (based on VR).
  • The LazyToken will have a File(file_name, offset, num_of_bytes) value. There would be a method somewhere that would match this and return the appropriate Read trait.

I think this approach would ultimately remove the need for the InMemDicomObject, leaving just a DicomObject implementation.

Thoughts?

pauldotknopf avatar Sep 16 '19 04:09 pauldotknopf

That was nearly the whole idea behind it! A lazy DICOM object would keep track of the byte offset of larger elements from a data source and only fetch them on demand. I was working on a LazyDataSetReader because changes on the data set parser were needed to make this work efficiently. One could either finish this one or adapt the main DataSetReader to also be lazy and just eagerly read values when appropriate. More of my thoughts below.

Lazy loading should be built in to the standard RootDicomObject, regardless of the DicomObject being used (InMemDicomObject, or the presumed to-be-implemented LazyDicomObject). So, there should be a single DicomObject implementation.

I am not sure how this would work. The current purpose of the root DICOM object is to retain the Meta information table of objects loaded from a DICOM file. All methods of the interface are just delegated to the inner type. Can you expand a bit more on this idea?

It's worth noting that the trait DicomObject is certainly incomplete and will only make for the ideal DICOM object interface once generic associated types (GATs) land onto the language, which might take a while. Until then, some programs may just have to rely on specific implementation types or on reference types to overcome the lack of a generic lifetime on associated types.

The pivot to lazy load should be based on the input stream. Maybe create a custom trait the implements both Read and Seek, but with additional properties indicating if seek is supported, and the ability to build a factory token that can be used to rebuild the trait. We would provide helper functions to open a file stream, or a stream from an in-memory buffer, etc.

Sounds fair, although without trait specialization, this will have to resort to separate functions or builder types. In the long term, a generic factory type makes sense.

Another pivot point would be a parameter to indicate the maximum size value (in bytes) to load into memory, all larger elements being lazy loaded. A default value should be sane. This value would determine if an element should be lazy loaded, including OB, OW, and UT, etc. The default value would mostly catch Pixel Data values, in practice, but would also catch really large strings.

I agree that this should be done. Some values are too small to justify an indirection point.

We should add a new PrimitiveValue::Lazy type that contains a pub enum LazyToken type. This token can be used to build a Read trait. This way, the values can stay unloaded, but can be partially buffered and written to another Write stream when writing to another file or network.

I would rather design this the other way around: a LazyValue sum type which can either be Loaded(PrimitiveValue) or Unloaded(Token). This would preserve the concern of primitive values working specifically with in-memory representations.

We should add a method to the DicomObject to load all lazy values into memory. This will modify any PrimitiveValue::Lazy to it's associated PrimitiveValue (based on VR).

This continues the assumption that there would be a single DICOM object implementation, which I would like to hear more. In the event that none of the components of an object are lazy, a method to load all lazy values into memory could be implemented as a no-op. However, such an implementation in general may introduce an unnecessary overhead.

The LazyToken will have a File(file_name, offset, num_of_bytes) value. There would be a method somewhere that would match this and return the appropriate Read trait.

Not opposed to this, although I wonder if we could avoid repeating the file name across all elements. Wouldn't it also be better to keep the file open for the lazy implementation?

Enet4 avatar Sep 16 '19 09:09 Enet4

Thanks for the reply!

One could either finish this one or adapt the main DataSetReader to also be lazy and just eagerly read values when appropriate. More of my thoughts below.

I think I'd like to take a stab at adapting the current DataSetReader.

Lazy loading should be built in to the standard RootDicomObject, regardless of the DicomObject being used (InMemDicomObject, or the presumed to-be-implemented LazyDicomObject). So, there should be a single DicomObject implementation.

I am not sure how this would work. The current purpose of the root DICOM object is to retain the Meta information table of objects loaded from a DICOM file. All methods of the interface are just delegated to the inner type. Can you expand a bit more on this idea?

Sorry, I meant the DicomObject implementation. What I am getting at is that I think we should rename the current InMemDicomObject to StandardDicomObject, and have it serve all purposes.

The pivot to lazy load should be based on the input stream. Maybe create a custom trait the implements both Read and Seek, but with additional properties indicating if seek is supported, and the ability to build a factory token that can be used to rebuild the trait. We would provide helper functions to open a file stream, or a stream from an in-memory buffer, etc.

Sounds fair, although without trait specialization, this will have to resort to separate functions or builder types. In the long term, a generic factory type makes sense.

Ya. I'm thinking a specialized trait with all the info we need. The public API would only work with this specialized trait, but their will be factory methods to wrap simple Read instances, or open a file, etc.

We should add a new PrimitiveValue::Lazy type that contains a pub enum LazyToken type. This token can be used to build a Read trait. This way, the values can stay unloaded, but can be partially buffered and written to another Write stream when writing to another file or network.

I would rather design this the other way around: a LazyValue sum type which can either be Loaded(PrimitiveValue) or Unloaded(Token). This would preserve the concern of primitive values working specifically with in-memory representations.

Absolutely. Makes sense.

We should add a method to the DicomObject to load all lazy values into memory. This will modify any PrimitiveValue::Lazy to it's associated PrimitiveValue (based on VR).

This continues the assumption that there would be a single DICOM object implementation, which I would like to hear more. In the event that none of the components of an object are lazy, a method to load all lazy values into memory could be implemented as a no-op. However, such an implementation in general may introduce an unnecessary overhead.

Overhead with regards to performance? Or just implementation overhead? How would you deal with future tasks, such as DataSetWriter (which I'd like to get to next)? Separate write functions for both InMemDicomObject and LazyDicomObject? Or, move some lazy logic into the DataObject trait, at which point, why not just merge the concept at the implementation level?

The LazyToken will have a File(file_name, offset, num_of_bytes) value. There would be a method somewhere that would match this and return the appropriate Read trait.

Not opposed to this, although I wonder if we could avoid repeating the file name across all elements. Wouldn't it also be better to keep the file open for the lazy implementation?

Good point. Considering that, in practice, only the PixelData frames would be lazy, is it worth hurting the ergonomics of having to pass an instance of an open file handle throughout the API? For example, when I implement store SCP, there will be callbacks for fn on_object_received(dicom_object: RootDicomObject<StandardDicomObject>). I wouldn't like having to pass around the file name in the API as well. Library consumers should just work with it directly and not care where the object came from.

pauldotknopf avatar Sep 16 '19 13:09 pauldotknopf

With regard to the last point, I think there should also be the option of "loading on demand" these values. Requiring an optional Read trait/instance on all the relevant StandardDicomObject methods for the internal "load on demand" would be weird to the library consumer, IMO.

I'm a big fan of dcmtk's approach here. That concern isn't visible at all in the public API.

pauldotknopf avatar Sep 16 '19 13:09 pauldotknopf

Sorry, I meant the DicomObject implementation. What I am getting at is that I think we should rename the current InMemDicomObject to StandardDicomObject, and have it serve all purposes.

This would mean that the standard implementation would always consider the possibility of a value being stored somewhere else. The overhead would come from testing the lazy value containers for something that, in some cases, will always be there. I am not opposed to LazyDicomObject becoming the default type for most cases, but InMemDicomObject is best left with a pure in-memory strategy for now.

Ya. I'm thinking a specialized trait with all the info we need. The public API would only work with this specialized trait, but their will be factory methods to wrap simple Read instances, or open a file, etc.

We can't generically implement a trait for all Read + ?Seek types. It would have to be exhaustively implemented for all types in the standard library, and that would miss out on types in other crates. The way I see it, these constraints have to be outlined by the factory function (or the Options/Builder type).

fn from_reader<R>(reader: R) -> InMem; // no seek
fn from_seek_reader<R: Read + Seek>(reader: R) -> Lazy; // can seek
fn open_file<P: AsRef<Path>>(path: P) -> Lazy; // can seek in a file

We can then provide clear instructions on when to use which.

How would you deal with future tasks, such as DataSetWriter (which I'd like to get to next)? Separate write functions for both InMemDicomObject and LazyDicomObject? Or, move some lazy logic into the DataObject trait, at which point, why not just merge the concept at the implementation level?

The way I'd tackle this is build internal functions for writing an iterator of DICOM elements to a writer: impl Writer based on a given transfer syntax. Since all DICOM object implementations should implement element iteration, we get a common implementation of the writing logic regardless of whether the object is lazy or not, and all of them could rely on it to expose some public API, like a save method. I hope this doesn't require GATs. :thinking:

Good point. Considering that, in practice, only the PixelData frames would be lazy, is it worth hurting the ergonomics of having to pass an instance of an open file handle throughout the API? For example, when I implement store SCP, there will be callbacks for fn on_object_received(dicom_object: RootDicomObject<StandardDicomObject>). I wouldn't like having to pass around the file name in the API as well. Library consumers should just work with it directly and not care where the object came from.

I agree that working with an object should not require keeping the file name around separately, and this so far has not been put to question. What I meant is that the resource keeping the file descriptor would be stored alongside the other fields of the lazy DICOM object. And through the main API, the user never gets to see a LazyToken when retrieving an attribute, it would be resolved to a loaded attribute via interior mutability and handed over through a suitable reference type.

Enet4 avatar Sep 16 '19 15:09 Enet4

Sorry, I meant the DicomObject implementation. What I am getting at is that I think we should rename the current InMemDicomObject to StandardDicomObject, and have it serve all purposes.

This would mean that the standard implementation would always consider the possibility of a value being stored somewhere else. The overhead would come from testing the lazy value containers for something that, in some cases, will always be there. I am not opposed to LazyDicomObject becoming the default type for most cases, but InMemDicomObject is best left with a pure in-memory strategy for now.

Is there something beyond a match case for the Loaded(PrimitiveValue) or Unloaded(Token) values? Maybe I'm missing something, but I'm not understanding the overhead you're referring to.

How would you deal with future tasks, such as DataSetWriter (which I'd like to get to next)? Separate write functions for both InMemDicomObject and LazyDicomObject? Or, move some lazy logic into the DataObject trait, at which point, why not just merge the concept at the implementation level?

The way I'd tackle this is build internal functions for writing an iterator of DICOM elements to a writer: impl Writer based on a given transfer syntax. Since all DICOM object implementations should implement element iteration, we get a common implementation of the writing logic regardless of whether the object is lazy or not, and all of them could rely on it to expose some public API, like a save method. I hope this doesn't require GATs.

This has an assumption that breaks my need of never actually loading a value when writing to another writer. Let me explain. This iterator you're speaking of wouldn't expose the concept of Lazy. The implementations would have to fully load the item, on demand. But, these values can be up to 4GB. It would be much more efficient if the DataSetWriter knew if the underlying value was lazy, so that it could load a Read trait for it and buffer the bytes into the Write trait, with say, a 4MB buffer. This would leave the underlying value of the DICOM object still being Unloaded, which is ideal.

So to reiterate, IMO, the DataSetWriter has to have knowledge of a value being unloaded/loaded, so that it could more efficiently buffer the bytes from Read into Write. Think of uploading 4GB files to a network stream (C-STORE).

So we need to either have twoDataSetWriters for the both InMemDicomObject and the LazyDicomObject, or merge the concept of lazy/memory objects into a single StandardDicomObject with a single DataSetWriter.

I agree that working with an object should not require keeping the file name around separately, and this so far has not been put to question. What I meant is that the resource keeping the file descriptor would be stored alongside the other fields of the lazy DICOM object. And through the main API, the user never gets to see a LazyToken when retrieving an attribute, it would be resolved to a loaded attribute via interior mutability and handed over through a suitable reference type.

Ok, so the difference between our thoughts is that I'd like to keep the path to the file and open on demand, whereas you'd like to keep the handle of the file itself?

There would be multiple open read handles, one for each unloaded lazy value. I'm not sure that is a good idea. This might be problematic if in the future, we decide that we'd like to to be able to modify PixelData values without loading the entire image into memory, which we couldn't do if there was another open read handle to another lazy value. Also, what about async operations? Multiple threads seeking a single file handle? Multi-frame images? Maybe I'm over thinking this.

pauldotknopf avatar Sep 16 '19 19:09 pauldotknopf

I think we're getting closer to common grounds, hang in there!

Is there something beyond a match case for the Loaded(PrimitiveValue) or Unloaded(Token) values? Maybe I'm missing something, but I'm not understanding the overhead you're referring to.

That would be the first that comes to mind. But in order to be able to turn an unloaded element into a loaded element transparently through the public attribute API, we need interior mutability. This, I believe, would be best achieved by wrapping a RefCell around the lazy value, which also adds a few more runtime checks.

It would be much more efficient if the DataSetWriter knew if the underlying value was lazy, so that it could load a Read trait for it and buffer the bytes into the Write trait, with say, a 4MB buffer. This would leave the underlying value of the DICOM object still being Unloaded, which is ideal.

So to reiterate, IMO, the DataSetWriter has to have knowledge of a value being unloaded/loaded, so that it could more efficiently buffer the bytes from Read into Write. Think of uploading 4GB files to a network stream (C-STORE).

Ah, that is an excellent point. So, we must be able to retrieve a byte stream from an element, without requiring it to be fully in memory. We can try and generalize the concept of a DICOM element type into something like an AttributeEntry trait, for values created upon attribute() and which encompass streamed reading, as well as value loading and casting from textual values. There may also be a way for in-memory entries to impl this trait, so we'd make that work for any type of DICOM object.

There would be multiple open read handles, one for each unloaded lazy value. I'm not sure that is a good idea. This might be problematic if in the future, we decide that we'd like to to be able to modify PixelData values without loading the entire image into memory, which we couldn't do if there was another open read handle to another lazy value.

Having multiple read handles does sound like too much, that is not what I intended. We can instead give users of attribute() an attribute entry which includes a borrow of the handle. The compiler will automatically prevent access to multiple lazy attributes at a time, and if this is a problem, we resort once again to interior mutability for the file handle.

Also, what about async operations? Multiple threads seeking a single file handle? Multi-frame images? Maybe I'm over thinking this.

If I understood correctly, you would like to be able to read multiple frames via multiple threads? These RefCells for obtaining interior mutability make it so that a lazy DICOM object isSend, but not Sync. Concurrent file reading, assuming it is an important use case, imposes yet another requirement that I did not quite predict so far. As long as we maintain a flexible public API, we should be able to maintain multiple implementations and evaluate them side by side. One of them could be a DICOM object supporting concurrent value reading.

Enet4 avatar Sep 16 '19 20:09 Enet4

Is there something beyond a match case for the Loaded(PrimitiveValue) or Unloaded(Token) values? Maybe I'm missing something, but I'm not understanding the overhead you're referring to.

That would be the first that comes to mind. But in order to be able to turn an unloaded element into a loaded element transparently through the public attribute API, we need interior mutability. This, I believe, would be best achieved by wrapping a RefCell around the lazy value, which also adds a few more runtime checks.

Ah, I see. I don't know the exact runtime costs of RefCell, but are we sure it's worth the bifurcation?

It would be much more efficient if the DataSetWriter knew if the underlying value was lazy, so that it could load a Read trait for it and buffer the bytes into the Write trait, with say, a 4MB buffer. This would leave the underlying value of the DICOM object still being Unloaded, which is ideal. So to reiterate, IMO, the DataSetWriter has to have knowledge of a value being unloaded/loaded, so that it could more efficiently buffer the bytes from Read into Write. Think of uploading 4GB files to a network stream (C-STORE).

Ah, that is an excellent point. So, we must be able to retrieve a byte stream from an element, without requiring it to be fully in memory. We can try and generalize the concept of a DICOM element type into something like an AttributeEntry trait, for values created upon attribute() and which encompass streamed reading, as well as value loading and casting from textual values. There may also be a way for in-memory entries to impl this trait, so we'd make that work for any type of DICOM object.

Yeah, that sounds like a good approach. Back to your previous point about RefCell performance, couldn't we work with the AttributeEntry trait to also ensure that if a value is eagerly loaded, that no interior RefCell is used/needed? Therefor, removing your run-time performance concerns of using RefCell? Two implementations for AttributeEntry, a ConcreteAttributeEntry, and a LazyAttributeEntry?

Having multiple read handles does sound like too much, that is not what I intended. We can instead give users of attribute() an attribute entry which includes a borrow of the handle. The compiler will automatically prevent access to multiple lazy attributes at a time, and if this is a problem, we resort once again to interior mutability for the file handle.

I see. I get where you are coming from, but this sounds like territory in which I'd need to be a little bit more comfortable in writing idiomatic Rust. I'm a quick learner though ;)

If I understood correctly, you would like to be able to read multiple frames via multiple threads? These RefCells for obtaining interior mutability make it so that a lazy DICOM object isSend, but not Sync. Concurrent file reading, assuming it is an important use case, imposes yet another requirement that I did not quite predict so far. As long as we maintain a flexible public API, we should be able to maintain multiple implementations and evaluate them side by side. One of them could be a DICOM object supporting concurrent value reading.

Yeah, I'm over thinking it. Ignore that bit.

pauldotknopf avatar Sep 16 '19 20:09 pauldotknopf

Back to your previous point about RefCell performance, couldn't we work with the AttributeEntry trait to also ensure that if a value is eagerly loaded, that no interior RefCell is used/needed? Therefor, removing your run-time performance concerns of using RefCell? Two implementations for AttributeEntry, a ConcreteAttributeEntry, and a LazyAttributeEntry?

That is precisely what I had in mind! It would then be defined for each DICOM object impl using an associated type. Ideally, this associated type would be generic over the object's lifetime, but we'll have to make do without that for the time being.

I see. I get where you are coming from, but this sounds like territory in which I'd need to be a little bit more comfortable in writing idiomatic Rust. I'm a quick learner though ;)

No worries, I'd be more than willing to mentor the development of these mechanisms, or take the matter by my own hands if it gets overly difficult.

Enet4 avatar Sep 17 '19 08:09 Enet4

Lazy loading of any DICOM file is one of the topics I wish to tackle next, as this inevitably requires a bit of tweaking in nearly all crates of the library (changes to the core crate are usually viral).

Enet4 avatar May 17 '20 15:05 Enet4

@pauldotknopf This is next in my work queue, so I would like to keep you informed in case there was any eventual progress on your end or concerns you would like to bring up at this point.

I plan to perform the following major changes:

  1. DICOM object implementations will provide a way to retrieve an attribute instead of a data element directly. It would have a similar API to a data element, except that it would also provide an impl Reader for streamed access to the value.
  2. In order to support a lazy DICOM object, there will either be a new data set reader or a change to the existing one, so that it becomes an attached iterator (or streaming iterator) of tokens. This means that tokens will no longer need to hold the entire value in memory, as they will be able to have exclusive access to the inner reader source when needed.

Lazy writing might be left as a separate task for now.

Enet4 avatar Sep 14 '20 11:09 Enet4

@Enet4 Hi! Has there been any progress on this issue? I am using this library in a project and am very happy with it. The only thing missing (for my usecase) is the posibillity to load frames individually.

eseaflower avatar Feb 19 '21 13:02 eseaflower

@eseaflower Thank you for reaching out. I have some developments in a separate branch for an alternative parser, but an implementation of a DICOM object with lazy losing is still in the works, and may take a while before it is working and usable. I can provide another status update as more progress is made in the future.

Enet4 avatar Feb 19 '21 13:02 Enet4

@Enet4 Thank you, I will be keen on hearing how you proceede.

eseaflower avatar Feb 19 '21 13:02 eseaflower

With #126 merged, work for an implementation of a lazily loaded DICOM object has gone one step further. A look-up table of byte positions will be built alongside the first reading pass, which will enable seeking back to a previously unloaded value and fetch it on demand.

Enet4 avatar Jun 06 '21 10:06 Enet4