WIP: Add memory efficient meta data summary

Open moloney opened this issue 4 years ago • 10 comments

This is work in progress on data structures that build a memory-efficient summary of a sequence of meta data dictionaries (assuming a large number of keys/values repeat), and on using that summary to determine how to sort the associated images into an nD array.

This approach was inspired by this dcmstack issue.
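
To make the idea concrete, here is a minimal sketch of one way such a summary could be stored. It is not the code in `nibabel/metasum.py`: the class name `MetaSummarySketch` and its methods are hypothetical, and it assumes every meta data value is hashable. Each distinct value per key is stored once, and each element only holds a small integer code pointing at it.

```python
from array import array


class MetaSummarySketch:
    """Hypothetical illustration only; not the class added by this PR."""

    def __init__(self):
        self._n = 0          # number of dictionaries summarized so far
        self._values = {}    # key -> list of distinct values seen
        self._lookup = {}    # key -> {value: integer code}
        self._codes = {}     # key -> array of per-element codes (-1 = missing)

    def __len__(self):
        return self._n

    def append(self, meta):
        for key, value in meta.items():
            if key not in self._codes:
                # New key: mark it as missing for all earlier elements.
                self._values[key] = []
                self._lookup[key] = {}
                self._codes[key] = array("l", [-1] * self._n)
            code = self._lookup[key].get(value)
            if code is None:
                # First time this value is seen for this key: store it once.
                code = len(self._values[key])
                self._values[key].append(value)
                self._lookup[key][value] = code
            self._codes[key].append(code)
        # Keys known from earlier elements but absent from this one.
        for codes in self._codes.values():
            if len(codes) == self._n:
                codes.append(-1)
        self._n += 1

    def get(self, index, key):
        code = self._codes[key][index]
        if code < 0:
            raise KeyError(key)
        return self._values[key][code]

    def constant_keys(self):
        # Present in every element with exactly one distinct value.
        return [k for k, vals in self._values.items()
                if len(vals) == 1 and -1 not in self._codes[k]]

    def varying_keys(self):
        const = set(self.constant_keys())
        return [k for k in self._codes if k not in const]


metas = [{"PatientID": "p1", "EchoTime": te, "InstanceNumber": i}
         for i, te in enumerate([10.0, 10.0, 20.0, 20.0])]
summary = MetaSummarySketch()
for meta in metas:
    summary.append(meta)

print(summary.constant_keys())   # keys that never change
print(summary.varying_keys())    # candidates for sorting into an nD array
```

The varying keys are the ones worth inspecting when deciding how to order the images; the constant ones cost one stored value plus one small code per element.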

moloney avatar Jul 09 '21 17:07 moloney

Hello @moloney, Thank you for updating!

Line 568:101: E501 line too long (102 > 100 characters)

To test for issues locally, pip install flake8 and then run flake8 nibabel.

Comment last updated at 2021-07-13 03:30:41 UTC

pep8speaks avatar Jul 09 '21 17:07 pep8speaks

Codecov Report

Merging #1030 (bf8ecfc) into master (ea68c4e) will decrease coverage by 1.21%. The diff coverage is 58.96%.

@@            Coverage Diff             @@
##           master    #1030      +/-   ##
==========================================
- Coverage   92.26%   91.04%   -1.22%     
==========================================
  Files         100      101       +1     
  Lines       12205    12668     +463     
  Branches     2136     2267     +131     
==========================================
+ Hits        11261    11534     +273     
- Misses        616      781     +165     
- Partials      328      353      +25     
Impacted Files Coverage Δ
nibabel/metasum.py 58.96% <58.96%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update ea68c4e...bf8ecfc.

codecov[bot] avatar Jul 09 '21 17:07 codecov[bot]

Merged master in to resolve conflicts and get the tests going. Let me know if you'd prefer I didn't do that.

effigies avatar Jul 12 '21 20:07 effigies

@matthew-brett / @moloney / @effigies If DICOM related functionality is moved to dicom_parser, do you still think the MetaSummary implementation will be required? I feel like we could simply cache a dictionary of lazily evaluated header values within each Series instance. The higher level Dataset class (to be implemented) can simply query those.

ZviBaratz avatar Aug 08 '21 17:08 ZviBaratz

@ZviBaratz Can you explain in more detail what you have in mind? I don't see how a cache helps to solve the problem of determining which meta data varies when someone hands us a list of DICOM files we have never seen before (that could come from multiple DICOM series).

moloney avatar Aug 09 '21 16:08 moloney

The idea is that there will be a Dataset class which will receive a root directory and iterate over its files to create representations of the series it contains. When a user queries on a particular header field, the dataset queries the headers of all the created Series instances to retrieve the value (at which point it could be saved to a cache dictionary to avoid repeating the computation). Of course some evaluation time is to be expected, but I don't think it should be too bad for up to a few dozen series. If you're working with more than that, it might be best to export the metadata to some external table anyway.
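
A rough sketch of the caching idea described here, under stated assumptions: `Series` and `Dataset` below are hypothetical stand-ins, not dicom_parser's actual classes, and the directory traversal is replaced by passing in ready-made series objects.

```python
class Series:
    """Hypothetical stand-in for a parsed series; not dicom_parser's class."""

    def __init__(self, header):
        self._header = header
        self.reads = 0  # counts header lookups so the effect of caching is visible

    def get(self, key):
        self.reads += 1
        return self._header.get(key)


class Dataset:
    """Hypothetical collection that caches per-field query results."""

    def __init__(self, series_list):
        self.series = list(series_list)
        self._cache = {}  # header field -> list of values, one per series

    def query(self, key):
        # Only touch the series headers the first time a field is requested.
        if key not in self._cache:
            self._cache[key] = [s.get(key) for s in self.series]
        return self._cache[key]


dataset = Dataset([
    Series({"Modality": "MR", "EchoTime": 10.0}),
    Series({"Modality": "MR", "EchoTime": 20.0}),
])
first = dataset.query("EchoTime")
second = dataset.query("EchoTime")  # served from the cache
print(first, [s.reads for s in dataset.series])
```

Note that this caches one value per series per field, so it speaks to repeated queries rather than to per-file meta data that varies within a series.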

ZviBaratz avatar Aug 09 '21 17:08 ZviBaratz

We really don't want to require that all the files live in a single directory. The assumption is that you are passed a list of files you have never seen before, which could be massive even for a single series (e.g. 36K files), and you want to efficiently convert them into an xarray on the fly. My original implementation in dcmstack wasn't totally naive: meta data values that were constant were only stored once. Even so, it required more than an order of magnitude more memory than this approach (18GB vs ~800MB with 36K files).
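
For illustration, the step of arranging an unordered file list into an nD layout could look roughly like the sketch below. The function `nd_layout` is hypothetical and is not taken from this PR or from dcmstack; it assumes the caller has already picked candidate keys and that their values are sortable and hashable. It checks whether those keys index the files as a complete grid with exactly one file per combination.

```python
from math import prod


def nd_layout(metas, keys):
    """Return (shape, grid) if `keys` index `metas` as a full grid, else None.

    `grid` maps a tuple of axis positions to the position of the file in `metas`.
    """
    # Sorted distinct values along each candidate axis.
    axes = [sorted({m[k] for m in metas}) for k in keys]
    shape = tuple(len(axis) for axis in axes)
    positions = [{value: i for i, value in enumerate(axis)} for axis in axes]

    grid = {}
    for file_index, meta in enumerate(metas):
        coord = tuple(pos[meta[k]] for pos, k in zip(positions, keys))
        if coord in grid:
            return None  # two files share the same key combination
        grid[coord] = file_index
    if len(grid) != prod(shape):
        return None  # some combination of values has no file
    return shape, grid


# Files arrive in arbitrary order.
metas = [
    {"EchoTime": 20.0, "SliceLocation": 2.5},
    {"EchoTime": 10.0, "SliceLocation": 0.0},
    {"EchoTime": 10.0, "SliceLocation": 5.0},
    {"EchoTime": 20.0, "SliceLocation": 0.0},
    {"EchoTime": 10.0, "SliceLocation": 2.5},
    {"EchoTime": 20.0, "SliceLocation": 5.0},
]
shape, grid = nd_layout(metas, ["EchoTime", "SliceLocation"])
print(shape)
print(nd_layout(metas, ["EchoTime"]))  # not unique on its own
```

With a compact summary of the varying keys, a check like this can run over the candidate keys without keeping every full meta data dictionary in memory.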

moloney avatar Aug 09 '21 17:08 moloney

I see. I'll be working on the issues that are already piling up in dicom_parser for the next couple of weeks. After that I'll start thinking about how this would best be integrated into dicom_parser. We could discuss it in more detail in our next meeting.

ZviBaratz avatar Aug 09 '21 17:08 ZviBaratz

If we want to support using multiprocessing to speed up the parsing of very large series, this would also provide a nice compact representation to pass around.

moloney avatar Aug 09 '21 17:08 moloney

Sorry, I lost track of this one. What's the status? Are we still trying to get this into nibabel?

effigies avatar Mar 03 '22 21:03 effigies