Combine datasets and observations

Open yngve-sk opened this issue 1 year ago • 1 comments

Issue

Resolves #7324

Main motivations:

Speeds up opening datasets for multiple realizations
Requires fewer for-loops to match up responses and observations
Makes the correspondence between responses and observations more explicit with respect to datasets

Approach

This makes the dataset schema of observations and responses more uniform. For both observation and response datasets, the dimension name corresponds to the summary keyword or gen data name. Observations are stored in one dataset, with an obs_name column holding the name of each observation.

Responses are grouped by type, i.e., summary and gen_data.
Observations are grouped by response type, i.e., summary and gen_data.
MeasuredData logic is refactored into one place.
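As a rough illustration of the grouping and matching described above, here is a minimal sketch in plain Python. It does not use the actual dataset classes from this PR. Only `obs_name`, `summary`, and `gen_data` come from the description; every other field name, key, and value (for example `response_type`, `key`, `index`, `WPR_DIFF`) is a hypothetical stand-in.

```python
from collections import defaultdict

# Hypothetical flat records standing in for the combined datasets.
responses = [
    {"response_type": "summary", "key": "FOPR", "realization": 0,
     "index": "2010-01-01", "value": 1.0},
    {"response_type": "summary", "key": "FOPR", "realization": 1,
     "index": "2010-01-01", "value": 1.2},
    {"response_type": "gen_data", "key": "WPR_DIFF", "realization": 0,
     "index": "400", "value": 5.0},
]
observations = [
    {"response_type": "summary", "key": "FOPR", "obs_name": "FOPR_OBS",
     "index": "2010-01-01", "obs": 1.1, "std": 0.1},
]


def group_by_type(records):
    """Group records by response type (summary, gen_data)."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["response_type"]].append(record)
    return dict(grouped)


def match(observations, responses):
    """Pair each observation with its responses via one shared lookup key."""
    lookup = defaultdict(list)
    for r in responses:
        lookup[(r["response_type"], r["key"], r["index"])].append(r)

    matched = []
    for o in observations:
        for r in lookup.get((o["response_type"], o["key"], o["index"]), []):
            matched.append({
                "obs_name": o["obs_name"],
                "realization": r["realization"],
                "obs": o["obs"],
                "std": o["std"],
                "value": r["value"],
            })
    return matched


grouped_responses = group_by_type(responses)
matched = match(observations, responses)
```

Because observations and responses share the same key layout, a single lookup replaces a loop over every response key for every observation.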

Known/open issues

Combined datasets use more memory but are also faster. On main, test_memory_smoothing (on my personal computer) takes 100 s and uses less than 140 MB of memory. With combined datasets it takes 8-10 s but allocates 4 GB of memory (which should be deallocated afterwards). The bottleneck is combining observations and responses, which can likely be optimized. IMHO, though, this is just relational table joining, so a database would be suitable for this specific part (if it is worth the effort to look into). It is also worth looking into chunking the datasets, dask dataframes, etc., but there seem to be some limitations.
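To make the "relational table joining" remark concrete, here is a hedged sketch using Python's built-in sqlite3 module with an in-memory database. It is not part of this PR. The table names, column names, and data are assumptions chosen for illustration, apart from obs_name.

```python
import sqlite3


def join_obs_and_responses():
    """Join hypothetical observation and response tables in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE responses (
            realization INTEGER, response_key TEXT, idx INTEGER, value REAL
        );
        CREATE TABLE observations (
            obs_name TEXT, response_key TEXT, idx INTEGER, obs REAL, std REAL
        );
        """
    )
    # Hypothetical data: two realizations of FOPR and one unrelated response.
    conn.executemany(
        "INSERT INTO responses VALUES (?, ?, ?, ?)",
        [
            (0, "FOPR", 0, 1.0),
            (0, "FOPR", 1, 2.0),
            (1, "FOPR", 0, 1.5),
            (1, "FOPR", 1, 2.5),
            (0, "WOPR", 0, 9.0),
        ],
    )
    conn.executemany(
        "INSERT INTO observations VALUES (?, ?, ?, ?, ?)",
        [("FOPR_OBS", "FOPR", 1, 2.2, 0.1)],
    )
    # The inner join keeps only responses that have a matching observation.
    rows = conn.execute(
        """
        SELECT o.obs_name, r.realization, o.obs, o.std, r.value
        FROM observations AS o
        JOIN responses AS r
          ON r.response_key = o.response_key AND r.idx = o.idx
        ORDER BY o.obs_name, r.realization
        """
    ).fetchall()
    conn.close()
    return rows


rows = join_obs_and_responses()
```

The join returns one row per (observation, realization) pair, so responses without a matching observation are never materialized. Whether this would actually reduce the memory peak described above is untested.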

Mar 21 '24 08:03 yngve-sk

Also, a general comment: this will require a bump in the storage version and a migration from version 5 to 6.

Apr 02 '24 12:04 oyvindeide