Combine datasets and observations
Issue
Resolves #7324
Main motivations:
- Speeds up opening of datasets for multiple realizations
- Makes matching up responses and observations require fewer for-loops
- States the correspondence between responses and observations more explicitly in the datasets
Approach
This makes the dataset schema of observations and responses more uniform. The dimension name will correspond to the summary keyword / gen data name for both observation and response datasets. Observations will be stored in one dataset, with an obs_name column holding the name of each observation.
- Responses are grouped by type, i.e., summary and gen_data.
- Observations are grouped by response type: summary and gen_data
- Refactor MeasuredData logic to one place
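To illustrate the grouping described above, here is a minimal sketch in plain Python rather than the actual dataset code in this PR. The input layout, the helper `combine_by_response_type`, and all column names other than obs_name are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical "before" layout: one small set of rows per observation.
per_observation = {
    "FOPR_OBS": {
        "response_type": "summary",
        "name": "FOPR",  # summary keyword
        "rows": [
            {"index": "2010-01-10", "value": 0.9, "std": 0.05},
            {"index": "2010-01-20", "value": 1.1, "std": 0.05},
        ],
    },
    "WOPR_OBS": {
        "response_type": "summary",
        "name": "WOPR:OP1",
        "rows": [{"index": "2010-01-10", "value": 0.4, "std": 0.02}],
    },
    "GEN_OBS": {
        "response_type": "gen_data",
        "name": "RFT",  # gen data name
        "rows": [
            {"index": 0, "value": 2.0, "std": 0.1},
            {"index": 1, "value": 2.5, "std": 0.1},
        ],
    },
}


def combine_by_response_type(per_observation):
    """Return one flat table per response type, tagged with obs_name."""
    combined = defaultdict(list)
    for obs_name, obs in per_observation.items():
        for row in obs["rows"]:
            combined[obs["response_type"]].append(
                {"obs_name": obs_name, "name": obs["name"], **row}
            )
    return dict(combined)


combined = combine_by_response_type(per_observation)
```

With this layout, everything for a given response type can be opened in one go, and the obs_name column keeps track of which observation each row came from.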
Known/open issues
- Combined datasets use more memory, but are also faster. On main, test_memory_smoothing (on my personal computer) takes 100 s and uses less than 140 MB of memory. With combined datasets, it takes 8-10 s but allocates 4 GB of memory (which should be deallocated afterwards). The bottleneck is combining observations and responses, which can likely be optimized. IMHO, though, this is just relational table joining, so a database would be suitable for this specific part (if it is worth the effort to look into). It is also worth looking into chunking the datasets, dask dataframes, etc., but there seem to be some limitations.
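As a rough sketch of the "relational table joining" idea mentioned above, the snippet below matches observations to responses with a single SQL join in an in-memory SQLite database. This is not what the PR implements; the table names, column names, and values are hypothetical and only serve to show the shape of the operation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE observations (
        obs_name TEXT, name TEXT, idx TEXT, value REAL, std REAL
    );
    CREATE TABLE responses (
        realization INTEGER, name TEXT, idx TEXT, value REAL
    );
    """
)

# Hypothetical summary observations: (obs_name, keyword, time, value, std).
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?, ?, ?)",
    [
        ("FOPR_OBS", "FOPR", "2010-01-10", 0.9, 0.05),
        ("WOPR_OBS", "WOPR:OP1", "2010-01-10", 0.4, 0.02),
    ],
)

# Hypothetical summary responses for two realizations.
conn.executemany(
    "INSERT INTO responses VALUES (?, ?, ?, ?)",
    [
        (0, "FOPR", "2010-01-10", 0.95),
        (0, "FOPR", "2010-01-20", 1.05),  # no observation at this time
        (0, "WOPR:OP1", "2010-01-10", 0.38),
        (1, "FOPR", "2010-01-10", 0.85),
        (1, "FOPR", "2010-01-20", 1.15),  # no observation at this time
        (1, "WOPR:OP1", "2010-01-10", 0.43),
    ],
)

# One join replaces the nested loops over observations and realizations.
matched = conn.execute(
    """
    SELECT o.obs_name, r.realization, o.value, o.std, r.value
    FROM observations AS o
    JOIN responses AS r ON r.name = o.name AND r.idx = o.idx
    ORDER BY o.obs_name, r.realization
    """
).fetchall()
conn.close()
```

Each row in `matched` pairs one observation with the response of one realization, and responses without a matching observation drop out of the inner join. Whether a database would actually reduce the memory use reported above is untested here.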
Also, as a general comment: this will require a bump in the storage version and a migration from version 5 -> 6.