tidy3d
Allow loading/downloading part of the data
Is your feature request related to a problem? Please describe. The hdf5 data is sometimes too large to fit into memory.
Describe the solution you'd like
- Be able to load only part of the hdf5 data into the Python environment (needed)
- Be able to download only part of the hdf5 data from the server (nice to have)
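As a sketch of the first request: plain h5py already supports lazy slicing, so only the sliced portion of a dataset is ever read into memory. The file and dataset names below are invented for illustration, not tidy3d's actual hdf5 layout:

```python
import h5py
import numpy as np

# Write an example file with a large dataset (names are hypothetical).
with h5py.File("sim_data_example.hdf5", "w") as f:
    f.create_dataset("/data/flux", data=np.arange(1_000_000, dtype=np.float64))

# h5py datasets are lazy: opening the file loads no array data,
# and slicing reads only the requested part from disk.
with h5py.File("sim_data_example.hdf5", "r") as f:
    dset = f["/data/flux"]    # dataset handle only, nothing in memory yet
    first_chunk = dset[:100]  # reads just 100 values into a numpy array
```

Something like this is presumably what a partial-load API would wrap.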
Note: in #535 you will be able to load an object stored within an hdf5 file (without loading everything) by supplying a path to from_hdf5(), for example:
flux_data = FluxData.from_hdf5('sim_data.hdf5', group_path='/data/3/')
Actually this is not exactly how this works; it is not exactly equivalent to the load_from_group / save_from_group in current develop. That is because current develop preserves the entire model structure in the hdf5 file, so if I save, for example, a SimulationData object that contains a MonitorData, I can load the MonitorData from its corresponding group, which has exactly the same structure/data as if I had called MonitorData.to_hdf5.
In the reorg, only the json string of the model at the top level is stored. So in the SimulationData hdf5, only the SimulationData json is available, and you cannot load the MonitorData individually. The reason I introduced the group_path kwarg is so that you can store multiple models in the same file (something we do on the backend), e.g. something like
for monitor_data in monitor_data_list:
    monitor_data.to_hdf5("my.hdf5", group_path=monitor_data.monitor.name)
In this case, you can selectively load a single one of those from the "my.hdf5" file like you say, e.g. flux_data = FluxData.from_hdf5('sim_data.hdf5', group_path='/flux_monitor/').
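The group-per-model pattern described above can be sketched with plain h5py: each model gets its own group, and a single one can be read back without touching the others. The json-in-an-attribute layout and the group names here are assumptions for illustration, not tidy3d's actual on-disk format:

```python
import json
import h5py
import numpy as np

# Hypothetical "models": a json-able header plus array data each.
models = {
    "flux_monitor": {"type": "FluxData", "values": [1.0, 2.0]},
    "field_monitor": {"type": "FieldData", "values": [3.0, 4.0]},
}

# Store each model under its own group in one file.
with h5py.File("my_example.hdf5", "w") as f:
    for name, model in models.items():
        grp = f.create_group(name)
        grp.attrs["json"] = json.dumps({"type": model["type"]})
        grp.create_dataset("values", data=np.array(model["values"]))

# Selectively load only the flux monitor's group.
with h5py.File("my_example.hdf5", "r") as f:
    grp = f["/flux_monitor"]
    flux = {
        **json.loads(grp.attrs["json"]),
        "values": grp["values"][:].tolist(),
    }
```

The field monitor's data is never read; only the requested group is opened.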
We should think about whether and how to handle this from a SimulationData file though.
Note that your test works because you're loading a FluxDataArray which has a simple from_hdf5 method that directly loads the data only (it doesn't use Tidy3dBaseModel.from_hdf5). Still, it means that without us having to do anything, the user can fairly easily load DataArrays if not whole datasets.
So I just added a test that I think illustrates what you're saying in which we try to load a MonitorData directly out of a file containing a SimulationData. It indeed failed because it tries to use the SimulationData json to load the monitor data and ends up getting the group path all wrong.
Is this illustrative of the problem you are explaining above?
I fixed the test by adding logic to dict_from_hdf5 that selects the correct model_dict from the top-level json string using the group_path.
To illustrate the steps:
pulse = GaussianPulse.from_file('source.hdf5', group_path='/source_time')
# first, grab the `json_dict` for the source at the top level
# then access `json_dict[source_time]`
# also, access the hdf5 group `f_handle[source_time]`
# proceed as normal
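The json-selection step in those comments can be sketched in pure Python: walk the group_path components down into the nested model dict parsed from the top-level json string. `select_by_group_path` is a hypothetical helper name, not tidy3d's actual function:

```python
import json

def select_by_group_path(model_dict: dict, group_path: str) -> dict:
    """Walk a '/a/b/' style path into a nested model dict."""
    sub = model_dict
    for key in group_path.strip("/").split("/"):
        if key:  # skip empty components from leading/trailing slashes
            sub = sub[key]
    return sub

# Hypothetical top-level json string stored in the hdf5 file.
top_level_json = json.dumps(
    {
        "type": "PointDipole",
        "source_time": {"type": "GaussianPulse", "freq0": 2.0e14},
    }
)

# First grab the full model dict, then select the sub-dict by path.
model_dict = json.loads(top_level_json)
pulse_dict = select_by_group_path(model_dict, "/source_time/")
```

The same path would then also be used to index into the hdf5 file handle for the array data, as in the steps above.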
See the changes in this commit
Hopefully this resolves at least part of the concern?
I think this works now yeah.
The second (optional) request is to be able to only download part of the data. I think this may eventually be coupled with the denormalizer.
Yeah, for download we probably need changes to the web API, for example?
@dbochkov-flexcompute any thoughts on this as part of the denormalizer efforts?
I guess I see these options so far:
- In addition to all the smaller bits of data used in the web UI, save the data for each monitor into a separate file, which could be downloaded if needed. However, this would increase storage usage by another ~100% on top of monitor_data.hdf5 (100%) + denormalized pieces (~100%), and would probably be used only in special situations.
- Don't add anything, and just provide an option to download data stored in denormalized pieces. However, those pieces are highly specific, say, the real part of the Ex component of a field at a specific frequency.
- Similar to 2., but download all denormalized pieces related to a specific monitor, unpack them, and merge them into a full monitor data class. Potential issues I can see here: it could be a very large number of small files to download, and unpacking/merging the pieces would have to happen on the user side.
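The unpack/merge step in option 3 might look something like the following numpy sketch, where each downloaded "piece" is one (real/imag, frequency) slice of a field component. The piece layout is invented here purely to illustrate the merging, not the denormalizer's actual format:

```python
import numpy as np

num_freqs, nx = 3, 5
rng = np.random.default_rng(0)

# Pretend each downloaded piece is the real or imaginary part of Ex
# at one frequency, keyed by (part, frequency_index).
pieces = {
    (part, f): rng.standard_normal(nx)
    for part in ("re", "im")
    for f in range(num_freqs)
}

# Merge the pieces back into one complex array of shape (num_freqs, nx).
ex = np.stack(
    [pieces[("re", f)] + 1j * pieces[("im", f)] for f in range(num_freqs)]
)
```

Even in this toy form, the concern above is visible: 2 × num_freqs small downloads per component, with the reassembly logic living on the user side.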
@xin-flex any thoughts?
For 1., the method to divide the data needs to be predefined, which lacks flexibility and takes more storage. For 2., it is probably not very useful for the user. For 3., it seems most ideal, but is it possible to do the unpacking/merging on the server side? (Downloading a large number of small files probably has performance issues, I think?)
if we can do some processing on the server side, then a simpler approach is probably to just open the non-denormalized simulation data and separately save the data for the requested monitor, something like:
mnt_data = td.SimulationData.from_file('simulation.hdf5', group_path='/data/3/')
mnt_data.to_file('mnt_data.hdf5')
# then download mnt_data.hdf5 to user and delete afterwards
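At the h5py level, that server-side extraction could be as simple as copying one group into a small standalone file and serving that. File and group names below are placeholders:

```python
import h5py
import numpy as np

# Build an example "simulation" file with several monitor groups.
with h5py.File("simulation_example.hdf5", "w") as f:
    for name in ("flux", "field", "mode"):
        f.create_dataset(f"data/{name}/values", data=np.arange(4.0))

# Server side: copy only the requested group into a new small file.
with h5py.File("simulation_example.hdf5", "r") as src, \
        h5py.File("mnt_data_example.hdf5", "w") as dst:
    src.copy("data/flux", dst)  # Group.copy duplicates the whole subtree

# The extracted file contains only the flux group.
with h5py.File("mnt_data_example.hdf5", "r") as f:
    extracted = list(f.keys())
    flux_values = f["flux/values"][:]
```

This avoids shipping the full file and keeps all merging logic off the user side, at the cost of a server-side read of the large file per request.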
agree
what's the status of this issue? are we still going to work on this? or saving for later?
I think this is still worthwhile to have, maybe not very urgent though
This is partially solved by #1249. It is still not possible to download part of the hdf5 file, but I don't know if we want to allow that. Should we close?