tidy3d
Allow loading/downloading part of the data
Is your feature request related to a problem? Please describe. The hdf5 data is sometimes too large to fit into memory.
Describe the solution you'd like
- Be able to load only part of the hdf5 data into the Python environment (needed)
- Be able to download only part of the hdf5 data from the server (nice to have)
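As a sketch of the first request: plain h5py already supports lazy slicing, so only the sliced portion of a dataset is ever read into memory. The file and dataset names below are invented for illustration, not tidy3d's actual hdf5 layout:

```python
import h5py
import numpy as np

# Write an example file with a large dataset (names are hypothetical).
with h5py.File("sim_data_example.hdf5", "w") as f:
    f.create_dataset("/data/flux", data=np.arange(1_000_000, dtype=np.float64))

# h5py datasets are lazy: opening the file loads no array data,
# and slicing reads only the requested part from disk.
with h5py.File("sim_data_example.hdf5", "r") as f:
    dset = f["/data/flux"]    # dataset handle only, nothing in memory yet
    first_chunk = dset[:100]  # reads just 100 values into a numpy array
```

Something like this is presumably what a partial-load API would wrap.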
Note: in #535 you will be able to load an object stored within an hdf5 file (without loading everything) by supplying a path to from_hdf5(), for example:
flux_data = FluxData.from_hdf5('sim_data.hdf5', group_path='/data/3/')
Actually this is not exactly how this works; it is not exactly equivalent to the load_from_group / save_from_group in current develop. That is because current develop preserves the entire model structure in the hdf5 file, so if I save, for example, a SimulationData object that contains a MonitorData, I can load the MonitorData from its corresponding group, which has exactly the same structure/data as if I had called MonitorData.to_hdf5.
In the reorg, only the json string of the model at the top level is stored. So in the SimulationData hdf5, only the SimulationData json is available, and you cannot load the MonitorData individually. The reason I introduced the group_path kwarg is so that you can store multiple models in the same file (something we do on the backend), e.g. something like
for monitor_data in monitor_data_list:
    monitor_data.to_hdf5("my.hdf5", group_path=monitor_data.monitor.name)
In this case, you can selectively load a single one of those from the "my.hdf5" file like you say, e.g. flux_data = FluxData.from_hdf5('sim_data.hdf5', group_path='/flux_monitor/').
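The group-per-model pattern described above can be sketched with plain h5py: each model gets its own group, and a single one can be read back without touching the others. The json-in-an-attribute layout and the group names here are assumptions for illustration, not tidy3d's actual on-disk format:

```python
import json
import h5py
import numpy as np

# Hypothetical "models": a json-able header plus array data each.
models = {
    "flux_monitor": {"type": "FluxData", "values": [1.0, 2.0]},
    "field_monitor": {"type": "FieldData", "values": [3.0, 4.0]},
}

# Store each model under its own group in one file.
with h5py.File("my_example.hdf5", "w") as f:
    for name, model in models.items():
        grp = f.create_group(name)
        grp.attrs["json"] = json.dumps({"type": model["type"]})
        grp.create_dataset("values", data=np.array(model["values"]))

# Selectively load only the flux monitor's group.
with h5py.File("my_example.hdf5", "r") as f:
    grp = f["/flux_monitor"]
    flux = {
        **json.loads(grp.attrs["json"]),
        "values": grp["values"][:].tolist(),
    }
```

The field monitor's data is never read; only the requested group is opened.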
We should think about whether and how to handle this from a SimulationData file though.
Note that your test works because you're loading a FluxDataArray which has a simple from_hdf5 method that directly loads the data only (it doesn't use Tidy3dBaseModel.from_hdf5). Still, it means that without us having to do anything, the user can fairly easily load DataArrays if not whole datasets.
So I just added a test that I think illustrates what you're saying in which we try to load a MonitorData directly out of a file containing a SimulationData. It indeed failed because it tries to use the SimulationData json to load the monitor data and ends up getting the group path all wrong.
Is this illustrative of the problem you are explaining above?
I fixed the test by adding logic to dict_from_hdf5 that selects the correct model_dict from the top-level json string using the group_path.
To illustrate the steps:
pulse = GaussianPulse.from_file('source.hdf5', group_path='/source_time')
# first, grab the `json_dict` for the source at the top level
# then access `json_dict[source_time]`
# also, access the hdf5 group `f_handle[source_time]`
# proceed as normal
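The json-selection step in those comments can be sketched in pure Python: walk the group_path components down into the nested model dict parsed from the top-level json string. `select_by_group_path` is a hypothetical helper name, not tidy3d's actual function:

```python
import json

def select_by_group_path(model_dict: dict, group_path: str) -> dict:
    """Walk a '/a/b/' style path into a nested model dict."""
    sub = model_dict
    for key in group_path.strip("/").split("/"):
        if key:  # skip empty components from leading/trailing slashes
            sub = sub[key]
    return sub

# Hypothetical top-level json string stored in the hdf5 file.
top_level_json = json.dumps(
    {
        "type": "PointDipole",
        "source_time": {"type": "GaussianPulse", "freq0": 2.0e14},
    }
)

# First grab the full model dict, then select the sub-dict by path.
model_dict = json.loads(top_level_json)
pulse_dict = select_by_group_path(model_dict, "/source_time/")
```

The same path would then also be used to index into the hdf5 file handle for the array data, as in the steps above.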
See the changes in this commit
Hopefully this resolves at least part of the concern?
I think this works now yeah.
The second (optional) request is to be able to only download part of the data. I think this may eventually be coupled with the denormalizer.
Yeah, for download we probably need changes to the web API, for example?
@dbochkov-flexcompute any thoughts on this as part of the denormalizer efforts?
I guess I see these options so far:
- In addition to all the smaller bits of data used in the web UI, save the data for each monitor into a separate file, which could be downloaded if needed. However, this would increase storage usage by another ~100% on top of monitor_data.hdf5 (100%) + denormalized pieces (~100%), and would probably be used only in special situations.
- Don't add anything, and just provide an option to download data stored in denormalized pieces. However, those pieces are highly specific, say, the real part of the Ex component of a field at a specific frequency.
- Similar to 2., but download all denormalized pieces related to a specific monitor, unpack them, and merge them into a full monitor data class. Potential issues I can see here: it could be a very large number of small files to download, and unpacking/merging the pieces would have to happen on the user side.
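The unpack/merge step in option 3 might look something like the following numpy sketch, where each downloaded "piece" is one (real/imag, frequency) slice of a field component. The piece layout is invented here purely to illustrate the merging, not the denormalizer's actual format:

```python
import numpy as np

num_freqs, nx = 3, 5
rng = np.random.default_rng(0)

# Pretend each downloaded piece is the real or imaginary part of Ex
# at one frequency, keyed by (part, frequency_index).
pieces = {
    (part, f): rng.standard_normal(nx)
    for part in ("re", "im")
    for f in range(num_freqs)
}

# Merge the pieces back into one complex array of shape (num_freqs, nx).
ex = np.stack(
    [pieces[("re", f)] + 1j * pieces[("im", f)] for f in range(num_freqs)]
)
```

Even in this toy form, the concern above is visible: 2 × num_freqs small downloads per component, with the reassembly logic living on the user side.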
@xin-flex any thoughts?
For 1., the method to divide the data needs to be predefined, which lacks flexibility and takes more storage. For 2., it is probably not very useful for the user. For 3., it seems most ideal, but is it possible to do the unpacking/merging on the server side? (Downloading a large number of small files probably has performance issues, I think?)
if we can do some processing on the server side, then a simpler approach is probably to just open the non-denormalized simulation data and separately save the data for the requested monitor, something like:
mnt_data = td.SimulationData.from_file('simulation.hdf5', group_path='/data/3/')
mnt_data.to_file('mnt_data.hdf5')
# then download mnt_data.hdf5 to user and delete afterwards
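At the h5py level, that server-side extraction could be as simple as copying one group into a small standalone file and serving that. File and group names below are placeholders:

```python
import h5py
import numpy as np

# Build an example "simulation" file with several monitor groups.
with h5py.File("simulation_example.hdf5", "w") as f:
    for name in ("flux", "field", "mode"):
        f.create_dataset(f"data/{name}/values", data=np.arange(4.0))

# Server side: copy only the requested group into a new small file.
with h5py.File("simulation_example.hdf5", "r") as src, \
        h5py.File("mnt_data_example.hdf5", "w") as dst:
    src.copy("data/flux", dst)  # Group.copy duplicates the whole subtree

# The extracted file contains only the flux group.
with h5py.File("mnt_data_example.hdf5", "r") as f:
    extracted = list(f.keys())
    flux_values = f["flux/values"][:]
```

This avoids shipping the full file and keeps all merging logic off the user side, at the cost of a server-side read of the large file per request.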
agree
what's the status of this issue? are we still going to work on this? or saving for later?
I think this is still worthwhile to have, maybe not very urgent though
This is partially solved by #1249. It is still not possible to download part of the hdf5 file, but I don't know if we want to allow that. Should we close?