Add metadata to tidy3d components
The following is coming from a user's feedback:
When doing a batch for parameter sweep, users need to actively keep track of the parameter values used for each simulation. If we reload the BatchData in the future, there is no record of the parameter values used, which leads to difficulties in data analysis.
It would be good to add a comment in td.Simulation so that users can store reminder information when creating a simulation. This information could be carried into the simulation data and extracted during data analysis as a reminder. For example:

```python
sim = td.Simulation(center=xxx, ..., comment="a=3, b=4")
sim_data = web.run(xxx)
print(sim_data.comment)
```
There are many alternatives but the general idea is to store some user-specified information somewhere that can be retrieved later on as a reminder.
We discussed an idea like this regarding the parameter sweep tool. I think it could be useful to allow users to append some arbitrary metadata to a Simulation (or any tidy3d component, for that matter). In xarray/pandas (probably others), this is typically called .attrs and is a dict: https://docs.xarray.dev/en/latest/generated/xarray.DataArray.attrs.html

Maybe we could follow that convention too? I guess besides naming, we should also consider whether there would be any security issues with allowing this data to be uploaded to our servers. Should we validate it in any way? Also, should we retain attrs if the object is modified (e.g. updated_copy())?
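For reference, the .attrs convention mentioned above looks like the following. This is a minimal sketch using xarray itself (not tidy3d), just to show the pattern being proposed:

```python
import xarray as xr

# xarray's convention: .attrs is a plain dict of arbitrary
# user metadata attached to the data object.
da = xr.DataArray([0.1, 0.2, 0.3], dims="x", name="transmission")
da.attrs["comment"] = "a=3, b=4"
print(da.attrs["comment"])
```

The appeal of following this convention is that users coming from xarray/pandas already know where to look for metadata.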
@tomflexcompute once the design sweep tool is released in 2.6.0rc1, you can tell the user that they can use it. It basically does exactly what they want: the inputs are associated with the outputs and there are references to the simulation data, so it does all of this accounting under the hood.
> We discussed an idea like this regarding the parameter sweep tool. I think it could be useful to allow users to append some arbitrary metadata to a Simulation (or any tidy3d component, for that matter). In xarray/pandas (probably others), this is typically called .attrs and is a dict: https://docs.xarray.dev/en/latest/generated/xarray.DataArray.attrs.html
Thanks for sharing your thoughts @tylerflex. That sounds good to me.
> Maybe we could follow that convention too? I guess besides naming, we should also consider whether there would be any security issues with allowing this data to be uploaded to our servers. Should we validate it in any way? Also, should we retain attrs if the object is modified (e.g. updated_copy())?
Why are there security concerns? Intuitively I think if a user copies a simulation, the comment should be copied as well. They need to update it if they want.
> @tomflexcompute once the design sweep tool is released in 2.6.0rc1, you can tell the user that they can use it. It basically does exactly what they want: the inputs are associated with the outputs and there are references to the simulation data, so it does all of this accounting under the hood.
Yes, that's a great feature. Maybe I missed this when reviewing the design tutorial notebook, but could you remind me if there is a convenient save/load mechanism for sweep results? I think this particular user's concern is about reusing previous batch results long after the simulations were run. Also, is a Results object an xarray? Can users slice the results similarly to the regular SimulationData?
> Why are there security concerns?
I'm mostly afraid of a malicious user putting a function in the metadata that would run some system level commands if executed. I can't think of a way this could get run accidentally, but I just want to be 100% sure before allowing any arbitrary data to be uploaded to us.
> could you remind me if there is a convenient save/load mechanism for sweep results?
Sweep Result objects can be saved/loaded with .to_file() or .from_file() like any other tidy3d objects. Also, when converted with to_dataframe(), the DataFrame contains a column for each input, and the task_id information, I think, is stored as metadata in the .attrs. This DataFrame can be saved and loaded to csv or other formats.
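As a sketch of that last step, here is a csv round trip of a sweep-like table (the column names and metadata here are made up for illustration, not the actual sweep output):

```python
import io

import pandas as pd

# Hypothetical sweep results table; one column per input plus an output.
df = pd.DataFrame({
    "radius": [0.1, 0.2, 0.3],
    "transmission": [0.91, 0.84, 0.77],
})
df.attrs["note"] = "sweep over radius"  # dict metadata, like xarray's .attrs

# Round-trip through csv (an in-memory buffer stands in for a file here).
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
loaded = pd.read_csv(buf)

# Caveat: csv stores only the table itself; the .attrs metadata is dropped,
# so a format like pickle is needed if the metadata must survive the trip.
```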
> I think this particular user's concern is regarding reusing previous batch results long after the simulations were run.
This is where a good caching feature would come in. The user could just run a "larger" batch and it would automatically load the data that has already been seen. In the current design plugin, we don't handle this automatically, but users can run two separate sweeps and use Result.combine(other) to combine them. There's a demo of this in the tutorial (redundant data points, I think, will be skipped in the combined result).
> Also, is a Results object an xarray?
It's basically a regular tidy3d component like all of the others, but it just stores:

a. the dimensions of the data (dims), basically the names of the function arguments
b. a list of all of the input arguments for each point
c. a list of all of the output values for each data point

There are some convenience methods, but the recommended thing is to just export to pandas.DataFrame with to_dataframe(), as it has a lot more functionality.
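A minimal sketch of the kind of analysis that export enables (the column names here are hypothetical, not the actual sweep schema):

```python
import pandas as pd

# Hypothetical unstructured sweep table: one row per simulation,
# with input columns and an output column.
df = pd.DataFrame({
    "wavelength": [1.50, 1.50, 1.55, 1.55],
    "radius":     [0.10, 0.20, 0.10, 0.20],
    "power":      [0.9, 0.7, 0.8, 0.6],
})

# The kind of aggregation that is much easier on a DataFrame than on
# raw lists of sweep points: mean output per input value.
mean_power = df.groupby("wavelength")["power"].mean()
```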
> Can users slice the results similarly to the regular SimulationData?
Unfortunately, not really. The problem is that xarray assumes a lot of redundant coordinate values (e.g. data on a np.meshgrid() of points), whereas with the sweep we can have independent coordinates for each point.

If the user ran a grid search (instead of a Monte Carlo), it could make some sense to convert to xarray, though. I hadn't considered that. Perhaps I can add that as a feature request for rc2 or the next release.
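For the grid-search case, pandas can already do that conversion directly. A minimal sketch with made-up inputs (column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical grid search: every (a, b) combination appears exactly once,
# so the "unstructured" table can be pivoted onto a regular grid.
df = pd.DataFrame({
    "a": [1, 1, 2, 2],
    "b": [10, 20, 10, 20],
    "result": [0.1, 0.2, 0.3, 0.4],
})

# pandas hands gridded data to xarray directly (requires xarray installed);
# the MultiIndex levels become the xarray coordinates.
ds = df.set_index(["a", "b"]).to_xarray()
value = ds["result"].sel(a=2, b=20).item()
```

This only makes sense when the points actually lie on a grid; for Monte Carlo style sweeps the pivoted array would be mostly NaN.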
Thanks for the clarification. In the tutorial, the result data is a scalar. What if the user wants to plot a spectrum or a 2D field distribution for each simulation? A pandas DataFrame doesn't handle higher-dimensional data very well, I would assume.
> In the tutorial, the result data is a scalar.
Just to clarify, the issue with xarray vs. pandas.DataFrame isn't so much about the result data type, but more about how the input coordinates are related. In xarray you would expect the same coordinates to repeat for a lot of the data points (like on a grid), whereas in pandas it's an "unstructured" dataset, so each point is effectively independent.
> What if the user wants to plot a spectrum or 2D field distribution for each simulation?
The user can still add monitors to the simulation if they want to go in later and inspect the data. It just might take a couple of steps: grabbing the task id, loading the data, and then plotting it.
> Pandas dataframe doesn't handle higher dimensional data very well I would assume.
Higher dimensional is ok, but where it's really best is for doing statistics / aggregation on this higher dimensional data and plotting it. Here are a bunch of examples of what you can do for visualization: https://pandas.pydata.org/docs/user_guide/visualization.html
In general, I think the tool as designed right now is more about letting users calculate and store the "post-processed" data; it's still not a great tool for e.g. plotting batches of regular SimulationData, if that makes sense. It would be good to improve both batch plotting and access to SimulationData in future versions.
Assigned @tylerflex just so this doesn't go hanging.
@momchil-flex I don't know if we need / want to have everything assigned. What makes more sense to me is that we spend some time every few weeks (or after every release) going through and assigning the highest-priority items to the people who make the most sense and have bandwidth for them, rather than pre-assigning every small issue. This way, there are some issues floating around unassigned that various people can pick up if they have time, rather than them getting stuck on a single person. What do you think?
Also, regarding this issue, we might want to just close it until further notice, as I think the design plugin probably covers the use case.
> @momchil-flex I don't know if we need / want to have everything assigned. What makes more sense to me is that we spend some time every few weeks (or after every release) going through and assigning the highest-priority items to the people who make the most sense and have bandwidth for them, rather than pre-assigning every small issue. This way, there are some issues floating around unassigned that various people can pick up if they have time, rather than them getting stuck on a single person. What do you think?
I think that could work if we set up a way to ensure we are regularly doing it. So far we've had cases where small things, which should be done because they're fairly fast even if not of very high importance, were left for quite a while; this is what I'm trying to mitigate by at least assigning someone.
I think the "design" plugin will not cover all the use cases, and it would still be convenient to be able to attach a small description or simple parameters to individual simulations: for design work not substantial/systematic enough to use the "design" plugin, or for manual iterations (basically tuning one or two parameters and running simulations one by one). @tylerflex
If we have metadata that can be a dict, we could put the jax_info there for adjoint simulations. That way we won't have to upload it as a separate file, which would be cleaner in several ways (including containing everything in one simulation file). Alternatively, having an optional jax_info field could do the same job, but maybe it's fine for it to live in the metadata if it's introduced in a way that allows for it?
@tylerflex I think we should bump the priority on this? It has been requested by several users.
Yea, I think we can do this for 2.7. I was also thinking it would be nice to have a metadata field for the adjoint refactor I want to do for 2.7. I can take care of it.