tidy3d icon indicating copy to clipboard operation
tidy3d copied to clipboard

Add metadata to tidy3d components

Open tomflexcompute opened this issue 1 year ago • 12 comments

The following is coming from a user's feedback:

When doing a batch for parameter sweep, users need to actively keep track of the parameter values used for each simulation. If we reload the BatchData in the future, there is no record of the parameter values used, which leads to difficulties in data analysis.

Would be good to add a comment in td.Simulation such that users can store reminder information when creating a simulation. This information can be carried into simulation data and extracted when doing data analysis to remind themselves. For example,

sim = td.Simulation(center=xxx,..., comment="a=3, b=4") sim_data = web.run(xxx) print(sim_data.comment)

There are many alternatives but the general idea is to store some user-specified information somewhere that can be retrieved later on as a reminder.

tomflexcompute avatar Jan 05 '24 03:01 tomflexcompute

We discussed an idea like this regarding the parameter sweep tool. I think it could be useful to allow users to append some arbitrary metadata to a Simulation (or any tidy3d component for that matter). In xarray/pandas (probably others), this typically is called .attrs and is a dict. https://docs.xarray.dev/en/latest/generated/xarray.DataArray.attrs.html

Maybe we could follow that convention too? I guess some besides naming, we should also consider whether there would be any security issues with allowing this data to be uploaded to our servers? Should we validate it in any way? Also, should we retain attrs if the object is modified (eg updated_copy())?

@tomflexcompute once the design sweep tool is related in 2.6.0rc1 you can tell the user that they can use it. It basically does exactly what they want (the inputs are associated with the outputs and there are references to the simulation data, it basically does all of this accounting under the hood).

tylerflex avatar Jan 05 '24 14:01 tylerflex

We discussed an idea like this regarding the parameter sweep tool. I think it could be useful to allow users to append some arbitrary metadata to a Simulation (or any tidy3d component for that matter). In xarray/pandas (probably others), this typically is called .attrs and is a dict. https://docs.xarray.dev/en/latest/generated/xarray.DataArray.attrs.html

Thanks for sharing your thoughts @tylerflex . That sounds good to me.

Maybe we could follow that convention too? I guess some besides naming, we should also consider whether there would be any security issues with allowing this data to be uploaded to our servers? Should we validate it in any way? Also, should we retain attrs if the object is modified (eg updated_copy())?

Why are there security concerns? Intuitively I think if a user copies a simulation, the comment should be copied as well. They need to update it if they want.

@tomflexcompute once the design sweep tool is related in 2.6.0rc1 you can tell the user that they can use it. It basically does exactly what they want (the inputs are associated with the outputs and there are references to the simulation data, it basically does all of this accounting under the hood).

Yes that's a great feature. Maybe I missed this when reviewing the design tutorial notebook, could you remind me if there is a convenient save/load mechanism for sweep results? I think this particular user's concern is regarding reusing previous batch results long after the simulations were run. Also is a Results object a xarray? Can users slice the results similarly to the regular SimulationData?

tomflexcompute avatar Jan 05 '24 15:01 tomflexcompute

Why are there security concerns?

I'm mostly afraid of a malicious user putting a function in the metadata that would run some system level commands if executed. I can't think of a way this could get run accidentally, but I just want to be 100% sure before allowing any arbitrary data to be uploaded to us.

could you remind me if there is a convenient save/load mechanism for sweep results?

sweep Result objects can be saved/loaded .to_file() or from_file() like any other tidy3d objects. Also, when converted to_dataframe() the DataFrame contains a column for each input and the task_id information I think is stored as metadata in the .attrs. This DataFrame can be saved and loaded to csv or other format.

I think this particular user's concern is regarding reusing previous batch results long after the simulations were run.

This is where a good caching feature would come in. The user could just run a "larger" batch and it would automatically load the data that has already been seen. In the current design plugin, we don't handle this automatically, but users can run two separate sweep and Result.combine(other) to combine them. There's a demo of this in the tutorial (redundant data points I think will be skipped in the combined result).

Also is a Results object a xarray?

It's basically a regular tidy3d component like all of the others, but just stores a. the dimensions of the data (dims), basically the names of the function arguments. b. a list of all of the input arguments for each point c. a list of all of the output values for each data point

There are some convenience methods, but the recommended thing is to just export to pandas.DataFrame with to_dataframe as it has a lot more functionality.

Can users slice the results similarly to the regular SimulationData?

Unfortunately not really, the problem is that xarray assumes a lot of redundant coordinate values (eg. data on a np.meshgrid() of points). whereas with the sweep we can have independent coordinates for each point.

If the user ran a grid search (instead of a Monte Carlo), it could make some sense to convert to xarray, though. I hadn't considered that. Perhaps I can add that as a feature request for rc2 or the next release.

tylerflex avatar Jan 05 '24 15:01 tylerflex

Thanks for the clarification. In the tutorial, the result data is a scaler. What if the user wants to plot a spectrum or 2D field distribution for each simulation? Pandas dataframe doesn't handle higher dimensional data very well I would assume.

tomflexcompute avatar Jan 05 '24 16:01 tomflexcompute

In the tutorial, the result data is a scaler.

Just to clarify, the issue with xarray vs. pandas.DataFrame isn't so much about the result data type, but more how the input coordinates are related. In xarray you would expect the same coordinates to repeat for a lot of the data points (like o a grid) whereas in pandas, it's an "unstructured" dataset, so each point is effectively independent.

What if the user wants to plot a spectrum or 2D field distribution for each simulation?

The user can still add monitors to the simulation if they want to go in later and inspect the data. It just might be a couple steps of grabbing the task id, loading the data, and then plotting it.

Pandas dataframe doesn't handle higher dimensional data very well I would assume.

Higher dimensional is ok, but where it's really best is for doing statistics / aggregation on this higher dimensional data and plotting it. Here are a bunch of examples of what you can do for visualization: https://pandas.pydata.org/docs/user_guide/visualization.html

In general, I think the tool is designed more right now to allow users to calculate and store the "post processed" data, but it's still not a great tool for eg. plotting batches of regular SimulationData. If that makes sense.

It would be good to improve both the batch plotting and the accessing of SimulationData in future versions

tylerflex avatar Jan 05 '24 17:01 tylerflex

Assigned @tylerflex just so this doesn't go hanging.

momchil-flex avatar Jan 05 '24 18:01 momchil-flex

@momchil-flex I know if we need / want to have everything assigned. What makes some sense to me is we spend some time every few weeks (or after every release) to go through and just assign out the highest priority items to the people who make the most sense and have bandwidth for them, rather than pre-assign every small issue. This way there are some issues floating around unassigned that various people can pick up if they have time rather than them getting stuck on a single person. What do you think?

tylerflex avatar Jan 05 '24 18:01 tylerflex

also, regarding this issue, we might want to just close it until further notice as I think the design plugin probably covers the use case.

tylerflex avatar Jan 05 '24 18:01 tylerflex

@momchil-flex I know if we need / want to have everything assigned. What makes some sense to me is we spend some time every few weeks (or after every release) to go through and just assign out the highest priority items to the people who make the most sense and have bandwidth for them, rather than pre-assign every small issue. This way there are some issues floating around unassigned that various people can pick up if they have time rather than them getting stuck on a single person. What do you think?

I think that could work if we set up a way to ensure we are regularly doing it. So far we've had cases where small things that should be done because they're fairly fast even if not very high importance were left for quite a while, so this is what I'm trying to mitigate by at least assigning someone.

momchil-flex avatar Jan 05 '24 18:01 momchil-flex

I think the "design" will not cover all the use cases, and it will still be convenient to be able to attach a small description or simple parameters for individual simulations, for design works not substantial/systematic enough to use "design" plugin, or for situations of manual iterations (basically tuning one or two parameters and running simulations one by one). @tylerflex

xin-flex avatar Jan 10 '24 19:01 xin-flex

If we have a metadata that can be a dict, we could put the jax_info there for adjoint simulations. That way we won't have to upload it as a separate file, which would be cleaner in several ways (including containing everything in one simulation file). Alternatively having an optional jax_info field could do the same job, but maybe it's fine to be in the metadata if it's introduced in a way that allows for it?

@tylerflex I think we should bump the priority on this? It has been requested by several users.

momchil-flex avatar Feb 22 '24 19:02 momchil-flex

yea I think we can do this for 2.7 I was also thinking it would be nice to have a metadata field for my adjoint refactor I want to do for 2.7. I can take care of it.

tylerflex avatar Feb 22 '24 19:02 tylerflex