smartnoise-sdk
Creating Metadata yml files
Hi, awesome work, thank you so much!
I was wondering if there are any helper functions to generate a basic metadata file from a CSV (or any other connection, such as a pandas DataFrame or a connected SQL dataset)?
@joshua-oss Can you please help with this?
Hi @snwagh, we don't currently have a helper function to do that, but it's a good idea.
We could expose this capability as an API to support any connection, and as a command-line utility to point at CSV files. I'll add it to the next iteration.
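For what it's worth, such a helper could start quite small. The sketch below is purely hypothetical (neither `draft_metadata` nor `infer_type` exists in smartnoise-sdk today): it guesses column types from a CSV and emits a nested dict shaped loosely like snsql metadata, leaving the numeric bounds as placeholders for the curator to fill in. The exact YAML schema should of course follow the snsql documentation.

```python
import csv
import io

def infer_type(values):
    # Hypothetical type inference from string values: try int, then
    # float, and fall back to string.
    try:
        [int(v) for v in values]
        return "int"
    except ValueError:
        pass
    try:
        [float(v) for v in values]
        return "float"
    except ValueError:
        pass
    return "string"

def draft_metadata(csv_text, collection="MyDb", table="MyTable"):
    # Hypothetical helper: build a metadata-shaped dict from CSV text.
    # Numeric bounds are left as None on purpose -- the curator must
    # replace them with safe public bounds before use.
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = {}
    for i, name in enumerate(header):
        col = [r[i] for r in data]
        spec = {"type": infer_type(col)}
        if spec["type"] in ("int", "float"):
            spec["lower"] = None
            spec["upper"] = None
        columns[name] = spec
    return {collection: {table: {"rows": len(data), **columns}}}
```

Serializing that dict with any YAML library would then produce a first-draft metadata file for the curator to review.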
Note that we would have to spend some privacy budget to generate the Metadata, since we would need to assume that the min and max values are private, and would have to estimate them in a differentially private way. We will add some documentation about how to manage that budget in cases where the bounds are public information (e.g. in PUMS, age and education have public bounds).
Thanks @joshua-oss!
I was actually thinking of an even simpler utility. In my understanding of the broader use-case, the Metadata generation need not have privacy, since it will be a command-line utility used by the "data owner", right? So the data owner will provide the dataset and the Metadata file, and an analyst can use SNSQL to run DP queries on the dataset (with the provided Metadata)?
Maybe my understanding of the Metadata class (and all it does) is limited.
You are correct that the data owner/curator will be the one to supply the metadata, and this needs to be done only once. The metadata will be visible to the analyst, so the bounds used should be considered "public". The primary challenge is that the minimum and maximum values present in the data are often not the safe public bounds. For example, it's common for people to bound an "age" column between 0 and 100, but the actual values might be between 15 and 79. As a data curator, you would probably want to use bounds like 18-80 or 0-100 or something like that, because there is a good chance that leaking either 15 or 79 could be used by an adversary to infer something about the members of your data.
Ideally, the curator should know the safe public bounds without needing to look at the data. For example, with a column like age, the database schema may already have a constraint specifying the min and max, and any input forms on the associated UX should also implement these bounds. The key here is that the actual data might be different from the hard coded bounds, and revealing the actual min and max values would compromise privacy.
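One practical consequence of declaring public bounds: if every value is clamped into those bounds before aggregation (as DP SQL engines typically do), the declared bounds are all an analyst can ever learn about the column's range, no matter what the actual extremes are. A minimal illustration:

```python
def clamp(values, lower, upper):
    # Clamp each value into the declared public bounds so the true
    # min/max in the data never influence a DP aggregate.
    return [min(max(v, lower), upper) for v in values]

ages = [15, 42, 79]              # actual data
clamped = clamp(ages, 0, 100)    # public bounds 0-100 from the metadata
```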
We could conceivably use the SQLAlchemy APIs to get at the DDL for tables to see if there are any bounds constraints, and use those. But an easier approach is to just spend a small amount of privacy budget to infer approximate bounds [1] from the data. The data curator could choose how much budget to spend (and it's a one-time cost, since you don't need to re-create the metadata every time). The data curator could of course override the approximate bounds and enter known public bounds (e.g. 18-80) as well, in which case the privacy spend for that column could be ignored.
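To give a feel for the idea (this is a simplified sketch, not the actual mechanism linked in [1]): one way to spend a little budget on approximate bounds is a noisy histogram over candidate bins, keeping only bins whose noisy count clears a threshold. Each value lands in exactly one bin, so by parallel composition the whole histogram costs a single epsilon.

```python
import random

def _laplace(scale, rng):
    # Sample Laplace(0, scale) as an exponential with a random sign.
    return rng.choice((-1.0, 1.0)) * rng.expovariate(1.0 / scale)

def dp_bounds(values, epsilon, bin_edges, seed=None):
    # Sketch of DP approximate bounds: add Laplace noise to per-bin
    # counts, treat bins whose noisy count clears a small threshold as
    # occupied, and return the range spanned by the occupied bins.
    rng = random.Random(seed)
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1
                break
    threshold = 3.0 / epsilon  # suppress bins that are probably empty
    noisy = [c + _laplace(1.0 / epsilon, rng) for c in counts]
    occupied = [i for i, c in enumerate(noisy) if c > threshold]
    if not occupied:
        return None
    return bin_edges[occupied[0]], bin_edges[occupied[-1] + 1]
```

For example, with ages actually spanning 15-79 and candidate bins at every 10 years from 0 to 100, the estimate snaps to bin edges (e.g. 10-80) rather than revealing the exact extremes. The candidate bin edges themselves must be data-independent (e.g. powers of two, or a fixed grid).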
[1] https://github.com/opendp/smartnoise-sdk/blob/synth_factory/synth/snsynth/transform/mechanism.py#L40
Thanks a lot @joshua-oss!