
Rationalize SDGym Compatibility

Open joshua-oss opened this issue 3 years ago • 1 comments

Issue #432 was masked by the fact that the train method on DPCTGAN and PATECTGAN takes an ordinal_columns and a categorical_columns parameter, but no continuous_columns. This is a confusing API, which callers could reasonably assume ignores continuous columns. We borrowed this API from sdv-dev/CTGAN (the method on sdv-dev/CTGAN is fit), which really isn't intended to be called independently. The primary way to use CTGAN in SDV is to use sdv.tabular.CTGAN, which is a wrapper around sdv-dev/CTGAN that provides a different method signature for fit. Our PytorchDPSynthesizer is the analogous wrapper, and implements a fit method with the same signature as sdv-dev/CTGAN rather than sdv.tabular.CTGAN.

The rationale for copying the method signature was to ensure compatibility with SDGym. However, it's not clear that this is being achieved. The SDGym CTGAN takes a table_metadata in the constructor, which includes all the information about which columns are categorical, continuous, and ordinal, and its fit method takes only the pandas dataframe. The interface for SDGym is to expose a fit which takes a dictionary of pandas dataframes, and a JSON-based metadata file describing types (e.g. 'categorical', 'continuous') and relationships. Our main use case involves a single table with no relationships, and we support both numpy and pandas, so the SDGym fit is both too restrictive and too flexible.

DPSDGym does have a metadata format that can be passed in with information about column types. We should rationalize the design to ensure:

  1. If a caller passes SDGym table_data or metadata to fit or fit_sample on any of our synthesizers, it should just work.
  2. Metadata from DPSDGym (different from SDGym) should also just work.
  3. Callers not using either gym should be able to pass in a single table as an np.ndarray or pandas dataframe, along with simplified metadata that is easy to construct declaratively in a method call.
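
As a rough illustration of point 3, here is a minimal sketch of what declaratively constructed metadata could look like. The `ColumnRoles` class and its `validate` method are hypothetical names, not part of the SDK. The sketch assumes every column must be given exactly one role, including continuous columns, so that nothing is silently ignored.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ColumnRoles:
    """Hypothetical simplified metadata: each column gets exactly one role."""

    categorical: List[str] = field(default_factory=list)
    ordinal: List[str] = field(default_factory=list)
    continuous: List[str] = field(default_factory=list)

    def validate(self, columns: Sequence[str]) -> None:
        """Raise ValueError unless every column has exactly one declared role."""
        declared = self.categorical + self.ordinal + self.continuous
        duplicates = sorted({c for c in declared if declared.count(c) > 1})
        if duplicates:
            raise ValueError(f"columns declared with more than one role: {duplicates}")
        unknown = sorted(set(declared) - set(columns))
        if unknown:
            raise ValueError(f"declared columns not present in the data: {unknown}")
        missing = [c for c in columns if c not in declared]
        if missing:
            # Fail loudly instead of silently dropping undeclared columns.
            raise ValueError(f"columns with no declared role: {missing}")


# Metadata built inline, as it might appear in a method call.
roles = ColumnRoles(categorical=["sex"], ordinal=["education"], continuous=["age"])
roles.validate(["age", "sex", "education"])  # passes without raising
```

With a dataframe, the `columns` argument would be `list(df.columns)`; for a bare array the caller would have to supply names or indices separately.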

joshua-oss, Nov 19 '21 05:11

Notes from discussion with @lurosenb

There are 4 main areas of potential compatibility with SDV:

  • Synthesizer Interface
  • RDT: Reversible Data Transforms for preprocessing and postprocessing
  • SDMetrics: Utility metrics for synthetic data
  • SDGym: The benchmarking orchestration layer, which runs synthesizers and tests against the available metrics

Synthesizer Interface

All of our synthesizers will expose fit and sample, patterned after BaseTabularModel in SDV. The data parameter to fit will accept a single pandas dataframe, numpy array, or string. Note that this is a departure from DPSDGym, where the fit methods take a dictionary of tables.

In BaseTabularModel, the metadata needed for transforms is passed in to the constructor (as opposed to SDGym, where it is passed in to the fit methods). In either case, the set of reversible data transforms is automatically created for the caller, based on heuristics from the metadata, with some ability for callers to override default transforms. Most of the transforms in SDV are not compatible with differential privacy, so we cannot automate the transform selection in a compatible way. We will assume that the caller handles the transform and inverse transform. That is, fit and sample will assume that all data have been transformed to the format required by the synthesizer. Callers will likely want to chain transforms, and it's beyond the scope of our synthesizers to generalize this in a metadata-driven way.

We will allow the caller to pass in a transformer loosely based on SDV's HyperTransformer to the synthesizer's constructor. This means that no column metadata is passed to the synthesizer, apart from any column metadata that might be used to initialize a transformer. We can provide some default transformers, including one initialized from some sort of metadata about columns, but these are the responsibility of the caller to initialize and pass in.
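
A minimal sketch of one possible reading of this design follows. All class names here (`IdentityTransformer`, `LabelTransformer`, `ToySynthesizer`) are hypothetical, and the synthesizer is a placeholder that is not differentially private: it simply resamples each column independently from the values it saw. The point is only the shape of the interface, where the caller builds the transformer and the synthesizer receives no column metadata of its own.

```python
import random
from typing import Any, List, Optional, Sequence

Row = List[Any]


class IdentityTransformer:
    """Hypothetical no-op transformer with fit/transform/inverse_transform."""

    def fit(self, rows: Sequence[Row]) -> "IdentityTransformer":
        return self

    def transform(self, rows: Sequence[Row]) -> List[Row]:
        return [list(r) for r in rows]

    def inverse_transform(self, rows: Sequence[Row]) -> List[Row]:
        return [list(r) for r in rows]


class LabelTransformer(IdentityTransformer):
    """Hypothetical transformer mapping one column's labels to integer codes.

    The category list is supplied by the caller (assumed public), so fit
    learns nothing from the data.
    """

    def __init__(self, column: int, categories: Sequence[str]):
        self.column = column
        self.categories = list(categories)

    def transform(self, rows: Sequence[Row]) -> List[Row]:
        out = [list(r) for r in rows]
        for r in out:
            r[self.column] = self.categories.index(r[self.column])
        return out

    def inverse_transform(self, rows: Sequence[Row]) -> List[Row]:
        out = [list(r) for r in rows]
        for r in out:
            r[self.column] = self.categories[r[self.column]]
        return out


class ToySynthesizer:
    """Placeholder synthesizer, NOT differentially private."""

    def __init__(self, transformer: Optional[IdentityTransformer] = None, seed: int = 0):
        # The transformer is the only place column information enters.
        self.transformer = transformer if transformer is not None else IdentityTransformer()
        self._rng = random.Random(seed)
        self._columns: List[List[Any]] = []

    def fit(self, rows: Sequence[Row]) -> "ToySynthesizer":
        encoded = self.transformer.fit(rows).transform(rows)
        self._columns = [list(col) for col in zip(*encoded)]
        return self

    def sample(self, n: int) -> List[Row]:
        encoded = [[self._rng.choice(col) for col in self._columns] for _ in range(n)]
        return self.transformer.inverse_transform(encoded)


rows = [["a", 1.0], ["b", 2.0], ["a", 3.0]]
synth = ToySynthesizer(LabelTransformer(column=0, categories=["a", "b"])).fit(rows)
samples = synth.sample(5)  # five rows, decoded back to the original labels
```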

All synthesizers will be accessible through a factory, and the factory can expose instances of the same class with different hyperparameter settings tuned to common scenarios.
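
One way such a factory could be sketched is a registry that maps names to a class plus default hyperparameters. The `register` and `create` functions, the preset names, and the `epsilon` and `epochs` parameters are all assumptions made for illustration, not the SDK's API.

```python
from typing import Any, Dict, Tuple


class ToySynthesizer:
    """Stand-in synthesizer with two illustrative hyperparameters."""

    def __init__(self, epsilon: float = 1.0, epochs: int = 100):
        self.epsilon = epsilon
        self.epochs = epochs


# name -> (class, default keyword arguments)
_REGISTRY: Dict[str, Tuple[type, Dict[str, Any]]] = {}


def register(name: str, cls: type, **defaults: Any) -> None:
    """Register a class under a name, optionally with preset hyperparameters."""
    _REGISTRY[name] = (cls, defaults)


def create(name: str, **overrides: Any) -> Any:
    """Build a synthesizer by name; caller overrides win over preset defaults."""
    if name not in _REGISTRY:
        raise KeyError(f"unknown synthesizer {name!r}; known: {sorted(_REGISTRY)}")
    cls, defaults = _REGISTRY[name]
    return cls(**{**defaults, **overrides})


# The same class exposed twice, with different tuned settings.
register("toy", ToySynthesizer)
register("toy_fast", ToySynthesizer, epochs=10)

fast = create("toy_fast", epsilon=0.5)
```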

Not all synthesizers will accept all column types. For example, MWEM requires numerically coded categorical columns. Synthesizers should expose the column types they permit, to assist automation of benchmarking.
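
A sketch of how permitted column types might be exposed and checked is below. The `supported_types` attribute, the type labels, and both classes are hypothetical; `MWEMLike` stands in for a synthesizer restricted to numerically coded categoricals, and the set given for `GANLike` is purely illustrative.

```python
from typing import Dict, FrozenSet, List


class MWEMLike:
    """Stand-in for a synthesizer that only accepts coded categoricals."""

    supported_types: FrozenSet[str] = frozenset({"categorical_coded"})


class GANLike:
    """Stand-in for a synthesizer with a wider (illustrative) set of types."""

    supported_types: FrozenSet[str] = frozenset({"categorical_coded", "onehot", "continuous"})


def unsupported_columns(synth_cls: type, column_types: Dict[str, str]) -> List[str]:
    """Return the columns whose type the synthesizer does not declare support for."""
    return [name for name, kind in column_types.items() if kind not in synth_cls.supported_types]


table = {"age": "continuous", "sex": "categorical_coded"}
blocked = unsupported_columns(MWEMLike, table)
```

A benchmarking loop could use a check like this to skip, or pre-transform for, synthesizers that cannot consume a given table.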

Each synthesizer may have unique hyperparameters, which are passed in as kwargs, as in BaseTabularModel.

Relational data are out of scope for our synthesizers.

RDT

Few of SDV's reversible data transforms are suitable for differential privacy, but we will use the same interface, so our differentially private reversible transforms should be compatible with SDV. We will support only a few transforms at first, though it would be ideal to make DP versions of many of the RDT transforms.

  • DP standard scaler (from diffprivlib)
  • Categorical to one-hot
  • Non-DP standard scaler if public min and max are available
  • DP-safe binning

In the context of DP, the fit method of a transform may consume privacy budget each time it is called, and callers who are testing multiple synthesizers may want to share a single fit across them. We need to have a privacy accountant on the RDTs, and the synthesizer constructor should allow the caller to specify whether a transform has already been fit and should be re-used, or whether it should spend budget.
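
The budget behavior described above can be sketched roughly as follows. Everything here is hypothetical (`BudgetAccountant`, `NoisyCenterer`, the `reuse_fit` flag), and the noisy mean is an illustrative Laplace mechanism with caller-supplied public bounds and the row count treated as public. It is not a vetted DP implementation.

```python
import random
from typing import List, Optional, Sequence


class BudgetAccountant:
    """Hypothetical accountant tracking epsilon under basic composition."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> None:
        if self.spent + epsilon > self.total:
            raise ValueError("privacy budget exceeded")
        self.spent += epsilon


class NoisyCenterer:
    """Hypothetical transform that centers values on a Laplace-noised mean."""

    def __init__(self, lower: float, upper: float, epsilon: float,
                 accountant: BudgetAccountant, seed: Optional[int] = None):
        self.lower, self.upper, self.epsilon = lower, upper, epsilon
        self.accountant = accountant
        self._rng = random.Random(seed)
        self.mean: Optional[float] = None

    @property
    def is_fit(self) -> bool:
        return self.mean is not None

    def fit(self, values: Sequence[float]) -> "NoisyCenterer":
        self.accountant.spend(self.epsilon)  # every call to fit spends budget
        clipped = [min(max(v, self.lower), self.upper) for v in values]
        # Sensitivity of a clipped mean is (upper - lower) / n.
        scale = (self.upper - self.lower) / (len(clipped) * self.epsilon)
        # Difference of two exponentials is a Laplace(0, scale) draw.
        noise = self._rng.expovariate(1 / scale) - self._rng.expovariate(1 / scale)
        self.mean = sum(clipped) / len(clipped) + noise
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        return [v - self.mean for v in values]

    def inverse_transform(self, values: Sequence[float]) -> List[float]:
        return [v + self.mean for v in values]


class ToySynthesizer:
    """Placeholder showing only the re-use decision, not a real synthesizer."""

    def __init__(self, transformer: NoisyCenterer, reuse_fit: bool = False):
        self.transformer = transformer
        self.reuse_fit = reuse_fit

    def fit(self, values: Sequence[float]) -> "ToySynthesizer":
        if not (self.reuse_fit and self.transformer.is_fit):
            self.transformer.fit(values)  # spends budget
        self.encoded = self.transformer.transform(values)
        return self


accountant = BudgetAccountant(total_epsilon=1.0)
centerer = NoisyCenterer(0.0, 100.0, epsilon=0.5, accountant=accountant, seed=1)
data = [23.0, 35.0, 61.0, 47.0]
ToySynthesizer(centerer).fit(data)                  # fits the transform, spends 0.5
ToySynthesizer(centerer, reuse_fit=True).fit(data)  # shares the fit, spends nothing
```

Here two synthesizers share one fitted transform, so only a single epsilon of 0.5 is charged against the accountant.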

Note that RDT has special columns such as identifier, faker, or label, and also supports constraints for rejection sampling of sampled rows. We should try to support these in a differentially private manner.

SDMetrics

All of the metrics exposed in SDMetrics are useful for assessing differentially private synthetic data, so we will aim for exact compatibility. We will have code samples showing how to use SDMetrics against output from our synthesizers, and our extra metrics, such as PMSE, will be compatible. Ideally, our additional metrics would be contributed back to SDMetrics.

SDGym

Integration with SDGym's benchmarking runner could be very difficult to achieve, since we can't achieve exact compatibility with BaseTabularModel or SingleTableBaseline. If we get close enough in our BaseTabularModel analogue, it might be possible for someone to create a wrapper for SDGym's SingleTableBaseline, but they would need to take extra precautions to prevent the non-DP transforms from being used.

At the same time, we don't want to create a parallel benchmarking product. We will provide code samples for automating synthesizer comparison, and decide later how to integrate with any more developed third-party benchmarking suites.

joshua-oss, Dec 23 '21 00:12

Fixed in #490

joshua-oss, Oct 09 '22 00:10