smartnoise-sdk
Synthesizer Factory and Interface
Thread for synth factory design brainstorm and discussion.
Goals
The current design requires callers to know the concrete synthesizer class name, import the appropriate classes, and in some cases call with a wrapper. We would like to simplify this so that any synthesizer can be created from the same basic imports. The design should support easy addition of new synthesizers.
Creation
Some options described by idealized code samples:
- Synthesizer with default hyperparameters
from snsynth import Synthesizer
synth = Synthesizer.create("mwem")
synth.fit(df)
data = synth.sample(50)
- Override a hyperparameter
from snsynth import Synthesizer
synth = Synthesizer.create("patectgan", batch_size=50)
synth.fit(df)
data = synth.sample(50)
- We could allow multiple factory entries for the same concrete synthesizer, using different hyperparameters. For example, a PATECTGAN instance with hyperparameters optimized for smaller datasets might be created like:
from snsynth import Synthesizer
synth = Synthesizer.create("patectgan_small")
synth.fit(df)
data = synth.sample(50)
A drawback of using strings to switch is that callers will need to refer to the documentation to know what synthesizer configurations are available. This could be mitigated by providing introspection on the factory, allowing enumeration and description of the available synthesizers:
from snsynth import Synthesizer
for synth_name, description in Synthesizer.synthesizers:
    print(f"{synth_name}: {description}")
The synthesizer list would be visible from the debug watch in the IDE, but would not populate options automatically via intellisense.
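One minimal way to back that enumeration (a sketch only; the list shape and the entries shown are assumptions, not a settled design):
import pprint

# Sketch: a list of (name, description) pairs maintained on the base class,
# which concrete synthesizers would extend as they are added.
class Synthesizer:
    synthesizers = [
        ("mwem", "MWEM with default hyperparameters"),
        ("patectgan", "PATE-CTGAN with default hyperparameters"),
    ]

pprint.pprint(Synthesizer.synthesizers)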
Alternatively, we could switch based on an enum:
from snsynth import Synthesizer, Synth
synth = Synthesizer.create(Synth.mwem)
synth.fit(df)
data = synth.sample(50)
The allowed enum values will populate in the IDE automatically, self-documenting via intellisense and reducing the odds that the caller asks for a synthesizer that doesn't exist. However, this adds an extra import, and feels repetitive in the code.
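If we went that route, the enum itself might look like this (a minimal sketch; member names beyond mwem are assumptions based on the synthesizers mentioned in this thread):
from enum import Enum

class Synth(Enum):
    mwem = "mwem"
    patectgan = "patectgan"
    patectgan_small = "patectgan_small"

# create() could accept the member and use its value as the factory key.
key = Synth.mwem.value  # "mwem"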
Is there a better way to key the factory?
Interface
In the factory examples above, the factory method is a static method on the base class that also defines the interface for all synthesizers. All concrete instances of the Synthesizer base class should have at least two methods and a property:
- fit() -- learns the distribution of a non-private dataset
- sample() -- draws synthesized records based on the learned distribution
- expects -- a property on the synthesizer that describes what type of inputs the synthesizer expects or requires. For example, GAN synthesizers need categories to be one-hot encoded, and can support continuous values. SMT expects all data to be categorical, and expects categories to be encoded as integers.
The format (and name) of expects is out of scope for this discussion, and will be covered elsewhere. The interface of sample() takes an optional number of rows, and ideally nothing else. The interface for fit() requires some discussion.
At a minimum, fit needs to accept a dataset. We will accept a few different types of datasets (pandas, numpy, list of tuples), and all types of datasets should be supported by all synthesizers. Concrete implementations can be designed to support just one or two canonical dataset types, and the code in the base class (accessed by super()) can handle the appropriate conversion.
Note that this is only referring to the type of dataset (e.g. pandas vs. numpy array). For example, suppose that the caller has a pandas dataframe where all data are integer-coded categorical, as expected by MWEM, but MWEM operates on a numpy array. The caller could manually convert to numpy first, or could simply pass in the pandas dataframe like:
from snsynth import Synthesizer
synth = Synthesizer.create("mwem")
synth.fit(df)
MWEM already handles this scenario, but the goal here is that the concrete implementation for the necessary conversion exists on the Synthesizer base class rather than on MWEM, so it can be shared across all synthesizers. Developers should be able to add a new concrete synthesizer by specifying which input dataset type is expected, and trust that the appropriate dataset type will always be available in the call to fit(), regardless of what the caller passed in.
Presumably, the caller will want the synthesized rows (from sample()) to be of the same dataset type. So the fit() method needs to stash some state recording what the input dataset type was, and the sample() method would then implement the conversion if necessary, preferably delegating the conversion to the base class again.
Is this really what the caller expects? Do we need a way to override on sample (e.g. the caller passes pandas to fit, and wants numpy from sample)?
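Whatever the answer, one possible shape for the round-trip logic on the base class (a sketch only; _fit_impl and _sample_impl are hypothetical hooks, and numpy is assumed as the canonical type):
import numpy as np
import pandas as pd

class Synthesizer:
    def fit(self, data):
        # Stash the caller's dataset type so sample() can convert back.
        self._input_was_pandas = isinstance(data, pd.DataFrame)
        self._columns = list(data.columns) if self._input_was_pandas else None
        canonical = data.to_numpy() if self._input_was_pandas else np.asarray(data)
        self._fit_impl(canonical)  # provided by the concrete synthesizer

    def sample(self, n_rows):
        rows = self._sample_impl(n_rows)  # provided by the concrete synthesizer
        # Return the same dataset type the caller passed to fit().
        if self._input_was_pandas:
            return pd.DataFrame(rows, columns=self._columns)
        return rows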
Column Transforms
In addition to transparently handling different dataset types (e.g. numpy, pandas), we want to transparently handle different input transforms. For example, automatically binning continuous values for an SMT-style synthesizer, or one-hot encoding categorical data for a GAN synthesizer. We will discuss the design of the transformers with idealized code samples in another issue. For purposes of the Synthesizer interface discussion, there are some key questions:
- There needs to be glue in the synthesizer's fit method which first transforms the input data, passes the transformed data through the fit call, and stashes any state necessary to reverse the transform when sampling. How much of this glue code can be delegated to the base class? Ideally, no concrete synthesizer implementation would need to know anything about transforms.
- How much of the transform glue code can be automatically inferred, so the caller doesn't need to specify any transforms?
- In cases where the caller needs to provide hints about the transform (e.g. provide a public min/max for a non-DP standard scaler, or provide an epsilon to be used when creating a DP standard scaler), how does the caller pass in the required information?
The interface for transforms (discussed in separate issue) will provide some declarative way to construct concrete transforms, so we could imagine something like this:
from snsynth import Synthesizer, RangeTransform
synth = Synthesizer.create("mwem")
synth.fit(df, transforms=[RangeTransform(low=0, high=99, bins=10), RangeTransform(epsilon=0.1, bins=30)])
This could also take a single TableTransformer, or could be a dictionary keyed by column name or number, or some other design. It could even allow a declarative format like YAML. We will probably support more than one syntax, with a goal to make this as easy as possible. The point of this example isn't to suggest a syntax, but to indicate that the caller can provide the necessary hints for the transforms. Assuming that we have some way to tell the fit() method how to transform, the appropriate transforms should be handled automatically by implementation code on the Synthesizer base class.
Some transforms will spend privacy budget to estimate parameters derived from the data. For example, a min/max transform, when public bounds are not known, will need to use privacy budget to estimate the min and max. These estimated values need to be stashed in the synthesizer so that they can be used when reversing the data in sample().
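As a toy illustration of that stashing (the Laplace noise here is a placeholder, not a correct DP bounds estimator; attribute and method names are assumptions):
import numpy as np

class MinMaxTransform:
    def __init__(self, epsilon=None, low=None, high=None):
        self.epsilon = epsilon  # budget to spend if bounds are not public
        self.low, self.high = low, high

    def fit_transform(self, column):
        if self.low is None or self.high is None:
            # Spend epsilon estimating bounds; stash them for the reverse pass.
            self.low = column.min() + np.random.laplace(scale=1.0 / self.epsilon)
            self.high = column.max() + np.random.laplace(scale=1.0 / self.epsilon)
        return (column - self.low) / (self.high - self.low)

    def inverse(self, column):
        # Reverses using the stashed (possibly DP-estimated) bounds.
        return column * (self.high - self.low) + self.low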
As described above, each subsequent call to fit() (perhaps processing multiple disjoint partitions of a large dataset) will spend new privacy budget to estimate parameters such as min/max again. This seems like the right choice, but there may be important scenarios where someone would want the DP estimate to be reused across all subsequent calls to fit().
This is awesome, Joshua!
- Here's another way we might want to architect the factory. I think this is nice because it opens up two patterns without extra explicit imports for the user. I have a sample implementation of the class structure below:
class Synthesizer:
    def __init__(self):
        pass

    @classmethod
    def info(cls):
        raise NotImplementedError

    @staticmethod
    def all_synths():
        return Synthesizer.__subclasses__()

    @classmethod
    def create(cls, synth='SynthesizerA', args=None):
        # Accept either a string key or a Synthesizer subclass.
        if isinstance(synth, str):
            return _SYNTHS_MAPPING[synth].create()
        elif synth in Synthesizer.__subclasses__():
            return synth.create()
        else:
            raise ValueError('The "synth" argument must be a supported '
                             'synthesizer object or string class name')

class SynthesizerA(Synthesizer):
    @classmethod
    def info(cls):
        return 'This is a SynthesizerA'

    @classmethod
    def create(cls):
        pass

class SynthesizerB(Synthesizer):
    @classmethod
    def info(cls):
        return 'This is a SynthesizerB'

    @classmethod
    def create(cls):
        pass

SYNTHS = [SynthesizerA, SynthesizerB]
_SYNTHS_MAPPING = {
    'SynthesizerA': SynthesizerA,
    'SynthesizerB': SynthesizerB,
}
Which gives the following dual options to users. Kind of a nice middle ground, without requiring extra imports!
# user options one-a, one-b and one-c: create default, create with string,
# or create directly if the class name is known
import snsynth
from snsynth import Synthesizer
Synthesizer.create()
Synthesizer.create('SynthesizerA')
Synthesizer.create(snsynth.SynthesizerA)

# user option two: examine potential synths, create one on the spot
import snsynth
for s in snsynth.SYNTHS:
    print(s.info())
snsynth.SynthesizerA.create()
Would love your thoughts on this pattern... I think it is a nice middle ground.
- I completely agree with the breakdown of requirements, and with moving the automatic type conversion up to the Synthesizer base class (as opposed to leaving it to live on MWEM or another synthesizer).
I think the behaviour should be pandas in -> pandas out, numpy in -> numpy out. No need to add an override; users can handle any further conversions on their own : ) (and I think this is what they would expect)
- For each of the points/questions you raised:
- Different synthesizers will require different default transforms. This code can all live in the Synthesizer factory subclasses, if that's the route we decide to take with the architecture. I agree that we should try to move essentially all transforms outside of the actual method classes, which we can do, although we may still need synthesizer-specific patterns, unless we can think of some sort of clever logic?
- Perhaps we should have a default transform pattern for each synthesizer (which we can represent as just an ordered list of transforms, as you suggested), and then the caller can replace that pattern with their own list if they desire. If they do that, we can raise a warning that custom lists of transforms need to be carefully checked for privacy leaks, or something like that...
- This is a great question. I like the pattern you suggested here, although perhaps we should instantiate the actual classes within the factory, so that we can automatically check for privacy leaks (by summing any epsilon specified for transforms like min/max that consume budget). Thus the transform pattern is an ordered list of tuples, like [(Transform1, args={}), (Transform2, args={})...]. I'm not sure
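A tiny sketch of the budget check that instantiating within the factory would enable (hypothetical names; assumes epsilon, when present, lives in each tuple's args):
class Transform1: pass
class Transform2: pass

pattern = [(Transform1, {"epsilon": 0.1}), (Transform2, {})]
total_epsilon = sum(args.get("epsilon", 0.0) for _, args in pattern)
print(total_epsilon)  # 0.1 -- budget the transform pattern would consume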
That's essentially how I was thinking of the factory. For example, suppose we have MWEMSynthesizer(Synthesizer); the following four lines of code should all produce the same object:
synth = Synthesizer.create("mwem", splits=4)
synth = MWEMSynthesizer(splits=4)
synth = Synthesizer.create(MWEMSynthesizer, splits=4)
synth = Synthesizer.create(MWEMSynthesizer(splits=4))
The first example would be the mainstream example, while the second is what would need to be implemented by the person writing the synthesizer. The final two examples would be "for free" because of boilerplate in the factory. The advantage of steering people to call the first method is that we could have overrides such that (e.g.) these two lines of code would be equivalent:
synth = Synthesizer.create("mwem_hi")
synth = MWEMSynthesizer(splits=<setting for high dimensions>)
Note that these examples keep the string key short and simple, since "Synthesizer" is an implementation detail and redundant. This is easy enough to handle in the skeleton implementation you posted. I think I would prefer the value part of the key->value in _SYNTHS_MAPPING to be a string rather than a reference to the actual class, so the factory can be used without importing all of the synthesizers first. However, the string could be the actual class name, so the class could be imported and instantiated as needed via reflection. Basically, it would work the same, but would only import as needed.
The rationale for this is that production use will typically involve only one or two synthesizers. For example, a service will typically either always use PyTorch or never use PyTorch. Another service might use only a JAX-based synthesizer. Someone using only MWEM should not be required to have either PyTorch or JAX installed, and so on. To partially mitigate this, we should require that concrete implementations wait until instantiation to import any heavy dependencies that ought to be optional. If we could trust concrete implementations to import lazily, then importing all of the classes up front seems doable, but it still means the whole thing blows up if something goes wrong in the load of a synthesizer the analyst doesn't intend to use.
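A minimal sketch of that string-valued mapping with reflection (the module paths and the mwem_hi defaults are assumptions for illustration):
import importlib

# key -> (module path, class name, default hyperparameters), all strings and
# plain data, so nothing heavy is imported until a synthesizer is requested.
_SYNTHS_MAPPING = {
    "mwem": ("snsynth.mwem", "MWEMSynthesizer", {}),
    "mwem_hi": ("snsynth.mwem", "MWEMSynthesizer", {"splits": 8}),
    "patectgan": ("snsynth.pytorch.nn", "PATECTGAN", {}),
}

def create(synth, **kwargs):
    module_name, class_name, defaults = _SYNTHS_MAPPING[synth]
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**{**defaults, **kwargs})  # caller kwargs override defaults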
(Of course, we would also need to make some decisions about the dependencies in setup.py, like how many of these heavy dependencies we want to make optional/external. But the goal here is to make sure the factory can handle whatever we decide, since we are already in a situation where not all of the requirements can be satisfied in some reasonable production environments.)
Regarding ii, "we should have a default transform pattern for each synthesizer (which we can represent as just an ordered list of transforms)": note that the list of transforms in the second example was meant to have the same size as the number of columns, so the default transforms depend on both the input data and the synthesizer's expectations. So, when fit() gets called without any transforms, the fit method (preferably by calling super() with some metadata about what the synthesizer expects) would construct the appropriate default transforms list. Just clarifying that each synthesizer would need some way of saying "I need one-hot", or "I need categorical with max total dimensionality 1000", and so on. Then default transforms could be constructed by looking at the input dataset and figuring out how to change it to what the synthesizer wants.
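A rough sketch of that inference step (the expects values and the transform markers are placeholders, not a proposed API):
import pandas as pd

def default_transforms(df, expects):
    # Build one transform per column from the data and the synthesizer's
    # declared expectation. Placeholder logic for illustration only.
    transforms = []
    for col in df.columns:
        if expects == "onehot" and df[col].dtype == object:
            transforms.append(("onehot", col))      # GAN-style synthesizer
        elif expects == "categorical" and df[col].dtype.kind == "f":
            transforms.append(("bin", col))         # SMT-style: bin continuous
        else:
            transforms.append(("identity", col))
    return transforms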
This is now implemented in v0.3.0.