SDGym rfc: replace synthesizer function with class

Description

SDGym's synthesizers all inherit from the Baseline class (or BaseSynthesizer class in previous versions). Users can provide custom synthesizer functions. The convenience inheritance is demonstrated throughout SDGym's code base and has all sort of other benefits. My suggestion would be to make the following changes:

All synthesizers should inherit from a synthesizer base class (Baseline)
All synthesizers should implement a separate fit and sample method

These changes provide consistency between SDGym's native and user-provided synthesizers and clear distinction between fit and sample logic, at nearly no cost:

def synthesizer_function(real_data: dict[str, pandas.DataFrame],
                         metadata: sdv.Metadata) -> real_data: dict[str, pandas.DataFrame]:
    ...
    # do all necessary steps to learn from the real data
    # and produce new synthetic data that resembles it
    ...
    return synthetic_data

will become

from sdgym.synthesizers.base import Baseline


class MySynthesizer(Baseline):
    def fit(self, real_data: dict[str, pandas.DataFrame], metadata: sdv.Metadata) -> None:
        # ...
        # do all necessary steps to learn from the real data
        # ...
    
    def sample(self, n_samples: int) -> dict[str, pandas.DataFrame]:
        # and produce new synthetic data that resembles it
        return synthetic_data

More interestingly, this structure allows for capturing valuable metrics that are currently out of reach related to fit/sampling time and complexity (time measurements or maybe even this package). SDGym would this way be able to benchmark this aspect of a synthesizer as well, which can be an important decision criterion for which synthesizer is best for a given use case: if the user expects to sample large quantities of data then a longer fitting time would be acceptable at a lower sampling complexity.

The code that needs to be changed for this is minimal, however I wanted to make sure you see value in this point before drafting a PR.

Jun 01 '21 14:06 sbrugman

@csala Any thoughts on this proposal?

Jun 30 '21 10:06 sbrugman

@leix28 @katxiao would a PR be welcome?

Nov 03 '21 12:11 sbrugman

Hi @sbrugman I think that for now the exact change that you are proposing is not within the current SDGym roadmap, but some variation is:

My suggestion would be to make the following changes:

All synthesizers should inherit from a synthesizer base class (Baseline)

All synthesizers should implement a separate fit and sample method

To add some more context to it, the reason for which the required input is a function instead of a class is that wrapping a class-based synthesizer that follows the fit/sample abstraction within a single function that receives real data, runs fit/sample internally and returns a synthetic clone is far easier than the opposite approach of trying to adapt a functional based synthesizer into this fit/sample abstraction. Also, the current implementation of SDGym already supports class-based synthesizers that inherit from the Baseline class, so making this a hard requirement does not really expand the support, it only narrows it.

On the other hand, it is true that this support for class-based synthesizer is not really documented, so adding it to the docs would be interesting!

More interestingly, this structure allows for capturing valuable metrics that are currently out of reach related to fit/sampling time and complexity (time measurements or maybe even this package). SDGym would this way be able to benchmark this aspect of a synthesizer as well, which can be an important decision criterion for which synthesizer is best for a given use case: if the user expects to sample large quantities of data then a longer fitting time would be acceptable at a lower sampling complexity.

This is another story, and it could actually be interesting too! An option that would be acceptable without sacrificing the functional input, would be to modify the code to capture such metrics only when the provided synthesizer is a Baseline subclass. We could make it so that model_time stays the same and is always reported for all synthesizer, but for Baseline subclasses two new columns, fit_time and sample_time, are added to the output table.

Nov 03 '21 13:11 csala

Hi Carles, thanks for getting back at this. The clarification on why you would not like to impose the fit/sample abstraction on users is helpful. The backwards-compatibility argument also makes sense. Good to know that we can move forward on by documenting the class-based synthesizers and with the conditional extension of the benchmark with metrics on whether the implementation is Baseline-based or otherwise.

Nov 03 '21 15:11 sbrugman