Better data validation message for `auto_assign_transformers`

Open • npatki opened this issue 2 years ago • 1 comment

Problem Description

If I call `auto_assign_transformers` with invalid data*, I receive an error that does not describe the actual problem.

*Invalid data is any data that does not match the metadata.

from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata
import numpy as np
import pandas as pd

metadata = SingleTableMetadata.load_from_dict({
    'columns': {
        'a': { 'sdtype': 'categorical' },
    }
})

synthesizer = GaussianCopulaSynthesizer(metadata)

# input data that does not match the metadata
data = pd.DataFrame({'b': list(np.random.choice(['M', 'F'], size=10)) })
synthesizer.auto_assign_transformers(data)

Output:

AttributeError: 'NoneType' object has no attribute 'get'

Expected behavior

I expect an error that describes the problem more clearly. We should reuse the error message that `fit` raises on invalid data.

synthesizer.fit(data)
InvalidDataError: The provided data does not match the metadata:
The columns ['b'] are not present in the metadata.

The metadata columns ['a'] are not present in the data.

Additional context

It appears that `fit` (and `fit_processed_data`) run a validation check between the data and the metadata, while the `auto_assign_transformers` method does NOT run that check.

Should we run the check in this method? If so, maybe the fit functions don't need it (since they internally call this method first).
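To make the proposal concrete, here is a minimal sketch of the kind of column check that could run before transformers are assigned. It is not SDV's internal code: `validate_columns` and the local `InvalidDataError` class are hypothetical stand-ins written for illustration, and the message text simply mirrors the `fit` error quoted above.

```python
class InvalidDataError(Exception):
    """Hypothetical stand-in for the error that fit raises; not imported from SDV."""


def validate_columns(data_columns, metadata_columns):
    """Raise InvalidDataError if the data and metadata column names differ.

    Hypothetical helper: both arguments are plain iterables of column names.
    """
    data_columns = list(data_columns)
    metadata_columns = list(metadata_columns)

    # Columns that appear in the data but are unknown to the metadata.
    extra = [col for col in data_columns if col not in metadata_columns]
    # Columns that the metadata declares but the data does not contain.
    missing = [col for col in metadata_columns if col not in data_columns]

    errors = []
    if extra:
        errors.append(f'The columns {extra} are not present in the metadata.')
    if missing:
        errors.append(f'The metadata columns {missing} are not present in the data.')

    if errors:
        raise InvalidDataError(
            'The provided data does not match the metadata:\n' + '\n\n'.join(errors)
        )


# Same mismatch as in the example above: data has 'b', metadata has 'a'.
try:
    validate_columns(['b'], ['a'])
except InvalidDataError as error:
    print(error)
```

If a check like this ran at the start of `auto_assign_transformers`, the mismatch would presumably surface as a readable message rather than the `AttributeError`. Whether the `fit` functions could then drop their own check depends on how they call this method, which I have not verified.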

npatki • Jul 18 '23 22:07

Potentially related to https://github.com/sdv-dev/SDV/issues/1883

srinify • Apr 02 '24 15:04