SDV
Better data validation message for `auto_assign_transformers`
Problem Description
If I use the auto_assign_transformers functionality with invalid data*, then I receive an error that doesn't really make sense.
*Invalid data is any data that does not match the metadata
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata
import numpy as np
import pandas as pd
metadata = SingleTableMetadata.load_from_dict({
    'columns': {
        'a': { 'sdtype': 'categorical' },
    }
})
synthesizer = GaussianCopulaSynthesizer(metadata)
# input data that does not match the metadata
data = pd.DataFrame({'b': list(np.random.choice(['M', 'F'], size=10)) })
synthesizer.auto_assign_transformers(data)
Output:
AttributeError: 'NoneType' object has no attribute 'get'
Expected behavior
I expect an error that describes the problem more clearly. We should reuse the error message that `fit` raises on invalid data.
synthesizer.fit(data)
InvalidDataError: The provided data does not match the metadata:
The columns ['b'] are not present in the metadata.
The metadata columns ['a'] are not present in the data.
Additional context
It appears that fit (and fit_processed_data) are actually running a validation check between the data and metadata. It seems that the auto_assign_transformers method is NOT running the check.
Should we run the check in this method? If so, maybe the fit functions don't need it (since they internally call this method first).
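To make the proposal concrete, here is a minimal sketch of the kind of column check that could run before transformers are assigned. It is not SDV's actual implementation: `check_columns_match` is a hypothetical helper, and `InvalidDataError` is defined locally as a stand-in rather than imported from `sdv`. The message text is copied from the `fit` output above.

```python
class InvalidDataError(Exception):
    """Local stand-in for SDV's error class (hypothetical, not imported from sdv)."""


def check_columns_match(data_columns, metadata_columns):
    """Hypothetical helper: raise a descriptive error when the column sets differ."""
    data_columns = list(data_columns)
    metadata_columns = list(metadata_columns)

    # Columns in the data that the metadata does not describe.
    extra = [c for c in data_columns if c not in metadata_columns]
    # Columns the metadata describes that the data lacks.
    missing = [c for c in metadata_columns if c not in data_columns]

    if not extra and not missing:
        return

    lines = ['The provided data does not match the metadata:']
    if extra:
        lines.append(f'The columns {extra} are not present in the metadata.')
    if missing:
        lines.append(f'The metadata columns {missing} are not present in the data.')
    raise InvalidDataError('\n'.join(lines))


# Mirrors the example in this issue: the metadata declares 'a', the data contains 'b'.
try:
    check_columns_match(data_columns=['b'], metadata_columns=['a'])
except InvalidDataError as error:
    message = str(error)
    print(message)
```

If a check like this ran at the start of `auto_assign_transformers`, the mismatch would surface as a readable message instead of the `AttributeError`. Whether the `fit` functions could then drop their own check depends on their internal call order, which this sketch does not verify.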
Potentially related to https://github.com/sdv-dev/SDV/issues/1883