SDV
[Enterprise Usage] Cannot pass in domestic phone numbers
Enterprise users have access to additional RDTs that offer them features such as contextual anonymization.
Problem Description
For phone_number data, the SDV automatically assigns the AnonymizedGeoExtractor, which can parse phone numbers. Domestic phone numbers do not include an international country code, so the transformer expects the default_country parameter to be provided in this case.
For example, I may have (617) 253-3400, which is a US domestic phone number. The transformer then expects US as the default country.
What happens today
Today the data processor assigns the transformer without a default_country, so fitting raises an error if I pass in domestic phone numbers.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

data = pd.DataFrame(data={
    'id': [0, 1],
    'age': [29, 45],
    'domestic_numbers': ['(617) 253-3400', '(617) 495-1000']
})

metadata = SingleTableMetadata.load_from_dict({
    'primary_key': 'id',
    'columns': {
        'id': { 'sdtype': 'id' },
        'age': { 'sdtype': 'numerical' },
        'domestic_numbers': { 'sdtype': 'phone_number' }
    }
})

synth = GaussianCopulaSynthesizer(metadata, locales=['en_US'])
synth.fit(data)
Output:
ValueError: Phone number (617) 253-3400 is represented in national format. Please provide ``default_country`` for nationally represented numbers when creating the transformer instance.
Expected behavior
If there is a single locale provided (in the locales parameter), then the data processor should:
- Parse out the country code. (This is everything after the underscore. For example, en_US would be US.)
- Assign the phone_number sdtype to an AnonymizedGeoExtractor with that country code as the default_country.
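The parsing step in the first bullet could be sketched roughly as follows. The helper name country_from_locale is hypothetical and not part of SDV or RDT, and returning None for a locale without an underscore is an assumption, since this issue only covers locales like en_US.

```python
from typing import Optional


def country_from_locale(locale: str) -> Optional[str]:
    """Return everything after the underscore, e.g. 'en_US' -> 'US'.

    Hypothetical helper, not an SDV function. Returning None when there
    is no country part is an assumption made for this sketch.
    """
    _, sep, country = locale.partition('_')
    return country if sep and country else None


print(country_from_locale('en_US'))  # US
```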
synth.get_transformers()
{
    ...,
    'domestic_numbers': AnonymizedGeoExtractor(default_country='US')
}
Additional context
- If there are multiple locales, then do not pass in a default country. In this case, the phone numbers should already include an international country code.
- The AnonymizedGeoExtractor will check for both the default country and international numbers (as a fallback), so no other cases need special handling.
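Taken together, the single-locale and multiple-locale rules might look like the sketch below. The function pick_default_country and the transformer_kwargs dict are hypothetical names used only for illustration. How the data processor actually builds the transformer is not shown in this issue, so the sketch stops at computing the keyword arguments.

```python
from typing import Optional, Sequence


def pick_default_country(locales: Optional[Sequence[str]]) -> Optional[str]:
    """Return a default country only when exactly one locale is given.

    Hypothetical helper, not part of SDV. With zero or several locales it
    returns None, meaning no default_country should be passed.
    """
    if not locales or len(locales) != 1:
        return None
    _, sep, country = locales[0].partition('_')
    return country if sep and country else None


# Only add default_country when a single locale determines it.
transformer_kwargs = {}
country = pick_default_country(['en_US'])
if country is not None:
    transformer_kwargs['default_country'] = country

print(transformer_kwargs)  # {'default_country': 'US'}
```

With locales=['en_US', 'fr_CA'] the helper returns None, so the transformer would be created without a default country and would rely on international country codes in the data.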