
I want the ID column length to match the given regex pattern

Open Veeresh1996 opened this issue 1 year ago • 4 comments

Environment details

If you are already running SDV, please indicate the following details about the environment in which you are running it:

  • SDV version: 1.14.0
  • Python version: 3.12.3
  • Operating System: Windows 11

Problem description

I am using HMASynthesizer for multi-table data. I am able to generate data with the trained model, but for the columns I have marked as IDs, the length of the generated values does not match the real data, even though I have specified a regex pattern. For example, one of the ID columns contains 6-digit values, but the generated output has values of seemingly random lengths. Real data ID value: 300164. Generated value: 2690.

What I already tried

This is the metadata for that specific field:

"patnum": { "sdtype": "id", "regex_format": "^\d{6}$" }

Could you please look into it ASAP? Please let me know if you need any other info.

Thanks in advance

Veeresh1996 avatar Sep 17 '24 13:09 Veeresh1996

Hi @Veeresh1996, SDV is designed to ensure that the synthetic data matches (a) the regex format that you provide and (b) the original data type of the real data. In your case, it seems like the two are in conflict with each other: the regex describes 6-digit strings, but it appears to me that the original data type is an integer.

The regex may correctly produce strings such as "002690", but when converted to an integer, this becomes 2690 (no longer 6 characters). So the regex is not really compatible with the data type. To fix this issue, you would have to address the root cause of the mismatch.

  • You could either update the regex format to ensure that leading 0s are not possible, e.g. enforce that the first digit cannot be 0: [1-9]\d{5}, or
  • Convert the real data to strings before passing it into SDV so that the synthetic data output will also be strings.
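The mismatch described above can be reproduced with plain Python, independent of SDV. This is only a minimal sketch: the helper name `survives_int_round_trip` is made up for illustration, and the cast to `int` stands in for whatever conversion happens when synthetic values are returned in the real data's integer type.

```python
import re

def survives_int_round_trip(value: str) -> bool:
    """Return True if casting the string to int and back leaves it unchanged."""
    return str(int(value)) == value

six_digits = re.compile(r"\d{6}")           # the original pattern, without anchors
no_leading_zero = re.compile(r"[1-9]\d{5}")  # first digit can never be 0

candidate = "002690"
as_int = int(candidate)

print(six_digits.fullmatch(candidate) is not None)       # True: valid for \d{6}
print(as_int, len(str(as_int)))                          # 2690 4: leading zeros are lost
print(no_leading_zero.fullmatch(candidate) is not None)  # False: rejected by the stricter regex
print(survives_int_round_trip("300164"))                 # True: no leading zero to lose
```

Any string accepted by `[1-9]\d{5}` keeps its six characters after an integer cast, which is why the first option resolves the length mismatch without changing the data type.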

npatki avatar Sep 17 '24 14:09 npatki

Hey Neha, thanks, the solution that you provided works for me.

  1. Is it possible to generate duplicate values in ID columns? For example, I want a six-digit value, and it is OK to have duplicate values in the same field.
  2. I have null values in one of my ID columns (which is not a primary key or foreign key, just a unique value). I want to generate the same kind of data, with unique and null values in the respective field. How can I achieve that?

Veeresh1996 avatar Sep 17 '24 17:09 Veeresh1996

Hi @Veeresh1996 I work with Neha, hope you don't mind me stepping in here.

  1. The purpose of an ID column is to uniquely identify rows. In the multi-table context, SDV:
  • will generate a unique ID value for every row that is both set to the id sdtype and also set as the primary key
  • could generate multiple rows with the same ID value in the child table, because it's the foreign key and SDV tries to mirror the cardinality of the real data

At the moment, we don't support duplicate values for ID columns -- the only duplicates that will occur are in foreign key columns, where duplicate values are a side effect of a one-to-many relationship.

  2. SDV will learn the null values in your regular ID columns and try to recreate that ratio in the synthetic data as well.

Here's a quick code snippet you can run with a public dataset to see this for yourself!

from sdv.datasets.demo import download_demo
from sdv.multi_table import HMASynthesizer
import numpy as np

data, metadata = download_demo(
    modality='multi_table',
    dataset_name='fake_hotels'
)

# Non-FK and non-PK column set to ID
metadata.update_column(table_name='hotels', column_name='classification', sdtype='id')

# Adding a single NaN to this ID column
data['hotels'].loc[4, 'classification'] = np.nan

synthesizer = HMASynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample()
synthetic_data['hotels']['classification'].isnull().value_counts()

The final line of code there will return False: 9, True: 1, which corresponds to 1 NaN value in the synthetic data for that column as well.

srinify avatar Sep 25 '24 20:09 srinify

Hi @srinify and @Veeresh1996, just to clarify something on point 1: When you supply a column as an id, it just means that the column represents a label that can help you identify a concept. It does not always have to be unique. (This blog post has some useful information.)

As an example, consider three different id columns:

A. A primary key column might be sdtype id
B. A foreign key column might be sdtype id
C. You may have a generic column that is sdtype id (neither a primary nor foreign key)

In this case, SDV will only enforce that A is unique -- as primary keys must uniquely distinguish every row. SDV will allow B and C to repeat. Just be aware that for C, depending on the regex you provide, you may need to sample a lot of data in order to see the duplicates.
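To get a rough feel for what "a lot of data" could mean, here is a back-of-the-envelope sketch based on the birthday problem. It assumes values are drawn uniformly and independently from all strings matching the regex; that assumption is made here for illustration only and may not reflect how SDV actually generates values for such columns. The function name is hypothetical.

```python
def prob_at_least_one_duplicate(n_rows: int, n_possible: int) -> float:
    """Chance that n_rows uniform, independent draws from n_possible values collide."""
    if n_rows > n_possible:
        return 1.0  # pigeonhole: more rows than distinct values
    p_all_unique = 1.0
    for i in range(n_rows):
        # The (i+1)-th draw must avoid the i values already taken
        p_all_unique *= (n_possible - i) / n_possible
    return 1.0 - p_all_unique

# [1-9]\d{5} can produce 9 * 10**5 distinct six-digit strings
N_POSSIBLE = 9 * 10**5

for n_rows in (100, 1_000, 10_000):
    print(n_rows, round(prob_at_least_one_duplicate(n_rows, N_POSSIBLE), 3))
```

Under this assumption, a narrower regex (fewer possible values) makes repeats show up at much smaller sample sizes, while a wide one needs many more rows.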

@Veeresh1996 Would you like your ID values (with regex) to repeat at higher frequency? This is definitely a feature request that we can track. We prioritize based on how important it is to your use case, so we would appreciate any more info you are able to provide.

npatki avatar Sep 26 '24 18:09 npatki

Hi @Veeresh1996 it's been a while since we've had some activity here on this issue so I'm going to go ahead and close it out. If you have more questions or issues, please feel free to open new GitHub issues :)

srinify avatar Oct 23 '24 20:10 srinify