Fixed combinations Constraint

Open Pavan-Kalyan1432 opened this issue 1 year ago • 3 comments

Environment details

If you are already running SDV, please indicate the following details about the environment in which you are running it:

  • SDV version:
  • Python version:
  • Operating System:

Problem description

What I already tried

from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer, CTGANSynthesizer, TVAESynthesizer
import pandas as pd
import os

real_data = pd.read_csv('data//BILLING.csv').fillna("")
real_data = real_data.dropna(axis=1, how='all')
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real_data)
metadata.update_columns_metadata(
    {
        "First Name":{"sdtype":"categorical"},
        "Last Name":{"sdtype":"categorical"},
        "Middle Name":{"sdtype":"categorical"},
        "Full Name":{"sdtype":"categorical"},
        "Date of Birth":{"sdtype":"date"},
        "National ID":{"sdtype":"categorical"}
    }
)

metadata.update_column("Phone Number", pii=False)

metadata.remove_primary_key()

path = 'output//metadata.json'
if os.path.exists(path):
    os.remove(path)
metadata.save_to_json(path)

my_constraint = {
    'constraint_class' : "FixedCombinations",
    'constraint_parameters' : {
        'column_names' : ['First Name', 'Middle Name', 'Last Name', 'Full Name']
    }
}

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.add_constraints(constraints=[my_constraint])
synthesizer.fit(real_data)

for col in real_data.columns:
    null_count = real_data[col].isnull().sum()
    empty_string_count = (real_data[col] == "").sum()
    total_nulls = null_count + empty_string_count
    total_cells = real_data.shape[0]
    null_percentage = (total_nulls / total_cells) * 100 if total_cells > 0 else 0
    # Built-in round() also works for the plain-int 0 fallback, which has no .round() method
    null_percent = round(null_percentage, 2)
    print(f"{col} - {null_percent}%")

s = []

while True:
    column = input("Enter the column name to fix (or 'exit' to stop): ")
    if column == "exit":
        break
    if column not in real_data.columns:
        print("Column not found")
        continue
    s.append(column)

if s:
    fixed_columns = real_data[s]
    synthetic_data = synthesizer.sample_remaining_columns(fixed_columns, max_tries_per_batch=200)
else:
    synthetic_data = synthesizer.sample(num_rows=50)

synthetic_data.to_csv('output//synthetic_data_1.csv', index=False)

Here the FixedCombinations constraint repeats some combinations, but it does not consider all of them. What can I do to make it consider all the combinations of first name, middle name, last name, and full name in the real data?

Oct 04 '24 07:10 Pavan-Kalyan1432

Hi @Pavan-Kalyan1432 can you clarify what you mean by "repeating the combinations but it is not considering all the combinations"?

  • Is the synthesizer re-using the same combinations of values from your real data?
  • Is it only re-using some of the combinations?

When generating synthetic data, this constraint ensures that the synthesizer only uses combinations of values in these 4 columns that already exist in your real data. So, for example, if the only combination in your rows is "Jack", "John", "Jay", and "Jack John Jay" for your 4 columns, then this will be the only combination that shows up in the synthetic data.
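
To check this on your own output, a minimal sketch along these lines may help. It is plain Python rather than an SDV API: `combination_report` is a hypothetical helper, and the sample rows are made up for illustration. It compares the set of value tuples in the real data against the set in the synthetic data.

```python
def combination_report(real_rows, synthetic_rows):
    """Compare the value combinations found in two collections of row tuples."""
    real = set(real_rows)
    synthetic = set(synthetic_rows)
    return {
        # Combinations that appear only in the synthetic data.
        # With FixedCombinations applied, this is expected to be empty.
        "new": synthetic - real,
        # Real combinations that were not sampled in this batch.
        "missing": real - synthetic,
        # Share of real combinations that appear at least once.
        "coverage": len(real & synthetic) / len(real) if real else 0.0,
    }


# Made-up rows: (First Name, Middle Name, Last Name, Full Name)
real_rows = [
    ("Jack", "John", "Jay", "Jack John Jay"),
    ("Ann", "Bea", "Cole", "Ann Bea Cole"),
    ("Dev", "Eli", "Fox", "Dev Eli Fox"),
]
synthetic_rows = [
    ("Jack", "John", "Jay", "Jack John Jay"),  # repeated combination
    ("Jack", "John", "Jay", "Jack John Jay"),
    ("Ann", "Bea", "Cole", "Ann Bea Cole"),
]

report = combination_report(real_rows, synthetic_rows)
print("new:", report["new"])
print("missing:", report["missing"])
print("coverage:", report["coverage"])
```

With pandas DataFrames, the row tuples can be produced with `df[column_names].itertuples(index=False, name=None)`. Repeated combinations are allowed by the constraint, so they do not show up as a problem in this report; only the "new" entry would indicate a violation.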

Oct 08 '24 00:10 srinify

For example, it repeats the same combination multiple times, and it also does not consider all the combinations that are in the real data.

Oct 08 '24 06:10 Pavan-Kalyan1432

Hi @Pavan-Kalyan1432, if I may jump in here: The purpose of the FixedCombinations constraint is only to fix the combinations that are created. Adding this constraint will prevent new permutations from being synthesized in the columns you specify.

If you sample many more times, then I think that, due to random chance, you will eventually end up creating all the combinations that were in the original data.
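
As a rough illustration of that random-chance effect, here is a toy simulation in plain Python. It is not how SDV samples internally: it assumes each row picks one combination independently with a fixed probability, the 100 combinations and their skewed weights are invented, and `coverage_after` is a hypothetical helper.

```python
import random


def coverage_after(num_rows, weights, seed=0):
    """Fraction of combinations seen at least once after drawing num_rows rows."""
    rng = random.Random(seed)
    combos = range(len(weights))
    # Draw one combination per row, with replacement, according to the weights.
    drawn = rng.choices(combos, weights=weights, k=num_rows)
    return len(set(drawn)) / len(weights)


# 100 hypothetical combinations: a few common ones and many rare ones.
weights = [1 / (rank + 1) for rank in range(100)]

for num_rows in (50, 500, 5000):
    print(num_rows, coverage_after(num_rows, weights))
```

A batch of 50 rows, as sampled in the script above, can contain at most 50 distinct combinations no matter how the sampling works, so any real data with more distinct name combinations than that cannot be fully covered in one such batch. Rare combinations need far more rows before they are likely to appear.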

However, preventing repetition is not the purpose of this constraint. May I ask why you want to prevent repetition in your data? This indicates to me that in your synthetic data, you just want the exact same names to appear in the exact same rows as in your real data. Is that correct? If you could provide more information on your usage (what you are trying to accomplish with synthetic data), we can better guide you to a solution. Thanks.

Oct 08 '24 14:10 npatki

Hi @Pavan-Kalyan1432 we hope our responses cleared things up! Since we haven't heard from you in a while, I'm going to move forward with closing this issue out. Please don't hesitate to open a new issue or ask in our Slack for new questions!

Oct 23 '24 20:10 srinify

How do I manage inter-column dependencies? For example, we have 3 columns: date of birth, date of death, and age. In the synthetic data, they do not come out properly. Please give me the answer for both single-table and multi-table data.

Dec 06 '24 06:12 Pavan-Kalyan1432

Hi @Pavan-Kalyan1432, the original issue you filed was for FixedCombinations for first name and last name. Are you still having problems with this?

Your most recent question is for a different topic so I have filed a new issue here: https://github.com/sdv-dev/SDV/issues/2318

We can continue the discussion about your inter-column dependency (date of birth, date of death, and age) in the new issue.

Dec 10 '24 16:12 npatki