SDV
SDV copied to clipboard
Add support for composite keys
Problem Description
Currently, SDV only supports individual column keys as primary keys.
We should extend this to support composite keys (multiple columns that uniquely identify each row when put together).
This should work for multi table models and for specifying table relationships based on composite keys.
I think that this has some overlap with #361 Maybe we could close the other one, and open another issue for the extra types
Any update on this feature? I suppose I could work around by concatenation, but even basic relational models rely on composite keys. Unfortunately, this renders SDV unusable with the current models I have.
Any news about this one?
Hi everyone, this is in our roadmap for upcoming features. We'll update this as we make more progress.
Workaround
In the meantime, a workaround is to create a pipeline of manually concatenating the composite key columns in both the parent and child tables.
# Create a column with the concatenated id
# Use a separator such as ','
parent_table['concatenated_id'] = parent_table[['id_0', 'id_1']].astype(str).agg(','.join, axis=1)
child_table['concatenated_id'] = child_table[['id_0', 'id_1']].astype(str).agg(','.join, axis=1)
# drop the original columns
parent_table.drop(['id_0', 'id_1'], axis=1, inplace=True)
child_table.drop(['id_0', 'id_1'], axis=1, inplace=True)
# now use this data with the HMA1 model
# Note: Be sure to update the metadata to reflect the new columns you've created & dropped
At the end, you can split them up again using the separator
synthetic_parent[['id_0', 'id_1']] = synthetic_parent['concatenated_id'].str.split(',', expand=True)
synthetic_child[['id_0', 'id_1']] = synthetic_child['concatenated_id'].str.split(',', expand=True)
Hi everyone, this is in our roadmap for upcoming features. We'll update this as we make more progress.
Workaround
In the meantime, a workaround is to create a pipeline of manually concatenating the composite key columns in both the parent and child tables.
# Create a column with the concatenated id # Use a separator such as ',' parent_table['concatenated_id'] = parent_table[['id_0', 'id_1']].astype(str).agg(','.join, axis=1) child_table['concatenated_id'] = child_table[['id_0', 'id_1']].astype(str).agg(','.join, axis=1) # drop the original columns parent_table.drop(['id_0', 'id_1'], axis=1, inplace=True) child_table.drop(['id_0', 'id_1'], axis=1, inplace=True) # now use this data with the HMA1 model # Note: Be sure to update the metadata to reflect the new columns you've created & dropped
At the end, you can split them up again using the separator
synthetic_parent[['id_0', 'id_1']] = synthetic_parent['concatenated_id'].str.split(',', expand=True) synthetic_child[['id_0', 'id_1']] = synthetic_child['concatenated_id'].str.split(',', expand=True)
Hi @npatki, what if I have metadata.json file that defines the creation of the tables and the relative columns? I guess that your code is for a metadata creation in a python file
Hi @montasIET, my code modifies the actual data to concatenate the composite key into a single key. After doing it, you would have to modify the metadata with the name of the new, concatenated column (and drop the original columns).
Eg. If the keys are named 'id_0'
and 'id_1'
, and you concatenate them under the name 'concatenated_id'
then you'd modify this in the metadata too:
'tables': {
'TABLE_NAME': {
'primary_key': 'concatenated_id',
'fields': {
'concatenated_id': { 'type': 'id', 'subtype': 'string', 'regex': '[0-9]{3},[0-9]{3}' }
}
}
}
You would do the same to a foreign key that refers to that primary key.
Hi @montasIET, my code modifies the actual data to concatenate the composite key into a single key. After doing it, you would have to modify the metadata with the name of the new, concatenated column (and drop the original columns).
Eg. If the keys are named
'id_0'
and'id_1'
, and you concatenate them under the name'concatenated_id'
then you'd modify this in the metadata too:'tables': { 'TABLE_NAME': { 'primary_key': 'concatenated_id', 'fields': { 'concatenated_id': { 'type': 'id', 'subtype': 'string', 'regex': '[0-9]{3},[0-9]{3}' } } } }
You would do the same to a foreign key that refers to that primary key.
ok now it's clear. I'll try and I'll let you know, thanks!
Hi @npatki, I tried your solution and it worked, but not with the desired effect. I mean that the primary key of the main table and the referred key of the secondary table are generated consecutively, starting from 0000,00000 and so on according to how much num_rows I set. The model that I use is HMA1. Is there a way to sample a concatenated id that is learnt from original fields in the fit method?
eg. original data: (2019, 23172), (2019, 67190) current generated data: (0000, 00000) what I want: (four digits similar to the original, five digits similar to the original)
P.S. other columns that are not keys are generated in a proper way.
I got the same generated generated primary keys 000,000, 000,001 and so on using HMA1. Is there a way around this? Thanks!
Hi @npatki any news? I'm still stucked on this problem
Hi @Is-MMaker and @montasIET, are you still having problems with this?
We recently released the new SDV 1.0 library. And while we haven't yet added composite key support, it is now a simpler API.
what I want: (four digits similar to the original, five digits similar to the original)
I'm not exactly sure what the desired effect is here. What does "similar to the original" mean? Note that for synthetic data, we cannot simply repeat the same primary keys as the original data. The reason is that you may want to produce synthetic data that is 2x or 3x the original size -- so we would run out of primary keys.
So the main way this works is if you provide a Regex so the SDV knows how to generate the values. With SDV 1.0 and the composite key workaround above, you could include a stricter regex if that's what you need.
regex_format: "2[0-9]{3},[0-9]{5}"
Is it a problem that the values are generated incrementally? If so, I'd suggest filing a new issue for tracking. It would be nice to keep this issue about composite key support.