SDV
Allow keys to be used in constraints where relevant (e.g. a foreign key in a Unique constraint)
Environment Details
- SDV version: 0.16.0
- Python version: 3.8
- Operating System: Ubuntu 20.04.4
Error Description
I tried to create relational data with 2 tables:
- a table of sections with their id, rank, and number of elements in a section, and
- a table of elements with their id, which section they belong to, rank within the section, and type of element
For elements, I added a Unique constraint on the combination of the columns section and rank, so that the rank is unique per section.
However, I now get the following warning on the line model.fit(data): UserWarning: Unique cannot be transformed because columns: ['section'] were not found. Using the reject sampling approach instead.
After that, model.sample() does not give me any new data.
Steps to reproduce
I use the following code:
from sdv.metadata.dataset import Metadata
from sdv.relational import HMA1
md = Metadata("test-data/metadata-test-2.json")
data = md.load_tables()
model = HMA1(md)
model.fit(data)
new_data = model.sample()
An extract from elements-test-2.csv:
element_id,section,rank,type
1,58964,1,label
2,58964,2,forum
3,58967,2,page
4,58967,1,book
An extract from sections-test-2.csv:
section_id,rank,elements_amount
58964,1,2
58967,2,4
My metadata is as follows:
{
    "tables": {
        "sections": {
            "fields": {
                "section_id": { "type": "id", "subtype": "integer" },
                "rank": { "type": "numerical", "subtype": "integer" },
                "elements_amount": { "type": "numerical", "subtype": "integer" }
            },
            "path": "sections-test-2.csv",
            "primary_key": "section_id",
            "constraints": [
                {
                    "constraint": "sdv.constraints.Unique",
                    "column_names": ["rank"]
                }
            ]
        },
        "elements": {
            "fields": {
                "element_id": { "type": "id", "subtype": "integer" },
                "rank": { "type": "numerical", "subtype": "integer" },
                "type": { "type": "categorical" },
                "section": {
                    "type": "id",
                    "subtype": "integer",
                    "ref": {
                        "table": "sections",
                        "field": "section_id"
                    }
                }
            },
            "path": "elements-test-2.csv",
            "primary_key": "element_id",
            "constraints": [
                {
                    "constraint": "sdv.constraints.Unique",
                    "column_names": ["rank", "section"]
                }
            ]
        }
    }
}
The problematic part seems to be
"constraints": [
    {
        "constraint": "sdv.constraints.Unique",
        "column_names": ["rank", "section"]
    }
]
since the error is not thrown when I remove this part.
Explanation
Neha explained: "This is happening because you have a foreign key column involved in the Unique constraint. SDV treats primary/foreign keys in a separate layer so it is no longer “found” when it gets to the constraint stage."
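To spot this situation before fitting, a small check over the metadata can list every constraint that names a key column. This is only a sketch: the helper name constraints_using_keys is hypothetical and not part of SDV, and it assumes the metadata layout shown above (a "primary_key" entry per table and a "ref" entry on foreign key fields).

```python
def constraints_using_keys(metadata: dict) -> list:
    """Return (table, constraint, column) triples where a constraint names a key column.

    Hypothetical helper, not part of SDV. Assumes the metadata dict layout
    used in this issue.
    """
    hits = []
    for table_name, table in metadata["tables"].items():
        # Key columns: the primary key plus every field that carries a "ref".
        keys = {table.get("primary_key")}
        keys |= {name for name, field in table["fields"].items() if "ref" in field}
        keys.discard(None)
        for constraint in table.get("constraints", []):
            for column in constraint.get("column_names", []):
                if column in keys:
                    hits.append((table_name, constraint["constraint"], column))
    return hits


# Trimmed copy of the metadata from this issue.
metadata = {
    "tables": {
        "sections": {
            "fields": {
                "section_id": {"type": "id", "subtype": "integer"},
                "rank": {"type": "numerical", "subtype": "integer"},
            },
            "primary_key": "section_id",
            "constraints": [
                {"constraint": "sdv.constraints.Unique", "column_names": ["rank"]}
            ],
        },
        "elements": {
            "fields": {
                "element_id": {"type": "id", "subtype": "integer"},
                "rank": {"type": "numerical", "subtype": "integer"},
                "section": {
                    "type": "id",
                    "subtype": "integer",
                    "ref": {"table": "sections", "field": "section_id"},
                },
            },
            "primary_key": "element_id",
            "constraints": [
                {
                    "constraint": "sdv.constraints.Unique",
                    "column_names": ["rank", "section"],
                }
            ],
        },
    }
}

print(constraints_using_keys(metadata))
# -> [('elements', 'sdv.constraints.Unique', 'section')]
```

With a metadata file on disk, the same function could be applied to the result of json.load.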
Workaround
I have found the following workaround:
I duplicated the column that was not found, so that I can use one copy as the foreign key and the other in my Unique constraint. SDV still learns that the two columns are identical, so in the end I receive unique ranks per section.
Extract of my new elements table:
element_id,section,rank,type,section_alt
1,58964,1,label,58964
2,58964,2,forum,58964
3,58967,2,page,58967
4,58967,1,book,58967
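The duplicated column can be produced with pandas. This is a minimal sketch that reads the extract from an in-memory string so it runs on its own; with the real file, pd.read_csv("elements-test-2.csv") would presumably take its place.

```python
import io

import pandas as pd

# The original extract of elements-test-2.csv, inlined for the example.
csv_text = """element_id,section,rank,type
1,58964,1,label
2,58964,2,forum
3,58967,2,page
4,58967,1,book
"""

elements = pd.read_csv(io.StringIO(csv_text))

# Copy the foreign key column so the copy can be used in the Unique constraint.
elements["section_alt"] = elements["section"]

print(elements.to_csv(index=False))
```

The result would then be written back with elements.to_csv(path, index=False) so that the path in the metadata points at the extended table.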
My new metadata:
{
    "tables": {
        "sections": {
            "fields": {
                "section_id": { "type": "id", "subtype": "integer" },
                "rank": { "type": "numerical", "subtype": "integer" },
                "elements_amount": { "type": "numerical", "subtype": "integer" }
            },
            "path": "sections-test-2.csv",
            "primary_key": "section_id",
            "constraints": [
                {
                    "constraint": "sdv.constraints.Unique",
                    "column_names": ["rank"]
                }
            ]
        },
        "elements": {
            "fields": {
                "element_id": { "type": "id", "subtype": "integer" },
                "rank": { "type": "numerical", "subtype": "integer" },
                "type": { "type": "categorical" },
                "section": {
                    "type": "id",
                    "subtype": "integer",
                    "ref": {
                        "table": "sections",
                        "field": "section_id"
                    }
                },
                "section_alt": { "type": "numerical", "subtype": "integer" }
            },
            "path": "elements-test-2.csv",
            "primary_key": "element_id",
            "constraints": [
                {
                    "constraint": "sdv.constraints.Unique",
                    "column_names": ["rank", "section_alt"]
                }
            ]
        }
    }
}
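To confirm that the workaround really yields unique ranks per section, the sampled elements table can be checked for repeated (section, rank) pairs. This is a sketch under assumptions: duplicated_ranks is a hypothetical helper, and the DataFrames below are made-up stand-ins for a sampled table, not real SDV output.

```python
import pandas as pd


def duplicated_ranks(elements: pd.DataFrame,
                     section_col: str = "section",
                     rank_col: str = "rank") -> pd.DataFrame:
    """Return all rows whose (section, rank) pair occurs more than once.

    Hypothetical helper for checking a sampled table; not part of SDV.
    """
    mask = elements.duplicated(subset=[section_col, rank_col], keep=False)
    return elements[mask]


# Stand-in for a sampled elements table in which every rank is unique per section.
good = pd.DataFrame({
    "element_id": [1, 2, 3, 4],
    "section": [58964, 58964, 58967, 58967],
    "rank": [1, 2, 2, 1],
})

# Stand-in in which rank 1 appears twice in section 58964.
bad = good.assign(rank=[1, 1, 2, 1])

print(duplicated_ranks(good).empty)       # no violations expected
print(len(duplicated_ranks(bad)))         # both offending rows are returned
```

An empty result means the constraint holds; otherwise the returned rows show exactly which pairs collide.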
Suggestion
- A more descriptive error message
- Possibly internal handling of this case by SDV, without users needing to find a workaround
Thanks for filing @LiFaytheGoblin, we will investigate and report more info here.
For SDV developers: I think it's fine if such a constraint falls back to our reject sampling approach (instead of transform). It's strange that reject sampling is failing though. Perhaps we are doing it too early, before the foreign key is added back in?
Update: Seems like we explicitly do not support any keys (foreign or primary) in constraints at the moment.
I'll turn this into a feature request and update the title to reflect this.