evaluate() doesn't work with single table metadata
Environment Details
Please indicate the following details about the environment in which you found the bug:
- SDV version: 0.15.0
- Python version: 3.7.13
- Operating System: Colab
Error Description
I get an error when running the following code whenever I pass metadata via the `metadata=` argument. It runs fine if I comment out that line.
Steps to reproduce
from sdv.metrics.demos import load_single_table_demo
from sdv.evaluation import evaluate
real_data, synthetic_data, metadata = load_single_table_demo()
evaluate(synthetic_data = synthetic_data, real_data = real_data, metadata=metadata, aggregate=False)
Error Snippet:
/usr/local/lib/python3.7/dist-packages/sdv/metadata/dataset.py in _dict_metadata(metadata)
144 """
145 new_metadata = copy.deepcopy(metadata)
--> 146 tables = new_metadata['tables']
147 if isinstance(tables, dict):
148 new_metadata['tables'] = {
KeyError: 'tables'
Hi @VasudevTA, thanks for filing this issue. I am able to replicate it. The issue only appears for single-table, so I'll change the title to reflect that.
Cause: It seems that evaluate expects the metadata to be multi-table metadata. And if you provide multi-table metadata, then you must also provide multi-table data.
Workaround: While we fix this, you can evaluate your single table by representing both the metadata and the data as multi-tables. The following works for me.
from sdv.metrics.demos import load_single_table_demo
from sdv.evaluation import evaluate
real_data, synthetic_data, metadata = load_single_table_demo()
# represent the metadata as multi-table
multi_table_metadata = {
'tables': {
'my_table_name': metadata
}
}
# represent the data as multi table
multi_table_real = { 'my_table_name': real_data }
multi_table_synthetic = { 'my_table_name': synthetic_data }
# then the following works
evaluate(
real_data=multi_table_real,
synthetic_data=multi_table_synthetic,
metadata=multi_table_metadata,
aggregate=False)
Let me know if you are having problems with the workaround. In the meantime, we'll keep this issue open and use it to track updates for the overall bug.
Hi Neha - Many Thanks for the workaround!
This works. But I realized that evaluate now applies the multi-table metrics, which are only a subset of the single-table metrics (29 vs. 9).
The user guide reports:
Multi Single Table Metrics: These metrics simply apply a Single Table Metric on each table in the dataset and report the average score obtained.
Is there a way to include other single table metrics in the evaluate output?
@VasudevTA If you use the multi-table approach, this won't be possible.
Another Workaround: Use the single table data without providing the metadata.
from sdv.metrics.demos import load_single_table_demo
from sdv.evaluation import evaluate
real_data, synthetic_data, _ = load_single_table_demo()
evaluate(
real_data = real_data,
synthetic_data = synthetic_data,
aggregate=False)
Caution: If you omit the metadata, you'll have to ensure the real and synthetic data tables are properly cleaned up:
- Remove any keys/IDs (primary keys, foreign keys, etc.)
- Make sure that categorical columns are represented as strings
- Make sure that numerical values are showing as ints or floats
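For reference, the cleanup steps above can be sketched with plain pandas. This is a minimal, hypothetical example (the helper name and the `user_id`/`gender`/`age` columns are made up for illustration), not part of the SDV API:

```python
import pandas as pd

def clean_for_evaluate(df, id_columns=("user_id",)):
    """Hypothetical helper: prepare a table for metadata-free evaluation.

    - Drops key/ID columns (primary keys, foreign keys, etc.)
    - Casts categorical columns to plain strings
    - Leaves int/float columns as-is
    """
    df = df.drop(columns=[c for c in id_columns if c in df.columns])
    for col in df.select_dtypes(include=["category"]).columns:
        df[col] = df[col].astype(str)
    return df

real = pd.DataFrame({
    "user_id": [1, 2, 3],                       # primary key: should be removed
    "gender": pd.Categorical(["M", "F", "M"]),  # categorical: cast to str
    "age": [25, 30, 41],                        # numerical: already int
})
cleaned = clean_for_evaluate(real)
# cleaned has only the 'gender' (str) and 'age' (int) columns
```

You'd apply the same cleanup to both the real and the synthetic table before calling evaluate without metadata.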
@npatki
Hi Neha , Thanks again!
The workaround you mentioned was the first thing I tried. Unfortunately, without the metadata, most of the tests don't run. For example, the 8 tests in the Machine Learning Efficacy Metrics need a target column. As far as I can see, for evaluate this can only be provided through the metadata. Is there any other way to indicate which column is the target in evaluate?
I know that for the individual tests, we can provide a target through the target= argument, e.g.:
MulticlassDecisionTreeClassifier.compute(real_data, synthetic_data, target='mba_spec')
But then I would have to run each of the metrics separately.
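One way to soften the "run each metric separately" pain is a small loop over metric classes that share the `compute(real, synthetic, ...)` interface. The sketch below uses stub metric classes so it runs anywhere; with SDV 0.x you would instead pass the real classes (e.g. `MulticlassDecisionTreeClassifier` from `sdv.metrics.tabular`) and your actual data. The helper name and the stub scores are illustrative assumptions:

```python
def run_metrics(metric_classes, real_data, synthetic_data, **kwargs):
    """Call each metric's compute() and collect scores keyed by class name."""
    scores = {}
    for metric in metric_classes:
        scores[metric.__name__] = metric.compute(real_data, synthetic_data, **kwargs)
    return scores

# Stub metrics mimicking the compute() classmethod interface of SDMetrics
class StubAccuracy:
    @staticmethod
    def compute(real, synthetic, target=None):
        return 0.9  # placeholder score

class StubF1:
    @staticmethod
    def compute(real, synthetic, target=None):
        return 0.8  # placeholder score

scores = run_metrics(
    [StubAccuracy, StubF1],
    real_data=None,          # would be your real DataFrame
    synthetic_data=None,     # would be your synthetic DataFrame
    target="mba_spec",       # forwarded to every metric
)
# scores -> {'StubAccuracy': 0.9, 'StubF1': 0.8}
```

This keeps the per-metric `target=` option from the snippet above while producing one combined result dict, similar in shape to `evaluate(..., aggregate=False)`.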
You are correct. Unfortunately, this cannot easily be remedied until we resolve this bug.
Yes, it looks like it. I will look forward to the release of the next version. Thanks!
Hi
I believe part of the problem is in sdv/evaluation, line 129, where there is an if for single_table. There it is assumed that real and synth data are dicts.
Hi @haaksb, nice catch! However, removing those lines doesn't seem to fix the issue for me. The bug seems more complex than that; we'll have to think through the overall interactions with the metadata object.
While we may not be able to fix this bug in the next release, the SDV team is actively working on improving metadata and the overall metrics experience. @VasudevTA we'll update this bug as we make progress. In the meantime, do feel free to run the SDMetrics separately, although I realize it's not the most elegant workaround.
Hi,
I had the same problem as described above, and I thought I had solved it by providing a table_name kwarg to evaluate() like in:
metadata = {'tables': {'data': tablemeta}} # tablemeta being the metadata of the single table I am using
evaluate(
real_data = realData,
synthetic_data = syntheticData,
metrics = { 'CSTest', 'KSTest', 'BNLogLikelihood', 'GMLogLikelihood', 'LogisticDetection', 'SVCDetection' },
metadata = metadata,
table_name = 'data',
aggregate = False)
I assume this works since it does produce sensible output.
Hi @garrgravarr, yes, you can use the table_name parameter to run metrics on a particular table. However, it only allows you to input metrics defined at the multi-table level.
For example, the MulticlassDecisionTreeClassifier is only available for single table. When I include it in the parameters, it gives me an error.
evaluate(
real_data=multi_table_real,
synthetic_data=multi_table_synthetic,
metadata=multi_table_metadata,
table_name='my_table_name',
metrics=['MulticlassDecisionTreeClassifier'],
aggregate=False)
/usr/local/lib/python3.7/dist-packages/sdv/evaluation.py in _select_metrics(synthetic_data, metrics)
89 final_metrics[metric] = metric_classes[metric]
90 except KeyError:
---> 91 raise ValueError(f'Unknown {modality} metric: {metric}')
92
93 return final_metrics, modality
ValueError: Unknown multi-table metric: MulticlassDecisionTreeClassifier
Great news everyone! We now have a new SDV 1.0 (Beta!) release that allows you to evaluate single table models. We have also integrated with the SDMetrics Quality Report and other visualization functions.
For more information, see the new Single Table Evaluation docs.