evaluate() doesn't work with single table metadata

Open VasudevTA opened this issue 3 years ago • 10 comments

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 0.15.0
  • Python version: 3.7.13
  • Operating System: Colab

Error Description

I get an error when running the following code whenever I pass any metadata through the metadata= argument. The code runs fine if I leave that argument out.

Steps to reproduce

from sdv.metrics.demos import load_single_table_demo
from sdv.evaluation import evaluate

real_data, synthetic_data, metadata = load_single_table_demo()

evaluate(synthetic_data = synthetic_data, real_data = real_data, metadata=metadata, aggregate=False)

Error Snippet:
/usr/local/lib/python3.7/dist-packages/sdv/metadata/dataset.py in _dict_metadata(metadata)
    144         """
    145         new_metadata = copy.deepcopy(metadata)
--> 146         tables = new_metadata['tables']
    147         if isinstance(tables, dict):
    148             new_metadata['tables'] = {
KeyError: 'tables'

VasudevTA avatar Jun 21 '22 04:06 VasudevTA

Hi @VasudevTA, thanks for filing this issue. I am able to replicate it. The issue only appears for single-table, so I'll change the title to reflect that.

Cause: It seems as if evaluate is expecting the metadata to be multi-table metadata. If you provide multi-table metadata, then you must also provide multi-table data.
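The failing lookup can be reproduced without SDV. The sketch below is only an illustration: `lookup_tables` is a hypothetical stand-in that mimics the two lines shown in the traceback, and the shape of the single-table metadata (a top-level `fields` key, with an invented field name) is an assumption.

```python
import copy


def lookup_tables(metadata):
    """Mimic the traceback: deep-copy the metadata, then read its 'tables' key."""
    new_metadata = copy.deepcopy(metadata)
    return new_metadata['tables']


# Assumed shape of single-table metadata; the field name is invented.
single_table_metadata = {'fields': {'age': {'type': 'numerical'}}}

try:
    lookup_tables(single_table_metadata)
    error = None
except KeyError as exc:
    # Single-table metadata has no top-level 'tables' key, so the lookup fails.
    error = exc

print(repr(error))
```

Any dict without a top-level 'tables' entry fails the same way, which matches the behaviour reported above.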

Workaround: While we fix this, you can evaluate your single table by representing both the metadata and the data as multi-tables. The following works for me.

from sdv.metrics.demos import load_single_table_demo
from sdv.evaluation import evaluate

real_data, synthetic_data, metadata = load_single_table_demo()

# represent the metadata as multi-table
multi_table_metadata = {
  'tables': {
    'my_table_name': metadata
  }
}

# represent the data as multi table
multi_table_real = { 'my_table_name': real_data }
multi_table_synthetic = { 'my_table_name': synthetic_data }

# then the following works
evaluate(
  real_data=multi_table_real,
  synthetic_data=multi_table_synthetic,
  metadata=multi_table_metadata,
  aggregate=False)

Let me know if you are having problems with the workaround. In the meantime, we'll keep this issue open and use it to track updates for the overall bug.
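If the wrapping is needed in several places, it can be packaged as a small helper. This is a minimal sketch: `as_multi_table` is a hypothetical name, not part of SDV, and the placeholder inputs below stand in for the objects returned by load_single_table_demo().

```python
def as_multi_table(real_data, synthetic_data, metadata, table_name='my_table_name'):
    """Wrap single-table inputs in the multi-table shapes used by the workaround."""
    multi_table_real = {table_name: real_data}
    multi_table_synthetic = {table_name: synthetic_data}
    multi_table_metadata = {'tables': {table_name: metadata}}
    return multi_table_real, multi_table_synthetic, multi_table_metadata


# Placeholder inputs (assumptions); real code would pass the demo tables and metadata.
real_data = ['real rows']
synthetic_data = ['synthetic rows']
metadata = {'fields': {}}

multi_real, multi_synthetic, multi_metadata = as_multi_table(
    real_data, synthetic_data, metadata)

# The three results can then be passed to evaluate() as in the workaround:
# evaluate(real_data=multi_real, synthetic_data=multi_synthetic,
#          metadata=multi_metadata, aggregate=False)
```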

npatki avatar Jun 21 '22 14:06 npatki

Hi Neha - Many thanks for the workaround! This works. But I realized that evaluate now applies the multi-table metrics, which are only a subset of the single-table metrics (9 vs. 29).
The user guide reports:

Multi Single Table Metrics: These metrics simply apply a Single Table Metric on each table in the dataset and report the average score obtained.

Is there a way to include other single table metrics in the evaluate output?

VasudevTA avatar Jun 21 '22 15:06 VasudevTA

@VasudevTA If you use the multi-table approach, this won't be possible.

Another Workaround: Use the single table data without providing the metadata.

from sdv.metrics.demos import load_single_table_demo
from sdv.evaluation import evaluate

real_data, synthetic_data, _ = load_single_table_demo()

evaluate(
  real_data=real_data,
  synthetic_data=synthetic_data,
  aggregate=False)

Caution: If you omit the metadata, you have to make sure that the real and synthetic data tables are properly cleaned up:

  • Remove any keys/IDs (primary keys, foreign keys, etc.)
  • Make sure that categorical columns are represented as strings
  • Make sure that numerical values are showing as ints or floats
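The three cleanup steps could be scripted with pandas along these lines. This is a sketch under assumptions: `clean_for_evaluation` is a hypothetical helper, the column names and values are invented, and which columns count as keys, categorical, or numerical has to be supplied by hand because no metadata is available.

```python
import pandas as pd


def clean_for_evaluation(table, key_columns, categorical_columns, numerical_columns):
    """Return a cleaned copy of the table; the input table is left unchanged."""
    # 1. Remove any keys/IDs (drop() returns a new DataFrame).
    cleaned = table.drop(columns=key_columns, errors='ignore')
    # 2. Represent categorical columns as strings.
    for column in categorical_columns:
        cleaned[column] = cleaned[column].astype(str)
    # 3. Make sure numerical columns are stored as ints or floats.
    for column in numerical_columns:
        cleaned[column] = pd.to_numeric(cleaned[column])
    return cleaned


# Invented example table; apply the same call to both real and synthetic data.
real = pd.DataFrame({
    'student_id': [1, 2, 3],
    'gender': pd.Categorical(['M', 'F', 'M']),
    'salary': ['100', '200.5', '300'],
})

cleaned_real = clean_for_evaluation(
    real,
    key_columns=['student_id'],
    categorical_columns=['gender'],
    numerical_columns=['salary'],
)
```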

npatki avatar Jun 22 '22 15:06 npatki

@npatki Hi Neha, thanks again! The workaround you mentioned was the first thing I tried. Unfortunately, without the metadata, most of the tests don't run. For example, the 8 tests in Machine Learning Efficacy Metrics need a target column. As far as I can see, for evaluate this can only be provided through the metadata. Is there any other way to mark a column as the target in evaluate?

I know that for the individual tests, we can provide a target through the target= argument, e.g. MulticlassDecisionTreeClassifier.compute(real_data, synthetic_data, target='mba_spec'). But then I would have to run each of the metrics separately.
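Running the metrics one by one can at least be wrapped in a loop. The sketch below assumes only the compute(real_data, synthetic_data, target=...) call quoted above; `run_metrics` is a hypothetical helper, and `TargetOverlap` is an invented stand-in metric used so that the example runs without SDMetrics installed.

```python
def run_metrics(metric_classes, real_data, synthetic_data, target):
    """Call compute() on each metric class and collect scores by class name."""
    results = {}
    for metric in metric_classes:
        try:
            results[metric.__name__] = metric.compute(
                real_data, synthetic_data, target=target)
        except Exception as error:
            # Keep the error so that one failing metric does not stop the others.
            results[metric.__name__] = error
    return results


class TargetOverlap:
    """Invented stand-in metric with the same compute() call shape."""

    @staticmethod
    def compute(real_data, synthetic_data, target=None):
        if target is None:
            raise ValueError('A target column is required')
        real_values = {row[target] for row in real_data}
        hits = sum(row[target] in real_values for row in synthetic_data)
        return hits / len(synthetic_data)


# Invented rows; real code would pass the demo tables and SDMetrics classes.
real_rows = [{'mba_spec': 'A'}, {'mba_spec': 'B'}]
synthetic_rows = [{'mba_spec': 'A'}, {'mba_spec': 'C'}]

results = run_metrics([TargetOverlap], real_rows, synthetic_rows, target='mba_spec')
```

With real metrics, the list would hold classes such as MulticlassDecisionTreeClassifier in place of the stand-in.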

VasudevTA avatar Jun 23 '22 18:06 VasudevTA

You are correct. Unfortunately, this cannot easily be remedied until we resolve this bug.

npatki avatar Jun 24 '22 01:06 npatki

Yes, it looks like it. I will look forward to the release of the next version. Thanks!

VasudevTA avatar Jun 24 '22 01:06 VasudevTA

Hi

I believe part of the problem is in sdv/evaluation, line 129, where there is an if for single_table. There it is assumed that real and synth data are dicts.

haaksb avatar Jun 29 '22 08:06 haaksb

Hi @haaksb, nice catch! However, removing those lines doesn't seem to fix the issue for me. The bug seems more complex than that; we'll have to think through the overall interactions with the metadata object.

While we may not be able to fix this bug in the following release, the SDV team is actively working on improving metadata and the overall metrics experience. @VasudevTA we'll update this bug as we make progress. In the meantime, do feel free to run the SDMetrics separately, although I realize it's not the most elegant workaround.

npatki avatar Jun 30 '22 21:06 npatki

Hi, I had the same problem as described above, and I thought I had solved it by providing a table_name kwarg to evaluate(), as in:

metadata = {'tables': {'data': tablemeta}}  # tablemeta being the metadata of the single table I am using
evaluate(
  real_data = realData,
  synthetic_data = syntheticData,
  metrics = { 'CSTest', 'KSTest', 'BNLogLikelihood', 'GMLogLikelihood', 'LogisticDetection', 'SVCDetection' },
  metadata = metadata,
  table_name = 'data',
  aggregate = False)

I assume this works since it does produce sensible output.

garrgravarr avatar Jul 15 '22 09:07 garrgravarr

Hi @garrgravarr, yes, you can use the table_name parameter to run metrics on a particular table. However, it only lets you use metrics that are defined at the multi-table level.

For example, the MulticlassDecisionTreeClassifier is only available for single table. When I include it in the parameters, it gives me an error.

evaluate(
  real_data=multi_table_real,
  synthetic_data=multi_table_synthetic,
  metadata=multi_table_metadata,
  table_name='my_table_name',
  metrics=['MulticlassDecisionTreeClassifier'],
  aggregate=False)

/usr/local/lib/python3.7/dist-packages/sdv/evaluation.py in _select_metrics(synthetic_data, metrics)
     89                 final_metrics[metric] = metric_classes[metric]
     90             except KeyError:
---> 91                 raise ValueError(f'Unknown {modality} metric: {metric}')
     92 
     93     return final_metrics, modality

ValueError: Unknown multi-table metric: MulticlassDecisionTreeClassifier
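The error is consistent with metric names being looked up in a registry chosen by the type of the data. The sketch below is an assumption-heavy illustration, not SDV's code: only the try/except lookup and the error message come from the traceback, while the dict check, the `select_metrics` name, and the contents of both registries are hypothetical.

```python
# Hypothetical registries; the values are placeholders for metric classes.
SINGLE_TABLE_METRICS = {
    'CSTest': 'single-table CSTest',
    'MulticlassDecisionTreeClassifier': 'single-table classifier metric',
}
MULTI_TABLE_METRICS = {
    'CSTest': 'multi-table CSTest',
}


def select_metrics(synthetic_data, metrics):
    """Pick a registry from the data type, then look up each metric name."""
    # Assumption: dict-shaped data is treated as multi-table.
    if isinstance(synthetic_data, dict):
        modality, metric_classes = 'multi-table', MULTI_TABLE_METRICS
    else:
        modality, metric_classes = 'single-table', SINGLE_TABLE_METRICS

    final_metrics = {}
    for metric in metrics:
        try:
            final_metrics[metric] = metric_classes[metric]
        except KeyError:
            raise ValueError(f'Unknown {modality} metric: {metric}')

    return final_metrics, modality


try:
    select_metrics({'my_table_name': []}, ['MulticlassDecisionTreeClassifier'])
    message = None
except ValueError as error:
    message = str(error)
```

Under these assumptions, wrapping the data in a dict switches the lookup to the multi-table registry, so a metric that exists only for single tables is rejected by name.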

npatki avatar Jul 15 '22 14:07 npatki

Great news everyone! We now have a new SDV 1.0 (Beta!) release that allows you to evaluate single table models. We have also integrated with the SDMetrics Quality Report and other visualization functions.

For more information, see the new Single Table Evaluation docs.

npatki avatar Mar 09 '23 23:03 npatki