I want to be able to modify synthetic_sample_size in the diagnostic report (both single-table and multi-table).

Open echatzikyriakidis opened this issue 1 year ago • 4 comments

Problem Description

I cannot override the synthetic sample size that the diagnostic report uses for the NewRowSynthesis metric, in either the single-table or the multi-table diagnostic report. Currently I override it by writing to a private attribute, which is not a good solution:

single_table_diagnostic_report._metric_args['NewRowSynthesis']['synthetic_sample_size'] = np.inf

It would be great if there were a parameter for this in the DiagnosticReport() classes in the single_table and multi_table packages.

echatzikyriakidis avatar Mar 13 '23 17:03 echatzikyriakidis

Hi @echatzikyriakidis, nice to meet you and thanks for filing the issue!

I'm curious how you would plan to use such a feature and how it would help you create a better DiagnosticReport. From the example, it seems like you want to set this value to infinity. How would this help you? E.g., is there something particular to your data that requires this?

npatki avatar Mar 13 '23 21:03 npatki

Hi again @npatki,

I am hoping to be able to control the sample size of NewRowSynthesis in DiagnosticReport (maybe through a constructor parameter?). For example, sometimes I would like to set it to None for the default behavior (check all rows), and at other times to limit the sample size myself.

I want to see if my synthetic dataset has novel rows or not.

echatzikyriakidis avatar Mar 13 '23 21:03 echatzikyriakidis

Hi @echatzikyriakidis, in the NewRowSynthesis metric, there is a tradeoff between the amount of data that we use (synthetic_sample_size) and the amount of time it takes to run the report. For very large datasets, it may be infeasible to use the full dataset.

I'm curious how you are determining when to use None and when to use another parameter. How do you decide what number to set it to?

Note that you may always apply the NewRowSynthesis metric yourself outside of the report, if you'd like to play around with this parameter.
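
To make the trade-off concrete, here is a minimal sketch in plain Python of what a synthetic sample size controls. It is an illustration only, not SDMetrics' implementation: the function name new_row_fraction and the toy data are hypothetical, and it uses exact row matching, whereas the real metric may treat numerical values that are close as matches. Fewer sampled rows means less work per run, but the score becomes an estimate that varies with the sample.

```python
import random


def new_row_fraction(real_rows, synthetic_rows, synthetic_sample_size=None, seed=0):
    """Fraction of (optionally sampled) synthetic rows that never appear in the real data.

    Hypothetical helper for illustration; rows are hashable tuples and are
    compared by exact equality.
    """
    rows = list(synthetic_rows)
    # Subsample only when a finite size smaller than the dataset is requested.
    # None (and float("inf")) therefore mean "check every synthetic row".
    if synthetic_sample_size is not None and synthetic_sample_size < len(rows):
        rows = random.Random(seed).sample(rows, synthetic_sample_size)
    if not rows:
        raise ValueError("no synthetic rows to evaluate")
    real_set = set(real_rows)  # set lookup keeps each membership check cheap
    new_count = sum(1 for row in rows if row not in real_set)
    return new_count / len(rows)


# Toy data (made up): two of the four synthetic rows also exist in the real data.
real = [(1, "a"), (2, "b"), (3, "c")]
synthetic = [(1, "a"), (4, "d"), (5, "e"), (2, "b")]

full_score = new_row_fraction(real, synthetic)  # evaluates all 4 rows
sampled_score = new_row_fraction(real, synthetic, synthetic_sample_size=2)  # evaluates 2 rows
```

With all rows checked, the score is exact for the given data. With a sample of 2, the result depends on which rows are drawn, so repeated runs with different seeds can disagree. That is the accuracy cost paid for the shorter runtime.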

npatki avatar Mar 13 '23 22:03 npatki

Hi again,

I don't have any heuristic formula for it. I just wanted to test it. Thanks for letting me know about the trade-off. For now, I might test the synthetic dataset the way you said, outside the report.

Thank you!

echatzikyriakidis avatar Mar 13 '23 23:03 echatzikyriakidis