SDMetrics
I want to be able to modify synthetic_sample_size in the diagnostic report (both single-table and multi-table).
Problem Description
I cannot override the synthetic sample size that the diagnostic report uses for the NewRowSynthesis metric, in either the single-table or the multi-table report. Currently I override it like this, which is not a good solution:
single_table_diagnostic_report._metric_args['NewRowSynthesis']['synthetic_sample_size'] = np.inf
It would be great if there were a parameter for this in the DiagnosticReport() classes of the single_table and multi_table packages.
Hi @echatzikyriakidis, nice to meet you and thanks for filing the issue!
I'm curious how you plan to use such a feature and how it would help you create a better DiagnosticReport. From the example, it seems like you want to set this value to infinity -- how would this help you? E.g., is there something particular to your data that requires this?
Hi again @npatki,
I am hoping to control the sample size of NewRowSynthesis in DiagnosticReport, maybe through a constructor parameter. For example, I would sometimes like to set it to None for the default operation (to check all rows), and at other times limit the sample size myself.
I want to see if my synthetic dataset has novel rows or not.
Hi @echatzikyriakidis, in the NewRowSynthesis metric, there is a tradeoff between the amount of data that we use (synthetic_sample_size) and the amount of time it takes to run the report. For very large datasets, it may be infeasible to use the full dataset.
I'm curious how you are determining when to use None and when to use another parameter. How do you decide what number to set it to?
Note that you may always apply the NewRowSynthesis metric yourself outside of the report, if you'd like to play around with this parameter.
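To make the tradeoff above concrete, here is a minimal standard-library sketch of the idea behind a sampled "new row" check. It is not SDMetrics code: the function name new_row_fraction, its seed argument, and the exact-match comparison are assumptions made for illustration, and the real NewRowSynthesis metric may match rows differently (for example, with a tolerance on numerical values).

```python
import random


def new_row_fraction(real_rows, synthetic_rows, synthetic_sample_size=None, seed=0):
    """Return the fraction of checked synthetic rows that are absent from real_rows.

    Hypothetical helper for illustration only; rows are compared by exact equality.
    """
    real_set = set(real_rows)
    rows = list(synthetic_rows)

    # Subsample only when a finite size smaller than the dataset is requested.
    # None (or a size >= the number of rows) means every synthetic row is checked.
    if synthetic_sample_size is not None and synthetic_sample_size < len(rows):
        rows = random.Random(seed).sample(rows, synthetic_sample_size)

    if not rows:
        raise ValueError("synthetic_rows must contain at least one row")

    new_rows = sum(1 for row in rows if row not in real_set)
    return new_rows / len(rows)


real = [(1, "a"), (2, "b"), (3, "c")]
synthetic = [(1, "a"), (4, "d"), (5, "e"), (2, "b")]

# Check every synthetic row: 2 of the 4 rows do not appear in the real data.
full_score = new_row_fraction(real, synthetic)

# Check a random subset of 2 rows: less work, but the score depends on the draw.
sampled_score = new_row_fraction(real, synthetic, synthetic_sample_size=2)
```

With the sample size left at None the score is computed over all rows, while a smaller sample trades accuracy of the estimate for less work, which is the same tension described above for very large datasets.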
Hi again,
I don't have any heuristic formula for it. I just wanted to test it. Thanks for letting me know about the trade-off. For now, I might test the synthetic dataset the way you said, outside the report.
Thank you!