splink
[FEAT] Additional argument to filter comparisons shown in comparison viewer dashboard
Is your proposal related to a problem?
The comparison viewer dashboard shows num_example_rows examples for every comparison vector present in df_predict. For sufficiently large datasets and complex models, the number of comparison vectors can become prohibitively large (I have an example where the dashboard is 1.4 GB with num_example_rows=2).
Currently, the only way to trim this down is to manipulate df_predict. That is easy if you only want to view comparisons with a match probability between, say, 0.5 and 0.999, but it is harder to show only the comparison vectors that appear more than N times. Either or both of these options would be helpful to include in the dashboard function.
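As a rough illustration of the manual workaround, the sketch below filters a pandas copy of df_predict before it is passed to the dashboard. It assumes the comparison vector is stored in columns whose names start with gamma_ and that a match_probability column is present; the helper name filter_rare_comparison_vectors is hypothetical and not part of splink.

```python
import pandas as pd


def filter_rare_comparison_vectors(
    df: pd.DataFrame, min_count: int = 1, gamma_prefix: str = "gamma_"
) -> pd.DataFrame:
    """Keep only rows whose comparison vector occurs at least min_count times.

    Hypothetical helper: assumes the comparison vector is spread across
    columns named with the given prefix (e.g. gamma_first_name).
    """
    gamma_cols = [c for c in df.columns if c.startswith(gamma_prefix)]
    if not gamma_cols:
        raise ValueError(f"no columns starting with {gamma_prefix!r} found")

    # Number of rows sharing each distinct combination of gamma values,
    # broadcast back onto every row of the original frame.
    vector_counts = df.groupby(gamma_cols)[gamma_cols[0]].transform("size")
    return df[vector_counts >= min_count]


def filter_by_match_probability(
    df: pd.DataFrame, lower: float, upper: float
) -> pd.DataFrame:
    """Keep rows with lower <= match_probability <= upper (assumed column name)."""
    return df[df["match_probability"].between(lower, upper)]
```

The two filters can be chained, for example filter_rare_comparison_vectors(filter_by_match_probability(df, 0.5, 0.999), min_count=100). Doing this on a pandas copy may not be practical for very large predictions, which is part of the motivation for having the option built into the dashboard function.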
Describe the solution you'd like
A min_count argument would be one way to keep the file size manageable. For example, min_count=100 would keep only the comparison vectors that appear at least 100 times.
    comparison_viewer_dashboard(
        df_predict,
        out_path,
        overwrite=False,
        num_example_rows=2,
        return_html_as_string=False,
        min_count=1
    )