Filtering with FilterDuplicates alters precision

Open lilyminium opened this issue 8 months ago • 1 comments

Describe the bug

FilterDuplicates returns the data frame with values rounded to the filter's precision instead of the original rows. This loses information and means anything fitted downstream may be fitted to incorrect data.

To Reproduce

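import pandas as pd
from openff.evaluator.datasets.curation.components import filtering
from openff.evaluator.datasets.curation.workflow import (
    CurationWorkflow,
    CurationWorkflowSchema,
)

df = pd.read_csv("dataset.csv")
filtered = CurationWorkflow.apply(
    df,
    CurationWorkflowSchema(
        component_schemas=[
            filtering.FilterDuplicatesSchema(
                mole_fraction_precision=2,
            ),
        ]
    )
)
# Fails: the filtered frame holds mole fractions rounded to 2 decimal places.
assert list(df["Mole Fraction 1"]) == list(filtered["Mole Fraction 1"])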

Additional context

dataset.csv

lilyminium • Mar 17 '25 06:03

The change below fixed it for me, but I won't have time to come up with an MWE test for a bit.

diff --git a/openff/evaluator/datasets/curation/components/filtering.py b/openff/evaluator/datasets/curation/components/filtering.py
index e82fb47..281be64 100644
--- a/openff/evaluator/datasets/curation/components/filtering.py
+++ b/openff/evaluator/datasets/curation/components/filtering.py
@@ -140,7 +140,11 @@ class FilterDuplicates(CurationComponent):
             filtered_data.append(sorted_filtered_data)

         filtered_data = pandas.concat(filtered_data, ignore_index=True, sort=False)
-        return filtered_data
+
+        original_filtered_data = data_frame[
+            data_frame["Id"].isin(filtered_data["Id"])
+        ]
+        return original_filtered_data
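
For readers without the evaluator installed, the idea behind this diff can be sketched with pandas alone: round a copy for duplicate detection, then select the surviving rows from the original, unrounded frame by "Id". This is only an illustration under stated assumptions; `drop_duplicates_keep_precision` is a hypothetical helper rather than anything in openff-evaluator, the data is made up, and the real component's rules for choosing which duplicate to keep are not reproduced here.

```python
import pandas as pd


def drop_duplicates_keep_precision(
    data_frame: pd.DataFrame, subset: list[str], precision: int
) -> pd.DataFrame:
    """Detect duplicates on rounded values, but return the unrounded rows.

    Hypothetical helper for illustration only; not part of openff-evaluator.
    """
    # Round a copy so the caller's data is never modified.
    rounded = data_frame.copy()
    rounded[subset] = rounded[subset].round(precision)

    # Duplicate detection happens on the rounded values only
    # (the first row of each duplicate group is kept).
    kept_ids = rounded.drop_duplicates(subset=subset)["Id"]

    # Select the surviving rows from the original, unrounded frame.
    return data_frame[data_frame["Id"].isin(kept_ids)]


# Made-up data: rows "a" and "c" collide once rounded to 2 decimal places.
df = pd.DataFrame(
    {
        "Id": ["a", "b", "c"],
        "Mole Fraction 1": [0.7912, 0.2112, 0.7914],
    }
)
result = drop_duplicates_keep_precision(df, ["Mole Fraction 1"], precision=2)
print(result)
```

Selecting with `isin` keeps the original row order and index, so callers that expect a fresh 0..n-1 index would still need `reset_index(drop=True)` afterwards.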

lilyminium • Mar 17 '25 06:03

Unless I'm missing something obvious, you already have an MWE in place?

Tidied up a tiny bit for lazier copy-paste:

In [1]: from openff.evaluator.datasets.curation.components import filtering
   ...: from openff.evaluator.datasets.curation.workflow import (
   ...:     CurationWorkflowSchema,
   ...:     CurationWorkflow,
   ...: )
   ...: import pandas as pd
   ...:
   ...: df = pd.read_csv("dataset.csv")
   ...: filtered = CurationWorkflow.apply(
   ...:     df,
   ...:     CurationWorkflowSchema(
   ...:         component_schemas=[
   ...:             filtering.FilterDuplicatesSchema(
   ...:                 mole_fraction_precision=2,
   ...:             ),
   ...:         ]
   ...:     ),
   ...: )
   ...: assert list(df["Mole Fraction 1"]) == list(filtered["Mole Fraction 1"]), (
   ...:     df["Mole Fraction 1"],
   ...:     filtered["Mole Fraction 1"],
   ...: )
   ...:
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[1], line 19
      8 df = pd.read_csv("dataset.csv")
      9 filtered = CurationWorkflow.apply(
     10     df,
     11     CurationWorkflowSchema(
   (...)
     17     ),
     18 )
---> 19 assert list(df["Mole Fraction 1"]) == list(filtered["Mole Fraction 1"]), (
     20     df["Mole Fraction 1"],
     21     filtered["Mole Fraction 1"],
     22 )

AssertionError: (0    0.7912
1    0.2112
Name: Mole Fraction 1, dtype: float64, 0    0.79
1    0.21
Name: Mole Fraction 1, dtype: float64)

mattwthompson • Apr 22 '25 15:04