openff-evaluator
Filtering with FilterDuplicates alters precision
Describe the bug
FilterDuplicates returns the rounded data frame rather than the original rows. This loses precision, so anything fitted to the filtered data may be fitted to incorrect values.
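A minimal sketch of the intended behaviour, in plain Python rather than the library's own code: the rounded value is used only as a comparison key, and the rows that survive keep their original precision. The function name `filter_duplicates` and the row layout are hypothetical, chosen for illustration.

```python
def filter_duplicates(rows, precision=2):
    """Keep the first row for each rounded mole fraction.

    Hypothetical helper for illustration only; not the openff-evaluator API.
    """
    seen = set()
    kept = []
    for row in rows:
        # The rounded value is only a lookup key; the row itself is untouched.
        key = round(row["Mole Fraction 1"], precision)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"Id": "a", "Mole Fraction 1": 0.7912},
    {"Id": "b", "Mole Fraction 1": 0.2112},
    {"Id": "c", "Mole Fraction 1": 0.7914},  # duplicate of "a" at 2 decimals
]
kept = filter_duplicates(rows)
print([(row["Id"], row["Mole Fraction 1"]) for row in kept])
```

Row "c" is dropped as a duplicate of "a", while "a" and "b" are returned with their unrounded mole fractions.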
To Reproduce
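import pandas as pd

from openff.evaluator.datasets.curation.components import filtering
from openff.evaluator.datasets.curation.workflow import (
    CurationWorkflow,
    CurationWorkflowSchema,
)

df = pd.read_csv("dataset.csv")
filtered = CurationWorkflow.apply(
    df,
    CurationWorkflowSchema(
        component_schemas=[
            filtering.FilterDuplicatesSchema(
                mole_fraction_precision=2,
            ),
        ]
    ),
)
# Fails: the filtered mole fractions come back rounded to 2 decimal places.
assert list(df["Mole Fraction 1"]) == list(filtered["Mole Fraction 1"])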
Output
Computing environment (please complete the following information):
- Operating system
- Output of running
conda list
Additional context
The change below fixed it for me, but I won't have time to come up with an MWE test for a bit.
diff --git a/openff/evaluator/datasets/curation/components/filtering.py b/openff/evaluator/datasets/curation/components/filtering.py
index e82fb47..281be64 100644
--- a/openff/evaluator/datasets/curation/components/filtering.py
+++ b/openff/evaluator/datasets/curation/components/filtering.py
@@ -140,7 +140,11 @@ class FilterDuplicates(CurationComponent):
filtered_data.append(sorted_filtered_data)
filtered_data = pandas.concat(filtered_data, ignore_index=True, sort=False)
- return filtered_data
+
+ original_filtered_data = data_frame[
+ data_frame["Id"].isin(filtered_data["Id"])
+ ]
+ return original_filtered_data
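The idea behind the patch can be sketched on its own with pandas: deduplicate on a rounded copy, then use the surviving "Id" values to select rows from the untouched input. This is a simplified illustration, not the component's actual implementation. The function name is hypothetical, and it assumes every row has a unique "Id" and that only "Mole Fraction 1" is rounded.

```python
import pandas as pd


def filter_duplicates_keep_precision(data_frame, precision=2):
    """Hypothetical stand-in for the patched return path."""
    # Round a copy so the caller's values are never modified.
    rounded = data_frame.copy()
    rounded["Mole Fraction 1"] = rounded["Mole Fraction 1"].round(precision)

    # Deduplicate on the rounded values only.
    deduplicated = rounded.drop_duplicates(subset=["Mole Fraction 1"], keep="first")

    # Select the surviving rows from the original, unrounded frame.
    return data_frame[data_frame["Id"].isin(deduplicated["Id"])]


df = pd.DataFrame(
    {
        "Id": ["a", "b", "c"],
        "Mole Fraction 1": [0.7912, 0.2112, 0.7914],
    }
)
result = filter_duplicates_keep_precision(df)
print(result)
```

Note that selecting with `isin` returns rows in the order and with the index of the input frame, not of the deduplicated one, so any sorting applied during filtering is not carried over.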
Unless I'm missing something obvious, you already have an MWE in place?
Tidied up a tiny bit for lazier copy-paste:
In [1]: from openff.evaluator.datasets.curation.components import filtering
...: from openff.evaluator.datasets.curation.workflow import (
...: CurationWorkflowSchema,
...: CurationWorkflow,
...: )
...: import pandas as pd
...:
...: df = pd.read_csv("dataset.csv")
...: filtered = CurationWorkflow.apply(
...: df,
...: CurationWorkflowSchema(
...: component_schemas=[
...: filtering.FilterDuplicatesSchema(
...: mole_fraction_precision=2,
...: ),
...: ]
...: ),
...: )
...: assert list(df["Mole Fraction 1"]) == list(filtered["Mole Fraction 1"]), (
...: df["Mole Fraction 1"],
...: filtered["Mole Fraction 1"],
...: )
...:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[1], line 19
8 df = pd.read_csv("dataset.csv")
9 filtered = CurationWorkflow.apply(
10 df,
11 CurationWorkflowSchema(
(...)
17 ),
18 )
---> 19 assert list(df["Mole Fraction 1"]) == list(filtered["Mole Fraction 1"]), (
20 df["Mole Fraction 1"],
21 filtered["Mole Fraction 1"],
22 )
AssertionError: (0 0.7912
1 0.2112
Name: Mole Fraction 1, dtype: float64, 0 0.79
1 0.21
Name: Mole Fraction 1, dtype: float64)