openff-evaluator
Getting rid of data point if thermophysical data is not included
Is your feature request related to a problem? Please describe. When using ThermoML DOIs as input data for filtering in evaluator, some entries have no values for pressure or temperature. Because evaluator expects these thermodynamic properties, loading and/or filtering the data raises an error. The error arises because every pressure value (for example) in every row is converted into a physical property object, and the code breaks when a value is missing.
Describe the solution you'd like It would be better if evaluator either removed data points with incomplete thermodynamic data automatically before the code breaks, or accepted them with a warning.
Describe alternatives you've considered I manually removed the data points without complete thermodynamic data by using dropna().
Additional context I have attached an input JSON file (sorted_dois.json) to this issue.
Here is example Python code to reproduce the error:
import pandas as pd
import json
from pathlib import Path
from openff.evaluator.datasets import PhysicalProperty, PropertyPhase
from openff.evaluator.datasets.thermoml import thermoml_property
from openff.evaluator import properties
from openff.units import unit
from openff.evaluator.datasets.thermoml import ThermoMLDataSet


@thermoml_property("Osmotic coefficient", supported_phases=PropertyPhase.Liquid | PropertyPhase.Gas)
class OsmoticCoefficient(PhysicalProperty):
    """A class representation of an osmotic coefficient property"""

    @classmethod
    def default_unit(cls):
        return unit.dimensionless


setattr(properties, OsmoticCoefficient.__name__, OsmoticCoefficient)

CACHED_PROP_PATH = Path('osmotic_data.csv')

if CACHED_PROP_PATH.exists():
    prop_df = pd.read_csv(CACHED_PROP_PATH, index_col=0)
    ## delete rows with undefined thermodynamic parameters to avoid indexing errors
    # prop_df = prop_df.dropna(subset=['Temperature (K)'])
    # prop_df = prop_df.dropna(subset=['Pressure (kPa)'])
    data_set = ThermoMLDataSet.from_pandas(prop_df)
else:
    with open('sorted_dois.json') as f:
        doi_dat = json.load(f)
    data_set = ThermoMLDataSet.from_doi(*doi_dat['working'])
    prop_df = data_set.to_pandas()
    # Cache the parsed data so later runs skip the slow DOI download
    prop_df.to_csv(CACHED_PROP_PATH)
I ran into this too. It would be convenient for this to happen automatically in from_pandas.
@barmoral Thanks for providing a reproduction I can easily get started on. How long does this script take to run, though? It's been a few minutes (probably just fetching the data?) and I want to make sure that's not surprising
Okay, it finished. I was just a little impatient.
Which columns should determine whether a row gets dropped? This dataframe has plenty of missing pressure data, but no missing temperature or phase data. Some other columns are always missing, so we can't just call .dropna() without arguments:
In [25]: prop_df.isnull().sum()
Out[25]:
Id                                      0
Temperature (K)                         0
Pressure (kPa)                       1957
Phase                                   0
N Components                            0
Component 1                             0
Role 1                                  0
Mole Fraction 1                         0
Exact Amount 1                       5347
Component 2                            20
Role 2                                 20
Mole Fraction 2                        20
Exact Amount 2                       5347
Component 3                          3562
Role 3                               3562
Mole Fraction 3                      3562
Exact Amount 3                       5347
Density Value (g / ml)               4741
Density Uncertainty (g / ml)         4741
OsmoticCoefficient Value ()           606
OsmoticCoefficient Uncertainty ()     606
Source                                  0
dtype: int64
In [34]: prop_df.dropna()
Out[34]:
Empty DataFrame
Columns: [Id, Temperature (K), Pressure (kPa), Phase, N Components, Component 1, Role 1, Mole Fraction 1, Exact Amount 1, Component 2, Role 2, Mole Fraction 2, Exact Amount 2, Component 3, Role 3, Mole Fraction 3, Exact Amount 3, Density Value (g / ml), Density Uncertainty (g / ml), OsmoticCoefficient Value (), OsmoticCoefficient Uncertainty (), Source]
Index: []
My guess is we want to consider pressure, temperature, and phase. For this data, that strips out a sizable minority of the dataset (1957 of 5347 rows, all because of missing pressure):
In [33]: prop_df.describe(), prop_df.dropna(subset=['Pressure (kPa)', 'Temperature (K)', 'Phase']).describe()
Out[33]:
( Temperature (K) Pressure (kPa) N Components ... Density Uncertainty (g / ml) OsmoticCoefficient Value () OsmoticCoefficient Uncertainty ()
count 5347.000000 3390.000000 5347.000000 ... 606.000000 4741.000000 4741.000000
mean 305.864640 98.806342 2.330092 ... 0.001153 0.819905 0.028031
std 12.173756 5.088289 0.478179 ... 0.000988 0.327787 0.139206
min 273.000000 84.500000 1.000000 ... 0.000034 0.146000 0.000050
25% 298.150000 101.000000 2.000000 ... 0.000227 0.690000 0.005000
50% 298.150000 101.000000 2.000000 ... 0.001505 0.827700 0.006500
75% 313.150000 101.000000 3.000000 ... 0.001875 0.929000 0.010000
max 353.150000 101.325000 3.000000 ... 0.011740 6.362000 1.000000
[8 rows x 12 columns],
Temperature (K) Pressure (kPa) N Components ... Density Uncertainty (g / ml) OsmoticCoefficient Value () OsmoticCoefficient Uncertainty ()
count 3390.000000 3390.000000 3390.000000 ... 606.000000 2784.000000 2784.000000
mean 305.332354 98.806342 2.410324 ... 0.001153 0.862405 0.041581
std 12.601643 5.088289 0.497334 ... 0.000988 0.390842 0.180353
min 273.000000 84.500000 1.000000 ... 0.000034 0.146000 0.000050
25% 298.150000 101.000000 2.000000 ... 0.000227 0.716000 0.005000
50% 298.150000 101.000000 2.000000 ... 0.001505 0.859000 0.006000
75% 313.150000 101.000000 3.000000 ... 0.001875 0.944000 0.008500
max 353.150000 101.325000 3.000000 ... 0.011740 6.362000 1.000000
[8 rows x 12 columns])
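To make the difference between the two calls concrete, here is a minimal sketch on a made-up frame that borrows the column names from the output above. The values are invented for illustration and `STATE_COLUMNS` is a hypothetical name, not something defined by evaluator.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout above: "Exact Amount 1" is always empty
# and each row reports only one of the two property values.
toy_df = pd.DataFrame(
    {
        "Temperature (K)": [298.15, 298.15, 313.15, 313.15],
        "Pressure (kPa)": [101.0, np.nan, 101.325, np.nan],
        "Phase": ["Liquid", "Liquid", "Liquid", "Liquid"],
        "Exact Amount 1": [np.nan, np.nan, np.nan, np.nan],
        "Density Value (g / ml)": [0.997, np.nan, np.nan, 0.992],
        "OsmoticCoefficient Value ()": [np.nan, 0.93, 0.91, np.nan],
    }
)

# Hypothetical name for the columns that describe the thermodynamic state.
STATE_COLUMNS = ["Temperature (K)", "Pressure (kPa)", "Phase"]

# Without arguments, every row is dropped because some column is always NaN.
all_dropped = toy_df.dropna()

# Restricting to the state columns only drops rows without a pressure.
state_filtered = toy_df.dropna(subset=STATE_COLUMNS)

print(len(all_dropped), len(state_filtered))
```

Rows that lack a density or an osmotic coefficient survive the second call as long as their temperature, pressure, and phase are all present.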
But I wonder if you also want rows stripped out when density, osmotic coefficient, or other property values are missing?
@mattwthompson Thanks for checking this out! No, I don't mind if density or osmotic coefficients are missing. If possible, it would be helpful for the code to run even when some data is missing and to use the data that is actually there, instead of deleting the whole data point. If that is not possible, it could at least report which data points are missing data and will therefore be thrown out when filtering for a specific property.
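As a stopgap on the user side, the reporting idea could be sketched in plain pandas before the frame is handed to from_pandas. This is only an illustration under assumptions: `split_incomplete_states` and `STATE_COLUMNS` are hypothetical names, not part of evaluator, and the choice of required columns follows the guess made earlier in this thread.

```python
import warnings

import pandas as pd

# Hypothetical default: the columns that define the thermodynamic state.
STATE_COLUMNS = ("Temperature (K)", "Pressure (kPa)", "Phase")


def split_incomplete_states(frame, required=STATE_COLUMNS):
    """Return (complete, incomplete) rows, warning if any rows are incomplete."""
    columns = list(required)
    missing_mask = frame[columns].isna().any(axis=1)
    incomplete = frame[missing_mask]

    if not incomplete.empty:
        # Count the missing entries per required column for the message.
        per_column = incomplete[columns].isna().sum()
        summary = ", ".join(
            f"{name}: {count}" for name, count in per_column.items() if count
        )
        warnings.warn(
            f"Dropping {len(incomplete)} of {len(frame)} rows with missing "
            f"state data ({summary})",
            UserWarning,
            stacklevel=2,
        )

    return frame[~missing_mask], incomplete


# Made-up example data; the "Id" values are placeholders.
demo = pd.DataFrame(
    {
        "Id": ["a", "b", "c"],
        "Temperature (K)": [298.15, 298.15, 313.15],
        "Pressure (kPa)": [101.0, None, 101.325],
        "Phase": ["Liquid", "Liquid", "Liquid"],
    }
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    complete, incomplete = split_incomplete_states(demo)

print(list(complete["Id"]), list(incomplete["Id"]))
```

Returning the incomplete rows alongside the complete ones keeps the discarded data points inspectable, so nothing disappears silently.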