openff-evaluator

Add `tidy` keyword to to_pandas?

Open lilyminium opened this issue 3 years ago • 0 comments

I was surprised that `.to_pandas` converts to a wide format in which each property type gets its own columns with an imposed unit. I would have found it more intuitive to convert to a tidier (long) format, i.e.

Instead of:

Index(['Id', 'Temperature (K)', 'Pressure (kPa)', 'Phase', 'N Components',
       'Component 1', 'Role 1', 'Mole Fraction 1', 'Exact Amount 1',
       'Component 2', 'Role 2', 'Mole Fraction 2', 'Exact Amount 2',
       'SolvationFreeEnergy Value (kJ / mol)',
       'SolvationFreeEnergy Uncertainty (kJ / mol)', 'Source'],
      dtype='object')

You could have:

Index(['Id', 'Temperature (K)', 'Pressure (kPa)', 'Phase', 'N Components',
       'Component 1', 'Role 1', 'Mole Fraction 1', 'Exact Amount 1',
       'Component 2', 'Role 2', 'Mole Fraction 2', 'Exact Amount 2',
       'Property type', 'Value', 'Value unit', 'Uncertainty', 'Uncertainty unit', 'Source'],
      dtype='object')

This would be more memory-efficient (edit: for mixed datasets), since you would no longer have NaNs taking up a bunch of space, and it would help with filtering by property type. When working directly with the dataframe, it would be much easier to see how many of each property type you have and to group by it.
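To make the idea concrete, here is a rough sketch of the reshaping using plain pandas. It is not part of openff-evaluator: `to_tidy` is a hypothetical helper, the regex assumes the wide column names always look like `<PropertyType> Value (<unit>)` and `<PropertyType> Uncertainty (<unit>)`, and the toy data (including the `Density` columns and their unit string) is made up purely for illustration.

```python
import re

import pandas as pd

# Assumed naming pattern for wide-format property columns,
# e.g. "SolvationFreeEnergy Value (kJ / mol)".
_PROPERTY_COLUMN = re.compile(
    r"^(?P<type>\w+) (?P<kind>Value|Uncertainty) \((?P<unit>.+)\)$"
)


def to_tidy(wide: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per measured property instead of one column per type."""
    matches = {col: _PROPERTY_COLUMN.match(col) for col in wide.columns}
    # Columns that are not property values/uncertainties are shared by every row.
    shared = [col for col, m in matches.items() if m is None]

    # Collect the value / uncertainty column (and its unit) for each property type.
    by_type = {}
    for col, m in matches.items():
        if m is not None:
            by_type.setdefault(m["type"], {})[m["kind"]] = (col, m["unit"])

    frames = []
    for prop_type, cols in by_type.items():
        value_col, value_unit = cols["Value"]
        # Keep only the rows that actually hold this property (drops the NaN padding).
        rows = wide.loc[wide[value_col].notna(), shared].copy()
        rows["Property type"] = prop_type
        rows["Value"] = wide.loc[rows.index, value_col]
        rows["Value unit"] = value_unit
        if "Uncertainty" in cols:
            unc_col, unc_unit = cols["Uncertainty"]
            rows["Uncertainty"] = wide.loc[rows.index, unc_col]
            rows["Uncertainty unit"] = unc_unit
        else:
            rows["Uncertainty"] = float("nan")
            rows["Uncertainty unit"] = None
        frames.append(rows)

    return pd.concat(frames, ignore_index=True)


# Made-up mixed dataset in the current wide layout (values are illustrative only).
wide = pd.DataFrame(
    {
        "Id": ["a", "b", "c"],
        "Temperature (K)": [298.15, 298.15, 308.15],
        "Density Value (g / ml)": [0.997, None, 0.994],
        "Density Uncertainty (g / ml)": [0.001, None, 0.001],
        "SolvationFreeEnergy Value (kJ / mol)": [None, -26.4, None],
        "SolvationFreeEnergy Uncertainty (kJ / mol)": [None, 0.8, None],
        "Source": ["x", "y", "z"],
    }
)

tidy = to_tidy(wide)
# Counting and grouping by property type becomes a one-liner.
counts = tidy["Property type"].value_counts()
```

In this sketch the shared columns keep their original order and the five property columns are appended at the end, so `Source` ends up before `Property type` rather than last as in the column list above; a real `tidy=True` option could reorder them.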

lilyminium, Nov 16 '21 17:11