openff-evaluator icon indicating copy to clipboard operation
openff-evaluator copied to clipboard

Filtering w/o Pandas?

Open lilyminium opened this issue 3 years ago • 3 comments

Have you considered filtering data sets without converting to Pandas under the hood? It can be difficult to hold both in memory at the same time, especially as the dataframe is wide format. This would also allow for progress bars. My intuition is that creating an index mask array from the properties directly and creating a new dataset from that will be substantially faster than converting to and from pandas.

Edit: happy to implement this and benchmark it, but I'm not sure how many users are relying on using dataframes for filtering.

Edit: It's also kind of odd that the filters don't take units, given the emphasis placed on units elsewhere by OpenFF. While it's clearly because of the imposed units from the dataframe, it's also not clear at all if you start the filtering wiht a dataset. e.g.

pressure_filter = FilterByPressure.apply(
    data_set,
    FilterByPressureSchema(minimum_pressure=101.224, maximum_pressure=101.426)
)

lilyminium avatar Nov 16 '21 17:11 lilyminium

Using this function from StackOverflow (code below), a dataframe of a mixed data set is about twice the size of the actual data set (the data set in question is deduplicated ThermoML, so 704447 properties).

>>> df = data_set.to_pandas()
>>> get_obj_size(data_set)
2940669776
>>> get_obj_size(df)
6402499431

Code:

import gc
import sys

def get_obj_size(obj):
    marked = {id(obj)}
    obj_q = [obj]
    sz = 0

    while obj_q:
        sz += sum(map(sys.getsizeof, obj_q))

        # Lookup all the object referred to by the object in obj_q.
        # See: https://docs.python.org/3.7/library/gc.html#gc.get_referents
        all_refr = ((id(o), o) for o in gc.get_referents(*obj_q))

        # Filter object that are already marked.
        # Using dict notation will prevent repeated objects.
        new_refr = {o_id: o for o_id, o in all_refr if o_id not in marked and not isinstance(o, type)}

        # The new obj_q will be the ones that were not marked,
        # and we will update marked with their ids so we will
        # not traverse them again.
        obj_q = new_refr.values()
        marked.update(new_refr.keys())

    return sz

lilyminium avatar Nov 16 '21 18:11 lilyminium

I thought the disparity would be smaller with a one-property-type data set, but the dataframe number is suspiciously close to the one for all of ThermoML. There are only 2912 properties in this data set. Hmmmm.

>>> get_obj_size(solvation_data)
13313813
>>> get_obj_size(solvation_data_df)
6402502040

lilyminium avatar Nov 16 '21 18:11 lilyminium

Ref #506 (though I'm not caught up enough to understand this one)

mattwthompson avatar Apr 10 '23 17:04 mattwthompson