
DataFrame fast search

Open alzmcr opened this issue 9 years ago • 15 comments

It would be really cool if a way to quickly find and retrieve rows from a DataFrame were implemented. Currently I sort a field first and then use the "searchsorted" function in order to get the index of the selection... but this is limited to one field only and still doesn't handle multiple searches. It would be really similar to the "query" functionality, but it would take advantage of some sort of "internal" precomputed index in order to reduce latency.
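
A minimal sketch of that current workflow (the data here is illustrative), assuming the searched column has been sorted up front:

import numpy as np
import pandas as pd

# sort once up front so that searchsorted's binary search applies
df = pd.DataFrame({'A': np.random.randint(0, 100, 10000)})
df = df.sort_values('A').reset_index(drop=True)

# locate the row range for one value in O(log n) instead of scanning
lo = df['A'].searchsorted(42, side='left')
hi = df['A'].searchsorted(42, side='right')
rows = df.iloc[lo:hi]  # all rows where A == 42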

alzmcr avatar Sep 08 '15 16:09 alzmcr

pls show an example. This generally is just an indexer, e.g.

In [23]: np.random.seed(1234)

In [24]: df = DataFrame({'A' : np.random.randn(10)})

In [25]: df[df.A>0.5]
Out[25]: 
          A
2  1.432707
5  0.887163
6  0.859588

In [26]: indexer = df>0.5

In [27]: indexer
Out[27]: 
       A
0  False
1  False
2   True
3  False
4  False
5   True
6   True
7  False
8  False
9  False

In [28]: df.loc[indexer.A]
Out[28]: 
          A
2  1.432707
5  0.887163
6  0.859588

jreback avatar Sep 08 '15 16:09 jreback

import itertools

import numpy as np
import pandas as pd

def search_sorted_subsets_field(df, field, keywords):
    # assumes df is sorted by `field`: binary-search the row range per keyword
    subsets = []
    for keyword in keywords:
        left_bound = df[field].searchsorted(keyword, side='left')
        right_bound = df[field].searchsorted(keyword, side='right')
        if (right_bound - left_bound) > 0:
            subsets.append(df[left_bound:right_bound])
    return subsets

def search_sorted_subsets(df, searches):
    subsets = [df]
    for field, values in searches:
        subsets = list(itertools.chain(*[search_sorted_subsets_field(subset, field, values)
                                         for subset in subsets]))
    return pd.DataFrame(np.vstack(subsets) if len(subsets) > 0 else None,
                        columns=df.columns).infer_objects()

searches = [
    ['A', [1300, 1400]],
    ['B', [2300]]
]

df = pd.DataFrame(np.array([range(1000, 2000), range(2000, 3000)]).T, columns=['A', 'B'])

print(search_sorted_subsets(df, searches))

alzmcr avatar Sep 08 '15 17:09 alzmcr

why wouldn't you simply write a boolean expression? you can even do this dynamically, see here

In [12]: df[(df.A>=1300)&(df.A<1400)&(df.B==2300)]
Out[12]: 
        A     B
300  1300  2300
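
For instance, a minimal sketch of building the expression dynamically via DataFrame.query (the conditions list is illustrative):

import pandas as pd

df = pd.DataFrame({'A': range(1000, 2000), 'B': range(2000, 3000)})

# build the expression string from conditions known only at runtime;
# each clause is parenthesized because & binds tighter than comparisons
conditions = [('A', '>=', 1300), ('A', '<', 1400), ('B', '==', 2300)]
expr = ' & '.join('({} {} {})'.format(col, op, val) for col, op, val in conditions)
df.query(expr)  # same rows as the chained boolean mask above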

furthermore your example above discards the index

jreback avatar Sep 08 '15 17:09 jreback

Sorry - I was in a rush yesterday, so I couldn't explain in detail.

If you increase the size of the DataFrame you'll start noticing a significant difference in performance.

In [7]: df = pd.DataFrame(np.array([range(1000,299000),range(2000,300000)]).T,columns=['A','B'])

In [8]: %timeit _=df[(df.A>=1300)&(df.A<1400)&(df.B==2300)]
100 loops, best of 3: 12.9 ms per loop

In [9]: %timeit _=search_sorted_subsets(df, searches)
1000 loops, best of 3: 1.48 ms per loop

And if you keep increasing the length of the DataFrame, the difference gets even bigger. That's why this topic is called "fast search", not just search! I hope this makes the point :)

Using bitwise operations between long series of booleans can get expensive, and the above example is the proof. I haven't found a way to take advantage of a precomputed index to navigate and search terms in a DataFrame. You can sort the field you'll be searching, as I did, in order to create a "rudimentary" sorted index... but I guess it would be a very interesting feature to have in pandas.

I currently use pandas to store a seriously huge DataFrame in memory, which I then need to query multiple times in order to get very few rows out of it, filtering on a number of columns!

alzmcr avatar Sep 09 '15 08:09 alzmcr

Well you ought to look here: http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers.

You are assuming things are sorted, and that they are monotonic increasing; these properties are relatively expensive to compute.

However, in general you cannot assume sortedness, so in this case using searchsorted is a good idea.

The slicers can take advantage of this and will scale much better than boolean indexing.
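
For example, a sketch along the lines of the linked docs (the frame here is illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((100000, 3)).round(2), columns=['A', 'B', 'C'])

# sort once; a lexsorted MultiIndex supports fast per-level range slicing
sorted_df = df.set_index(['A', 'B']).sort_index()

idx = pd.IndexSlice
subset = sorted_df.loc[idx[0.1:0.3, 0.1:0.3], :]  # range query on both levels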

jreback avatar Sep 09 '15 11:09 jreback

I wasn't aware of the slicers and how to use them - thanks! Although even using slicers and taking advantage of sorted columns is not as fast as using "precomputed indices".

I agree it's an expensive procedure, but it would be very beneficial when you need to query the same DataFrame many, many times.

In [25]: %timeit _=search_with_index(data, searches)
100 loops, best of 3: 6.43 ms per loop

In [26]: %timeit _=data1.loc[(slice(0.1,0.3),slice(0.1,0.3)),:]
10 loops, best of 3: 22.5 ms per loop

Up to about 3x faster in this case, with another rough and quick index implementation. So I wonder what could be done with some better algorithms at some point! :)

import itertools

import numpy as np
import pandas as pd

def build_index(df, columns=None):
    if columns is None:
        columns = df.columns.tolist()
    # initialize the index dictionary if not present
    if not hasattr(df, 'adhoc_indexes'):
        df.adhoc_indexes = {}
    for col in columns:
        index = {}
        for val in df[col].unique():
            index[val] = df[df[col] == val].index.tolist()
        df.adhoc_indexes[col] = index

def search_with_index(df, searches):
    idx = []
    for col, values in searches:
        # if no index exists yet for the current column, build it on the fly
        if (not hasattr(df, 'adhoc_indexes')) or (col not in df.adhoc_indexes):
            build_index(df, columns=[col])
        # collect the index labels matching any of the requested values
        idx.append(
            set(itertools.chain(*[df.adhoc_indexes[col][val] for val in values]))
        )
    # intersect the label sets and return the requested rows; the stored
    # values are index labels, so .loc (not .iloc) is the right lookup
    return df.loc[list(set.intersection(*idx))]

data = np.random.random((1000000,5)).round(2)
data = pd.DataFrame(data,columns=[col for col in 'ABCDFGHILK'[:data.shape[1]]])
data1 = data.copy().set_index(['A','B']).sort_index()

#searches = [['A',[0.1,0.3]]]
searches = [['A',[0.1,0.3]],['B',[0.1,0.3]]]

# build index
build_index(data, ['A','B'])

# timings
%timeit _=search_with_index(data, searches)

%timeit _=data1.loc[(slice(0.1,0.3),slice(0.1,0.3)),:]

alzmcr avatar Sep 09 '15 12:09 alzmcr

You are just caching lookups; frames already do this in a quite general way.

In [19]: %time build_index(data, ['A','B'])
CPU times: user 852 ms, sys: 30.4 ms, total: 882 ms
Wall time: 882 ms

You can certainly poke around if you'd like to try to implement some speedups. But your use case is fairly narrow, and it's not clear what a useful API would look like.
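
For example, the hash-table engine behind label lookups is built lazily on first use and then reused (a sketch; numbers will vary):

import numpy as np
import pandas as pd

frame = pd.DataFrame({'key': np.random.permutation(1000000),
                      'val': np.random.randn(1000000)}).set_index('key')

%time _ = frame.loc[123456]     # first lookup builds the index engine
%timeit _ = frame.loc[123456]   # later lookups reuse the cached engine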

jreback avatar Sep 09 '15 12:09 jreback

Since I noticed this has been moved here: I ended up addressing this problem using machinery pandas already has for groupby, to speed everything up even more.

import itertools

import pandas as pd

def search_dataframe_index(df, params):
    fields = tuple(x[0] for x in params)
    values = [x[1] for x in params]

    if not hasattr(df, '_search_groups'):
        df._search_groups = dict()

    if fields not in df._search_groups:
        # build (and cache) the index for this tuple of fields by reusing
        # the groupby machinery; reset_index makes the group rows positional
        df._search_groups[fields] = df.reset_index().groupby(list(fields)).groups

    irows = []
    for i in itertools.product(*values):
        # .groups keys are scalars for a single field, tuples otherwise
        key = i if len(fields) > 1 else i[0]
        if key in df._search_groups[fields]:
            irows.append(df._search_groups[fields][key])

    return df.take(list(itertools.chain(*irows)))
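
For reference, a usage sketch with data shaped like my earlier example (params is a list of [field, values] pairs):

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.random((1000000, 5)).round(2),
                    columns=list('ABCDF'))
searches = [['A', [0.1, 0.3]], ['B', [0.1, 0.3]]]

# every combination of the values across the fields is looked up in the cache
result = search_dataframe_index(data, searches)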

# NEW METHOD - first time for caching
%time _=search_dataframe_index(data, searches)
CPU times: user 2.28 s, sys: 205 ms, total: 2.49 s
Wall time: 2.49 s

# NEW METHOD - average performance
%timeit _=search_dataframe_index(data, searches)
1000 loops, best of 3: 275 µs per loop

# PREVIOUS
%timeit _=search_with_index(data, searches)
100 loops, best of 3: 3.15 ms per loop

%timeit _=data1.loc[(slice(0.1,0.3),slice(0.1,0.3)),:]
10 loops, best of 3: 57.3 ms per loop

It's much faster when searching, but quite 'slow' when creating the 'index'. Still, in my production case it brought the processing time down from 1 hour to 5 minutes :)

alzmcr avatar Apr 13 '17 10:04 alzmcr

@alzmcr Did you put this change into a PR?

humford avatar Jun 29 '17 13:06 humford

@humford nope, but I could; I already have all the code. It's just a matter of deciding whether it's elegant enough to store additional elements for indexing within a DataFrame, and probably picking a better name for the function.

alzmcr avatar Jun 29 '17 13:06 alzmcr

@alzmcr With regard to the previous discussion, do you know where/how this might fit into the repo? It seems like this has been added to a release milestone, and you've documented a pretty remarkable speed-up, so I think a PR might allow a more focused discussion on implementation.

humford avatar Jun 29 '17 14:06 humford

@alzmcr Could you give an example with a data set? I am finding it hard to understand what the params parameter looks like.

shishirpy avatar Apr 23 '18 14:04 shishirpy

@shishirpy a full example is at https://github.com/pandas-dev/pandas/issues/11026#issuecomment-138889948 - it's confusing because in the example they're called searches, while in the function signature they're called params.

I would love to open a PR, but I'm struggling to find the time, and it would be my first one here!

alzmcr avatar Apr 25 '18 08:04 alzmcr

Remember that the index needs to be updated upon each mutation of its parent DataFrame/column.
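
A minimal sketch of the hazard, reusing search_dataframe_index from above (the cached groups are never invalidated):

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [10, 20, 30]})
_ = search_dataframe_index(df, [['A', [1]]])       # builds and caches the groups
df.loc[0, 'A'] = 2                                 # mutate the frame afterwards
stale = search_dataframe_index(df, [['A', [1]]])   # still returns row 0: stale cache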

kokes avatar Oct 02 '18 15:10 kokes

AFAICT this boils down to trying to optimize MultiIndex.get_locs. I haven't looked at it closely, but it's plausible that there are optimizations available in lexsorted cases by caching more aggressively.
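
For example (illustrative data):

import pandas as pd

mi = pd.MultiIndex.from_product([[0.1, 0.2, 0.3], [0.1, 0.2, 0.3]],
                                names=['A', 'B'])
# get_locs resolves a list of per-level slices to integer positions;
# it is the code path behind df.loc[(slice(...), slice(...)), :]
positions = mi.get_locs([slice(0.1, 0.2), slice(0.1, 0.2)])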

jbrockmendel avatar Apr 17 '20 03:04 jbrockmendel