pandas
DataFrame fast search
It would be really cool if pandas implemented a way to quickly find and retrieve rows from a DataFrame. Currently I sort a field first and then use the "searchsorted" function to get the index of the selection... but this is limited to one field only and still doesn't handle multiple searches. It would be really similar to the "query" functionality, but would take advantage of some sort of "internal" precomputed index in order to reduce latency.
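For reference, a minimal single-field sketch of the workflow described above (the column name and looked-up value are just illustrative, and this assumes a recent pandas where Series.searchsorted returns a scalar):

import numpy as np
import pandas as pd

# sort the search field once up front...
df = pd.DataFrame({'A': np.random.randint(0, 1000, 100000)}).sort_values('A').reset_index(drop=True)

# ...then binary-search it instead of scanning every row
value = 500
left = df['A'].searchsorted(value, side='left')
right = df['A'].searchsorted(value, side='right')
matches = df.iloc[left:right]  # all rows where A == 500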
pls show an example. This generally is just an indexer, e.g.
In [23]: np.random.seed(1234)
In [24]: df = DataFrame({'A' : np.random.randn(10)})
In [25]: df[df.A>0.5]
Out[25]:
A
2 1.432707
5 0.887163
6 0.859588
In [26]: indexer = df>0.5
In [27]: indexer
Out[27]:
A
0 False
1 False
2 True
3 False
4 False
5 True
6 True
7 False
8 False
9 False
In [28]: df.loc[indexer.A]
Out[28]:
A
2 1.432707
5 0.887163
6 0.859588
import pandas as pd
import numpy as np
import itertools

def search_sorted_subsets_field(df, field, keywords):
    subsets = []
    for keyword in keywords:
        left_bound = df[field].searchsorted(keyword, 'left')[0]
        right_bound = df[field].searchsorted(keyword, 'right')[0]
        if (right_bound - left_bound) > 0:
            subsets.append(df[left_bound:right_bound])
    return subsets

def search_sorted_subsets(df, searches):
    subsets = [df]
    for field, values in searches:
        subsets = list(itertools.chain(*[search_sorted_subsets_field(subset, field, values) for subset in subsets]))
    return pd.DataFrame(np.vstack(subsets) if len(subsets) > 0 else None, columns=df.columns).convert_objects()

searches = [
    ['A', [1300, 1400]],
    ['B', [2300]]
]

df = pd.DataFrame(np.array([range(1000, 2000), range(2000, 3000)]).T, columns=['A', 'B'])
print(search_sorted_subsets(df, searches))
why wouldn't you simply write a boolean expression? you can even do this dynamically, see here
In [12]: df[(df.A>=1300)&(df.A<1400)&(df.B==2300)]
Out[12]:
A B
300 1300 2300
furthermore your example above discards the index
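To illustrate the "dynamically" part, one way to build such a boolean mask from a list of per-column conditions might look roughly like this (a sketch, not from the original comment; the condition list is illustrative):

import functools
import numpy as np
import pandas as pd

df = pd.DataFrame(np.column_stack([np.arange(1000, 2000), np.arange(2000, 3000)]), columns=['A', 'B'])

# combine an arbitrary list of per-column conditions into a single mask
conditions = [df.A >= 1300, df.A < 1400, df.B == 2300]
mask = functools.reduce(lambda acc, cond: acc & cond, conditions)
result = df[mask]  # keeps the original index, unlike the vstack approach above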
Sorry - I was in a rush yesterday, so I couldn't explain in detail.
If you increase the size of the DataFrame you'll start noticing a significant change in performance.
In [7]: df = pd.DataFrame(np.array([range(1000,299000),range(2000,300000)]).T,columns=['A','B'])
In [8]: %timeit _=df[(df.A>=1300)&(df.A<1400)&(df.B==2300)]
100 loops, best of 3: 12.9 ms per loop
In [9]: %timeit _=search_sorted_subsets(df, searches)
1000 loops, best of 3: 1.48 ms per loop
And if you keep increasing the length of the DataFrame, the difference gets even bigger. That's why this topic is called "fast search", not just search! I hope this made the point :)
Using bitwise operations between long boolean Series can get expensive, and the example above is the proof. I haven't found a way to take advantage of a precomputed index to navigate and search terms in a DataFrame. You can sort the field you'll be searching on, as I did, in order to create a "rudimentary" sorted index... but I think this would be a very interesting feature to have in pandas.
I currently use pandas to store a seriously huge DataFrame in memory, which I then need to query multiple times in order to get very few rows out of it, filtering on a number of columns!
Well you ought to look here: http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers.
You are assuming things are sorted and that they are monotonic increasing; these things are relatively expensive to compute.
However, in general you cannot assume sortedness, so in this case using searchsorted is a good idea.
The slicers can take advantage of this and will scale much better than boolean indexing.
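For completeness, a minimal sketch of the slicer approach from the linked docs (the column names and bounds mirror the timings further down and are only illustrative):

import numpy as np
import pandas as pd

np.random.seed(0)
data = pd.DataFrame(np.random.random((100000, 3)).round(2), columns=['A', 'B', 'C'])

# a lexsorted MultiIndex lets .loc slice each level without a full boolean scan
indexed = data.set_index(['A', 'B']).sort_index()
idx = pd.IndexSlice
subset = indexed.loc[idx[0.1:0.3, 0.1:0.3], :]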
I wasn't aware of slicers and how to use them - thanks! Although even using slicers and taking advantage of sorted columns is not as fast as using "precomputed indices".
I agree it's an expensive procedure, but it would be very beneficial when you need to query the same DataFrame many, many times.
In [25]: %timeit _=search_with_index(data, searches)
100 loops, best of 3: 6.43 ms per loop
In [26]: %timeit _=data1.loc[(slice(0.1,0.3),slice(0.1,0.3)),:]
10 loops, best of 3: 22.5 ms per loop
Up to 3x faster in this case, with another rough-and-quick index implementation. So maybe this could be implemented with some better algorithms at some point! :)
import itertools
import numpy as np
import pandas as pd

def build_index(df, columns=None):
    if columns is None: columns = df.columns.tolist()
    # initialize index dictionary if not present
    if (not hasattr(df, 'adhoc_indexes')): df.adhoc_indexes = {}
    for col in columns:
        index = {}
        for val in df[col].unique():
            index[val] = df[df[col] == val].index.tolist()
        df.adhoc_indexes[col] = index

def search_with_index(df, searches):
    idx = []
    for col, values in searches:
        # check if an index for the current column exists; if it doesn't, create it on the fly
        if (not hasattr(df, 'adhoc_indexes')) or (col not in df.adhoc_indexes):
            build_index(df, columns=[col])
        # get indices for current column
        idx.append(
            set(itertools.chain(*[df.adhoc_indexes[col][val] for val in values]))
        )
    # get the intersection of the indices and return the requested rows from the dataframe
    return df.iloc[list(set.intersection(*idx))]

data = np.random.random((1000000, 5)).round(2)
data = pd.DataFrame(data, columns=[col for col in 'ABCDFGHILK'[:data.shape[1]]])
data1 = data.copy().set_index(['A', 'B']).sort_index()
#searches = [['A',[0.1,0.3]]]
searches = [['A', [0.1, 0.3]], ['B', [0.1, 0.3]]]
# build index
build_index(data, ['A', 'B'])
# timings
%timeit _=search_with_index(data, searches)
%timeit _=data1.loc[(slice(0.1,0.3),slice(0.1,0.3)),:]
You are just caching lookups; frames already do this in a quite general way.
In [19]: %time build_index(data, ['A','B'])
CPU times: user 852 ms, sys: 30.4 ms, total: 882 ms
Wall time: 882 ms
You can certainly poke around if you'd like to try to implement some speedups. But your use case is fairly narrow, and it's not clear what a useful API would look like.
Since I've noticed this has been moved here: I ended up addressing this problem using some machinery pandas already has for groupby, to speed everything up even more.
def search_dataframe_index(df, params):
    fields = tuple([x[0] for x in params])
    values = [x[1] for x in params]
    if not hasattr(df, '_search_groups'):
        df._search_groups = dict()
    if fields not in df._search_groups:
        # build the groupby-based index for this combination of fields
        df._search_groups[fields] = df.reset_index().groupby(fields).groups
    irows = []
    for i in itertools.product(*values):
        if i in df._search_groups[fields]:
            irows.append(df._search_groups[fields][i])
    return df.take(list(itertools.chain(*irows)))
# NEW METHOD - first time for caching
%time _=search_dataframe_index(data, searches)
CPU times: user 2.28 s, sys: 205 ms, total: 2.49 s
Wall time: 2.49 s
# NEW METHOD - average performance
%timeit _=search_dataframe_index(data, searches)
1000 loops, best of 3: 275 µs per loop
# PREVIOUS
%timeit _=search_with_index(data, searches)
100 loops, best of 3: 3.15 ms per loop
%timeit _=data1.loc[(slice(0.1,0.3),slice(0.1,0.3)),:]
10 loops, best of 3: 57.3 ms per loop
It's much faster when searching, but is quite 'slow' when creating the 'index'. Still, in my production case it brought the processing time down from 1 hour to 5 minutes :)
@alzmcr Did you put this change into a PR?
@humford nope, but I could - I already have all the code. It's just a matter of deciding whether it's elegant enough to store additional indexing elements within a DataFrame, and probably finding a better name for the function.
@alzmcr In regards to the previous discussion, do you know where/how this might fit into the repo? It seems like this has been added to a release milestone and you've documented a pretty remarkable speed up so I think a PR might allow a more focused discussion on implementation.
@alzmcr Could you give an example with a data set? I am finding it hard to understand what the params parameter looks like.
@shishirpy a full example is at: https://github.com/pandas-dev/pandas/issues/11026#issuecomment-138889948 - it's confusing because in the example they're called searches, while in the function signature they're called params.
I would love to open a PR, but I'm struggling to find the time, and it would be my first time contributing here!
Remember that the index needs to be updated upon each mutation of its parent DataFrame/column.
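With the groupby-based cache above, one simple (if blunt) way to handle that might be to drop the cached groups whenever the frame is mutated - a sketch, where invalidate_search_index is a hypothetical helper, not part of the proposal:

def invalidate_search_index(df):
    # drop any cached groupby-based indexes so they are rebuilt on the next search
    if hasattr(df, '_search_groups'):
        del df._search_groups

# hypothetical usage: after mutating the DataFrame, throw away the stale cache
data.loc[0, 'A'] = 0.99
invalidate_search_index(data)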
AFAICT this boils down to trying to optimize MultiIndex.get_locs. I haven't looked at it closely, but it's plausible that there are optimizations available in lexsorted cases by caching more aggressively.
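For context, a minimal sketch of what that call does on a lexsorted MultiIndex (the frame and bounds are illustrative):

import numpy as np
import pandas as pd

np.random.seed(0)
data = pd.DataFrame(np.random.random((100000, 3)).round(2), columns=['A', 'B', 'C'])
indexed = data.set_index(['A', 'B']).sort_index()

# .loc with slicers ends up calling MultiIndex.get_locs under the hood;
# calling it directly returns the positional indexer for the same selection
locs = indexed.index.get_locs((slice(0.1, 0.3), slice(0.1, 0.3)))
subset = indexed.iloc[locs]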