Database-style indexes for efficient filtering?
Problem description
Apologies if this is flatly out of scope or otherwise based on a misunderstanding of how Polars works.
I am working on an application where I will need to perform repeated range lookups on a Datetime column in a fairly large data frame, and I am thinking about using Polars for this project instead of Pandas.
I am concerned that these repeated range lookups could get very expensive when accumulated over the total running time of the application. Traditional relational databases like PostgreSQL offer several index types that can be used to avoid full table scans. And Pandas' "index-as-row-labels" system can, at least in theory, support internal optimizations for these kinds of operations.
Understanding of course that Polars has chosen not to adopt Pandas' system, is there any plan to support some kind of external index system for Polars data frames? A hypothetical Python API could look something like this:
```python
from datetime import datetime as DateTime

import polars as pl

data: pl.DataFrame = ...  # big data frame

t1 = DateTime(2022, 1, 1)
t2 = DateTime(2022, 3, 1)

# Today: every range filter scans the full column.
result1 = data.filter(
    (pl.col('start_time') >= t1) & (pl.col('start_time') < t2)
)

# Hypothetical API: build an external index once, reuse it for lookups.
start_time_index = pl.build_index(data, ['start_time'], method='btree')
result2 = data.filter(
    (start_time_index >= t1) & (start_time_index < t2)
)
```
I could see this also being helpful when working with geospatial data.
Of course, we should not try to optimize heavily without having measured/profiled things first. And a system like this would inherit all of the problems associated with database indexes (they need to be kept up to date, they don't help with low-cardinality columns, etc). But I'd be surprised if repeated full-scan filters couldn't end up as a bottleneck in certain applications.
Is this something that would be considered in-scope for the Polars project? Or is this a niche enough use case that most users aren't expected to need it, and if needed would be expected to craft or bring in their own indexes (e.g. find a suitable B-tree or R-tree library in the host language)?
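For the "bring your own index" route mentioned above, here is a minimal sketch using only the Python standard library: keep a sorted list of `(key, row_number)` pairs and binary-search it with `bisect` for range lookups. The function names and sample data are illustrative, not part of any library.

```python
from bisect import bisect_left
from datetime import datetime as DateTime


def build_sorted_index(values):
    """Return (keys sorted ascending, row numbers in the same order)."""
    pairs = sorted((v, i) for i, v in enumerate(values))
    keys = [k for k, _ in pairs]
    rows = [i for _, i in pairs]
    return keys, rows


def range_lookup(keys, rows, lo, hi):
    """Row numbers whose key falls in [lo, hi), found by binary search."""
    start = bisect_left(keys, lo)  # first key >= lo
    stop = bisect_left(keys, hi)   # first key >= hi
    return rows[start:stop]


start_times = [
    DateTime(2021, 12, 31),
    DateTime(2022, 1, 15),
    DateTime(2022, 2, 20),
    DateTime(2022, 3, 1),
]
keys, rows = build_sorted_index(start_times)
hits = range_lookup(keys, rows, DateTime(2022, 1, 1), DateTime(2022, 3, 1))
# hits -> [1, 2] (rows whose start_time falls in the half-open range)
```

This pays an O(n log n) build cost once, then answers each range query in O(log n) plus the size of the result, which is exactly the trade-off a database B-tree index makes.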
Edit: in the particular example above, it would make sense to sort this data by start_time. Would .sort('start_time').rechunk() be helpful for this? I imagine you could also cook something up with .groupby_dynamic or even .partition_by depending on the exact usage.
Polars can already leverage the fact that data is sorted, see for instance this excellent blog post: https://braaannigan.github.io/software/2022/09/13/polars-with-set-sorted.html
The blog post works perfectly for a single-column filter. Is there a way to make it work with multiple columns? For example, using Polars to execute "<" and ">" operations to build a decision tree, you might want to efficiently query any column.
@MiguelAngelTorres I think that may fall under the "self non-equi join" category?
I don't think anything currently exists to help with that.
- https://github.com/pola-rs/polars/issues/10068