DataFrames.jl
DataFrames.jl copied to clipboard
filter compatibility with AcceleratedArrays optimized methods
filter
currently broadcasts the predicate function over the input column(s) to get a logical index. This means that even if you filter on an AcceleratedArray
column, the optimized findall
method won't be used.
Currently, broadcasting is faster, even if under the hood the logical index vector is converted to an integer index vector using findall
by a getindex
DataFrames method (which is faster than indexing each column with a logical index). https://github.com/JuliaLang/julia/pull/37177 is an attempt to make findall
could be faster in Base, in which case filter
could use it when filtering on a single column. Then we would be able to take advantage of AcceleratedArrays methods for free, which could be useful as a key/indexed column.
Cc: @andyferris
Since the above PR in Julia is available in 1.7, maybe it's time to revisit this? Not sure what's needed here.
I'm interested in this for GeoDataFrames, which introduces a geometry column for which a spatial index would be useful to speed up spatial operations. Like AcceleratedArray
, I've started GeoAcceleratedArrays in the past that could be used for this.
@joshday
Sure. Would you be willing to give it a try? It looks like it should be easy to do for filter
(we already use findall
for filter!
so the code could be similar). For subset
, we should probably only use findall
when a single condition is passed, so maybe the simplest approach is to add a branch for that case.
@nalimilan I can implement this. Any suggestions for where these broadcasts are in the codebase?
For filter
the relevant line is:
https://github.com/JuliaData/DataFrames.jl/blob/14e682b623ccc07d34cb7399f79311e9443f9b13/src/abstractdataframe/abstractdataframe.jl#L1133
For subset
it's more tricky as I noted above, but the relevant code is here:
https://github.com/JuliaData/DataFrames.jl/blob/14e682b623ccc07d34cb7399f79311e9443f9b13/src/abstractdataframe/subset.jl#L78-L79
Ah thanks. Yes I found that for filter. And what about _filter!_helper
? seems like it could also use findall
for a single column.
I'll have a look at _get_subset_conditions
for subset.
Thank you for willingness to work on this. Two organizational comments:
- it would be good to have a feedback from @andyferris about AcceleratedArrays.jl support plans (not crucial but would be nice to be in sync here).
- if you make a PR please know that DataFrames.jl is now in feature freeze for 1.4 release. The reason is that https://github.com/JuliaData/DataFrames.jl/pull/3055 changes almost all source code. Therefore you can expect risk of needing to rebase your PR later (again - not crucial, but I just want you to know this).
Ok I'll wait for that PR.