DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

filter compatibility with AcceleratedArrays optimized methods

Open nalimilan opened this issue 3 years ago • 7 comments

filter currently broadcasts the predicate function over the input column(s) to get a logical index. This means that even if you filter on an AcceleratedArray column, the optimized findall method won't be used.

Currently, broadcasting is faster, even if under the hood the logical index vector is converted to an integer index vector using findall by a getindex DataFrames method (which is faster than indexing each column with a logical index). https://github.com/JuliaLang/julia/pull/37177 is an attempt to make findall could be faster in Base, in which case filter could use it when filtering on a single column. Then we would be able to take advantage of AcceleratedArrays methods for free, which could be useful as a key/indexed column.

Cc: @andyferris

nalimilan avatar Aug 24 '20 13:08 nalimilan

Since the above PR in Julia is available in 1.7, maybe it's time to revisit this? Not sure what's needed here.

I'm interested in this for GeoDataFrames, which introduces a geometry column for which a spatial index would be useful to speed up spatial operations. Like AcceleratedArray, I've started GeoAcceleratedArrays in the past that could be used for this.

@joshday

evetion avatar Mar 29 '22 11:03 evetion

Sure. Would you be willing to give it a try? It looks like it should be easy to do for filter (we already use findall for filter! so the code could be similar). For subset, we should probably only use findall when a single condition is passed, so maybe the simplest approach is to add a branch for that case.

nalimilan avatar Mar 29 '22 12:03 nalimilan

@nalimilan I can implement this. Any suggestions for where these broadcasts are in the codebase?

rafaqz avatar Jul 05 '22 13:07 rafaqz

For filter the relevant line is: https://github.com/JuliaData/DataFrames.jl/blob/14e682b623ccc07d34cb7399f79311e9443f9b13/src/abstractdataframe/abstractdataframe.jl#L1133

For subset it's more tricky as I noted above, but the relevant code is here: https://github.com/JuliaData/DataFrames.jl/blob/14e682b623ccc07d34cb7399f79311e9443f9b13/src/abstractdataframe/subset.jl#L78-L79

nalimilan avatar Jul 05 '22 14:07 nalimilan

Ah thanks. Yes I found that for filter. And what about _filter!_helper ? seems like it could also use findall for a single column.

I'll have a look at _get_subset_conditions for subset.

rafaqz avatar Jul 05 '22 15:07 rafaqz

Thank you for willingness to work on this. Two organizational comments:

  1. it would be good to have a feedback from @andyferris about AcceleratedArrays.jl support plans (not crucial but would be nice to be in sync here).
  2. if you make a PR please know that DataFrames.jl is now in feature freeze for 1.4 release. The reason is that https://github.com/JuliaData/DataFrames.jl/pull/3055 changes almost all source code. Therefore you can expect risk of needing to rebase your PR later (again - not crucial, but I just want you to know this).

bkamins avatar Jul 05 '22 16:07 bkamins

Ok I'll wait for that PR.

rafaqz avatar Jul 05 '22 17:07 rafaqz