SplitApplyCombine.jl
SplitApplyCombine.jl copied to clipboard
Store first value in Dict directly in innerjoin
I saw some of the discussion in https://github.com/JuliaData/DataFrames.jl/issues/2340 and got curious about what was possible.
This avoids allocating a Vector for the case where l does not have
multiple indices with the same value. For the smoke-test benchmark in
https://github.com/JuliaData/DataFrames.jl/issues/2340#issuecomment-668566508,
this reduces allocations by half and overall runtime by 10%.
Most of the allocations still come from this
line
which it's much less clear how to reduce allocations in. I'm not sure
how much https://github.com/JuliaLang/julia/issues/24909 affects
performance in this case. One option would be to heuristically
estimate the size of out based on the size of l and r and call
sizehint! on it; this didn't seem to help in my testing.
I realize I'm only optimizing a single method of innerjoin, but I'm
not super familiar with this field nor with the inner workings of this
package, so I leave it to you to decide if this is a worthwhile
complication and if it's relevant elsewhere in the package.
Ah thanks.
Another strategy I tried a while back was to first assume the join keys are distinct and then bail to a more general implementation when that's not the case. It would be awesome to compare the performance vesus this.
(Sorry @non-Jedi I haven't been keeping track of PRs...)