SplitApplyCombine.jl icon indicating copy to clipboard operation
SplitApplyCombine.jl copied to clipboard

Store first value in Dict directly in innerjoin

Open non-Jedi opened this issue 5 years ago • 1 comments

I saw some of the discussion in https://github.com/JuliaData/DataFrames.jl/issues/2340 and got curious about what was possible.

This avoids allocating a Vector for the case where l does not have multiple indices with the same value. For the smoke-test benchmark in https://github.com/JuliaData/DataFrames.jl/issues/2340#issuecomment-668566508, this reduces allocations by half and overall runtime by 10%.

Most of the allocations still come from this line which it's much less clear how to reduce allocations in. I'm not sure how much https://github.com/JuliaLang/julia/issues/24909 affects performance in this case. One option would be to heuristically estimate the size of out based on the size of l and r and call sizehint! on it; this didn't seem to help in my testing.

I realize I'm only optimizing a single method of innerjoin, but I'm not super familiar with this field nor with the inner workings of this package, so I leave it to you to decide if this is a worthwhile complication and if it's relevant elsewhere in the package.

non-Jedi avatar Aug 28 '20 03:08 non-Jedi

Ah thanks.

Another strategy I tried a while back was to first assume the join keys are distinct and then bail to a more general implementation when that's not the case. It would be awesome to compare the performance vesus this.

(Sorry @non-Jedi I haven't been keeping track of PRs...)

andyferris avatar Nov 09 '21 00:11 andyferris