
Reduction performance issue

Open ehsantn opened this issue 9 years ago • 5 comments

Seems like there might be a performance issue with the new "manual" reduction method. In HPAT, single-node MPI is much faster than OpenMP for most benchmarks (pi is a good example).

I suspect it's because of cache-line ping-ponging between threads, since the threads' local results are stored consecutively.
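A minimal sketch of the suspected problem, not the actual generated code: if per-thread partial results sit in one contiguous array indexed by thread id (an assumption about the "manual" reduction here), neighboring slots share a cache line, and every update by one thread invalidates the line for the others. Padding each slot to its own cache line removes the contention.

```c
/* Sketch only: contiguous per-thread slots vs. cache-line-padded slots. */
#include <omp.h>
#include <stdlib.h>

double manual_reduce_contiguous(const double *a, long n) {
    int nthreads = omp_get_max_threads();
    double *partial = calloc(nthreads, sizeof(double)); /* ~8 doubles per 64-byte cache line */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++)
            partial[tid] += a[i];   /* writes to neighboring slots ping-pong the shared line */
    }
    double sum = 0.0;
    for (int t = 0; t < nthreads; t++) sum += partial[t];
    free(partial);
    return sum;
}

/* Same reduction, but each thread's slot is padded to a full cache line. */
typedef struct { double val; char pad[64 - sizeof(double)]; } padded_t;

double manual_reduce_padded(const double *a, long n) {
    int nthreads = omp_get_max_threads();
    padded_t *partial = calloc(nthreads, sizeof(padded_t));
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++)
            partial[tid].val += a[i];   /* each thread now owns its own cache line */
    }
    double sum = 0.0;
    for (int t = 0; t < nthreads; t++) sum += partial[t].val;
    free(partial);
    return sum;
}
```

An alternative with the same effect is to accumulate into a thread-local stack variable inside the parallel region and write it to the shared array only once at the end, which avoids the repeated cross-thread writes without any padding.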

ehsantn avatar Mar 07 '16 04:03 ehsantn

Does changing back to OpenMP reduce help? Last time I measured the current implementation against OpenMP reduce on some benchmarks, there was no practical difference on multi-core; I'm not sure about pi, though.
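For reference, "OpenMP reduce" here means letting the runtime manage the per-thread partials via a reduction clause instead of an explicit shared array of slots. A rough sketch on a Monte Carlo pi loop (the Monte Carlo formulation is an assumption for illustration, not necessarily the benchmark's exact kernel):

```c
/* Illustrative OpenMP reduction; each thread gets a private copy of `hits`
 * that the runtime combines at the end of the parallel region. */
#include <omp.h>
#include <stdlib.h>

double estimate_pi(long n, unsigned int seed) {
    long hits = 0;
    #pragma omp parallel reduction(+:hits)
    {
        unsigned int s = seed + omp_get_thread_num();  /* per-thread RNG state */
        #pragma omp for
        for (long i = 0; i < n; i++) {
            double x = (double)rand_r(&s) / RAND_MAX;
            double y = (double)rand_r(&s) / RAND_MAX;
            if (x * x + y * y < 1.0) hits++;
        }
    }
    return 4.0 * (double)hits / (double)n;
}
```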

ninegua avatar Mar 07 '16 18:03 ninegua

I see the same issue on all the benchmarks I have tested for HPAT. I'm working on testing OpenMP reduce on Cori now. I think we might have thread affinity issues on our machines.
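One way to check the affinity suspicion is to pin threads explicitly (e.g. OMP_PROC_BIND=close OMP_PLACES=cores, or KMP_AFFINITY with the Intel runtime) and print where each thread actually lands; a minimal Linux-only check, sketched under those assumptions:

```c
/* Print which core each OpenMP thread runs on; rerun with different
 * affinity environment settings and compare. sched_getcpu() is glibc-specific. */
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        printf("thread %d of %d on cpu %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}
```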

ehsantn avatar Mar 07 '16 18:03 ehsantn

It seems OpenMP reduce is similar in performance. I don't know where this performance difference comes from.

ehsantn avatar Mar 07 '16 19:03 ehsantn

Are we going to do anything about this? If OpenMP performance is similar, I don't see an immediate remedy that would help.

ninegua avatar Apr 15 '16 16:04 ninegua

I think we need deeper performance analysis (with VTune?) to find out what the problem is.

ehsantn avatar Apr 15 '16 17:04 ehsantn