
Reduction performance issue

Open ehsantn opened this issue 9 years ago • 5 comments

Seems like there might be a performance issue with the new "manual" reduction method. In HPAT, single-node MPI is much faster than OpenMP for most benchmarks (pi is a good example).

I suspect it's because of cache-line ping-ponging between threads, since the threads' local results are stored consecutively.
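A minimal sketch of the suspected problem, not the actual generated code: if per-thread partial results sit in one contiguous array indexed by thread id (an assumption about the "manual" reduction here), neighboring slots share a cache line, and every update by one thread invalidates the line for the others. Padding each slot to its own cache line removes the contention.

```c
/* Sketch only: contiguous per-thread slots vs. cache-line-padded slots. */
#include <omp.h>
#include <stdlib.h>

double manual_reduce_contiguous(const double *a, long n) {
    int nthreads = omp_get_max_threads();
    double *partial = calloc(nthreads, sizeof(double)); /* ~8 doubles per 64-byte cache line */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++)
            partial[tid] += a[i];   /* writes to neighboring slots ping-pong the shared line */
    }
    double sum = 0.0;
    for (int t = 0; t < nthreads; t++) sum += partial[t];
    free(partial);
    return sum;
}

/* Same reduction, but each thread's slot is padded to a full cache line. */
typedef struct { double val; char pad[64 - sizeof(double)]; } padded_t;

double manual_reduce_padded(const double *a, long n) {
    int nthreads = omp_get_max_threads();
    padded_t *partial = calloc(nthreads, sizeof(padded_t));
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++)
            partial[tid].val += a[i];   /* each thread now owns its own cache line */
    }
    double sum = 0.0;
    for (int t = 0; t < nthreads; t++) sum += partial[t].val;
    free(partial);
    return sum;
}
```

An alternative with the same effect is to accumulate into a thread-local stack variable inside the parallel region and write it to the shared array only once at the end, which avoids the repeated cross-thread writes without any padding.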

ehsantn avatar Mar 07 '16 04:03 ehsantn

Does changing back to OpenMP reduce help? Last time I measured the current implementation against OpenMP reduce on some benchmarks, there was no practical difference on multi-core; I'm not sure about pi, though.
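For reference, "OpenMP reduce" here means letting the runtime manage the per-thread partials via a reduction clause instead of an explicit shared array of slots. A rough sketch on a Monte Carlo pi loop (the Monte Carlo formulation is an assumption for illustration, not necessarily the benchmark's exact kernel):

```c
/* Illustrative OpenMP reduction; each thread gets a private copy of `hits`
 * that the runtime combines at the end of the parallel region. */
#include <omp.h>
#include <stdlib.h>

double estimate_pi(long n, unsigned int seed) {
    long hits = 0;
    #pragma omp parallel reduction(+:hits)
    {
        unsigned int s = seed + omp_get_thread_num();  /* per-thread RNG state */
        #pragma omp for
        for (long i = 0; i < n; i++) {
            double x = (double)rand_r(&s) / RAND_MAX;
            double y = (double)rand_r(&s) / RAND_MAX;
            if (x * x + y * y < 1.0) hits++;
        }
    }
    return 4.0 * (double)hits / (double)n;
}
```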

ninegua avatar Mar 07 '16 18:03 ninegua

I see the same issue on all the benchmarks I have tested for HPAT. I'm working on testing OpenMP reduce on Cori now. I think we might have thread affinity issues on our machines.
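One way to check the affinity suspicion is to pin threads explicitly (e.g. OMP_PROC_BIND=close OMP_PLACES=cores, or KMP_AFFINITY with the Intel runtime) and print where each thread actually lands; a minimal Linux-only check, sketched under those assumptions:

```c
/* Print which core each OpenMP thread runs on; rerun with different
 * affinity environment settings and compare. sched_getcpu() is glibc-specific. */
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        printf("thread %d of %d on cpu %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}
```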

ehsantn avatar Mar 07 '16 18:03 ehsantn

It seems OpenMP reduce is similar in performance. I don't know where this performance difference comes from.

ehsantn avatar Mar 07 '16 19:03 ehsantn

Are we going to do anything about this? If OpenMP performance is similar, I don't see an immediate remedy that would help.

ninegua avatar Apr 15 '16 16:04 ninegua

I think we need deeper performance analysis (with VTune?) to find out what the problem is.

ehsantn avatar Apr 15 '16 17:04 ehsantn