CARLsim4 replace pthreads with openmp

Jul 30 '17 07:07 mode89

Changes Unknown when pulling 1843574141fc118f9ed52cd647353eed424b114a on mode89:feat/openmp into ** on UCI-CARL:master**.

Jul 30 '17 07:07 coveralls

Why do you want to replace pthreads with OpenMP? pthreads is a lower-level multithreading API that lets us pin threads to specific cores, and that is something we want.

Create two new branches:

feat/benchmarkOpenmp from this branch, and copy benchmark4 from feat/benchmark
feat/benchmarkpthreads from master, and copy benchmark4 from feat/benchmark

Run benchmark4 on both branches using multicore machines (1, 4, 8, 16, 32 cores) and share the comparison. Copying @tingshuc
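
For readers unfamiliar with the "hard assign threads to cores" point, here is a minimal sketch of thread pinning with the pthreads affinity calls. It assumes Linux with glibc (`pthread_setaffinity_np` is a non-portable GNU extension), and the helper names `pin_current_thread` and `first_allowed_cpu` are hypothetical, not taken from CARLsim.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Pins the calling thread to a single CPU. Returns 0 on success,
// or an error number on failure. Linux/glibc only.
int pin_current_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Returns the lowest-numbered CPU the calling thread is currently
// allowed to run on, or -1 if the affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        return -1;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) {
            return cpu;
        }
    }
    return -1;
}
```

A worker thread would call `pin_current_thread(first_allowed_cpu())` (or pass an explicit core index) at startup. Older glibc versions need `-pthread` at link time.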

Jul 30 '17 17:07 hkashyap

Hi, @hkashyap. I'll do benchmarking.

I did the replacement because:

the OpenMP-based implementation has cleaner code;
it is cross-platform;
I couldn't find a way to break into a pthreads thread in VS Code on Linux.

Jul 31 '17 02:07 mode89

Hi, @hkashyap

I ran benchmark4 on a machine with 8 logical cores. I didn't want to wait too long, so I ran simulations with 300 and 400 synapses only. Here is the run_benchmark4 script that I used.

Here is the summary:

OpenMP

Partitions	Synapses	Setup time	Run time
1	300	74703	461210
1	400	84604	572805
2	300	72605	242761
2	400	82895	301566
4	300	70685	155766
4	400	81047	191754
8	300	62028	106303
8	400	77644	131458
16	300	59296	91215
16	400	75889	114351

Output and record.csv files.

pthreads

Partitions	Synapses	Setup time	Run time
1	300	75792	259754
1	400	80685	323523
2	300	73531	166340
2	400	83039	206210
4	300	72645	121236
4	400	80364	152389
8	300	69844	132340
8	400	72541	170894
16	300	64541	158897
16	400	73202	196628

Output and record.csv files.

I've created branches feat/benchmark-openmp and feat/benchmark-pthreads.
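
To compare the two tables above, it can help to turn run times into speedup and parallel efficiency relative to the 1-partition run. The sketch below is only an illustration; the function names are hypothetical and not part of benchmark4.

```cpp
#include <cassert>
#include <cstdio>

// Speedup of a multi-partition run relative to the 1-partition run.
// Both times must be in the same unit; the ratio is unit-free.
double speedup(double t_single, double t_parallel) {
    assert(t_single > 0.0 && t_parallel > 0.0);
    return t_single / t_parallel;
}

// Parallel efficiency: speedup divided by the number of partitions.
double efficiency(double t_single, double t_parallel, int partitions) {
    assert(partitions > 0);
    return speedup(t_single, t_parallel) / partitions;
}

// Prints one summary line for a given configuration.
void print_scaling(const char* label, double t_single, double t_parallel,
                   int partitions) {
    std::printf("%s, %d partitions: speedup %.2f, efficiency %.2f\n",
                label, partitions,
                speedup(t_single, t_parallel),
                efficiency(t_single, t_parallel, partitions));
}
```

For the 300-synapse rows, `speedup(461210, 106303)` gives roughly 4.34 for OpenMP at 8 partitions, while `speedup(259754, 121236)` gives roughly 2.14 for pthreads at 4 partitions, its fastest configuration.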

Jul 31 '17 06:07 mode89

Hi Andrew @mode89, first of all, thanks for helping out. The result is quite interesting. We'll double check the computed results and discuss whether we should just move to OpenMP. In the meantime, please let us know the best e-mail for contacting you. We are writing the CARLsim4 paper and I think we should at least acknowledge your contribution.

Jul 31 '17 06:07 tingshuc

Hi Ting-Shuo @tingshuc

Yeah, the results are interesting. On Saturday I tried to run some simulations on a 4-core machine, and as far as I remember the OpenMP implementation outperformed pthreads. Later I will run the benchmark on that machine and share the results with you.

I've implemented building with OpenMP in CMake only. If you plan on using Make for building, you will probably have to enable OpenMP support by passing some additional compiler/linker flags. If you use GCC and GNU's default OpenMP implementation, then the compiler flag should be -fopenmp and the linker flag -lgomp. On other toolchains the flags will be different.

Thank you for the acknowledgment. I can be contacted via [email protected]

Jul 31 '17 07:07 mode89

Hi @mode89 thank you for the benchmark comparison. We will double check them using 500 and 600 synapses and with more cores (on a cluster) when we get some time. We will definitely do an analysis.

Aug 01 '17 02:08 hkashyap

Speaking of clusters, we run these multicore simulations on clusters with many nodes, not on one node with many cores. Using OpenMP will mean that we are restricted to a single node. If we really need to use many nodes in an HPC setting, don't we need something like MPI?

Aug 01 '17 03:08 hkashyap

Actually, we need OpenMP + MPI. If this is something @mode89 would help with, we can jump into CARLsim5.

Aug 01 '17 03:08 tingshuc

Hey guys.

I've never worked with MPI, but it sounds interesting and I'm keen on helping you with this.

Hirak @hkashyap yes, you are right that OpenMP is bound to a single node, but I think it can work in conjunction with MPI: MPI launches a single process per node, and each process spreads its work across all of that node's CPUs using OpenMP.

Aug 01 '17 06:08 mode89

Here are the results of running benchmark4 on Intel Core i5-4210U with 4 logical cores.

Partitions   Synapses    Start time          Run time
                         OpenMP  Pthreads    OpenMP  Pthreads

1            100         23609   24318       155879  138548
             200         27572   28293       209926  197093
             300         31496   32510       269595  260106
             400         35558   36965       332930  326105
             500         40029   41647       394867  392823
             600         44297   45821       456115  456416

2            100         23483   23508       107391  123676
             200         26926   27563       142489  176900
             300         30862   31566       185877  231452
             400         34732   35655       226305  289984
             500         39243   39992       264346  346974
             600         43062   44348       307221  406263

4            100         22705   23011       87710   87466
             200         26339   27434       125000  126202
             300         30199   30719       160702  162898
             400         33768   34543       204995  201963
             500         37778   38578       243666  243010
             600         42162   42494       279010  284144

8            100         22257   22556       86465   91126
             200         25721   26193       120797  128545
             300         29192   29783       161095  172993
             400         32743   33737       194389  213762
             500         36507   37364       244109  248595
             600         40175   41572       268033  288087

16           100         21943   22150       84459   104359
             200         25443   25581       118988  140627
             300         29239   29094       155669  177737
             400         32016   32612       194259  224579
             500         35622   36251       234972  259011
             600         39170   39866       262310  298025

Output for OpenMP and pthreads; record.csv for OpenMP and pthreads.

Aug 01 '17 15:08 mode89

Hi @mode89 what do you mean by running 16 core simulations using 4 logical cores? You need 16 physical cores (on one node in case of OpenMP) to run the SNN simulations on 16 cores.

Aug 02 '17 05:08 hkashyap

Hi @hkashyap I meant a CPU with 4 logical cores. The word "cores" in the table meant the number that is passed to the benchmark executable: it defines the number of partitions used. In the run_benchmark4 script it's referred to as "number of cores". I've replaced the word "Cores" with "Partitions" in the tables above.

Aug 02 '17 08:08 mode89

My personal opinion on these numbers is that the difference is minor. And the more neurons we have, the smaller the difference is. Actually, pthreads is quite low-level, and I believe the pthread-based CARLsim can be optimized to match the performance of OpenMP. I think GNU's OpenMP is even based on pthreads under the hood, because libgomp is linked against libpthread.

My main concerns were readability and maintainability. The pthread-based implementation requires more code, more variables to keep track of, and additional helper functions. Actually, from the statistics of this pull request we can see 100 lines against 800 lines. Plus, OpenMP is cross-platform and all popular compilers support it out of the box.

Aug 02 '17 09:08 mode89

@mode89 now I see what's going on here. My best guess is that your CPU can run only 4 threads in parallel, so you are not improving anything beyond 4 cores. OpenMP automatically distributes the work over these four cores for any number of partitions >= 4. On the other hand, since with pthreads we manually try to hard assign 8 or 16 threads to four cores, the performance goes down.

Anyway, as I already explained, we need to assign threads to multiple physical cores on multiple cluster nodes, and that is the main focus above all. Thank you for the details.
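
The oversubscription effect described above (more pinned threads than cores) can be avoided by capping the worker count at the number of hardware threads. This is a minimal sketch of that idea, not CARLsim's actual logic; `detected_cores` and `effective_workers` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Number of hardware threads reported by the standard library.
// hardware_concurrency() may return 0 when the value is unknown,
// in which case this falls back to 1.
unsigned detected_cores() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : n;
}

// Caps the requested worker count at the number of available cores,
// so that no core has to time-slice between several workers.
unsigned effective_workers(unsigned requested,
                           unsigned available = detected_cores()) {
    assert(requested > 0 && available > 0);
    return std::min(requested, available);
}
```

With 4 available cores, `effective_workers(16, 4)` yields 4, so a request for 16 partitions would not place more than one worker on each core.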

Aug 04 '17 01:08 hkashyap

Hi @mode89, I am finalizing this pull request by comparing it against pthreads. Sorry for the delay, I had a very busy summer. I have two questions:

Did you run the simulations on Windows? If not, I will seek help from @tingshuc
I re-ran benchmark4 on your openmp branch on a cluster node with 60 cores and saw no improvement beyond 4 cores. What may be the reason for this? Performance did improve in the case of pthreads, which is still running.

Openmp:

number of cores is 1
2000 | 100 | 5 | 32346 | 205685
2000 | 200 | 5 | 41421 | 294648
2000 | 300 | 5 | 50009 | 386893
2000 | 400 | 5 | 60023 | 489918
2000 | 500 | 5 | 67978 | 592517
2000 | 600 | 5 | 77117 | 684760

number of cores is 2
2000 | 100 | 5 | 31053 | 222891
2000 | 200 | 5 | 39513 | 330809
2000 | 300 | 5 | 47618 | 434174
2000 | 400 | 5 | 55777 | 541242
2000 | 500 | 5 | 64268 | 654405
2000 | 600 | 4 | 72789 | 763254

number of cores is 4
2000 | 100 | 5 | 29730 | 227001
2000 | 200 | 5 | 37374 | 329766
2000 | 300 | 5 | 45064 | 437332
2000 | 400 | 5 | 53110 | 548943
2000 | 500 | 5 | 60992 | 654377
2000 | 600 | 5 | 69426 | 765715

number of cores is 8
2000 | 100 | 5 | 29310 | 227971
2000 | 200 | 5 | 36186 | 331715
2000 | 300 | 4 | 43697 | 437471
2000 | 400 | 4 | 51079 | 551749
2000 | 500 | 4 | 58638 | 665500
2000 | 600 | 4 | 65833 | 782450

number of cores is 16
2000 | 100 | 4 | 28665 | 228726
2000 | 200 | 5 | 35377 | 329576
2000 | 300 | 4 | 42417 | 444320
2000 | 400 | 5 | 49668 | 553423
2000 | 500 | 4 | 56639 | 664213
2000 | 600 | 6 | 63710 | 785869

number of cores is 32
2000 | 100 | 6 | 28259 | 229111
2000 | 200 | 4 | 34707 | 331572
2000 | 300 | 5 | 41544 | 439790
2000 | 400 | 5 | 48300 | 553787
2000 | 500 | 5 | 55451 | 666177
2000 | 600 | 5 | 62638 | 780040

pthreads:

number of cores is 1
2000 | 100 | 5 | 23949 | 243003
2000 | 200 | 5 | 30291 | 337753
2000 | 300 | 4 | 36714 | 528196
2000 | 400 | 5 | 59039 | 602546
2000 | 500 | 6 | 65720 | 718506
2000 | 600 | 5 | 75035 | 1014244

number of cores is 2
2000 | 100 | 4 | 30486 | 433953
2000 | 200 | 5 | 38698 | 564945
2000 | 300 | 5 | 46680 | 719083
2000 | 400 | 11 | 54868 | 874514
2000 | 500 | 5 | 63284 | 1029743
2000 | 600 | 6 | 71487 | 1042554

number of cores is 4
2000 | 100 | 5 | 29468 | 294977
2000 | 200 | 5 | 37193 | 386312
2000 | 300 | 6 | 44585 | 453801
2000 | 400 | 5 | 52434 | 504584
2000 | 500 | 5 | 60108 | 558175
2000 | 600 | 5 | 67862 | 642926

number of cores is 8
2000 | 100 | 5 | 28790 | 239215
2000 | 200 | 4 | 35738 | 330791
2000 | 300 | 5 | 43251 | 391847
2000 | 400 | 5 | 50363 | 464413
2000 | 500 | 5 | 57080 | 476765
2000 | 600 | 5 | 64221 | 543255

number of cores is 16
2000 | 100 | 5 | 27675 | 233491
2000 | 200 | 5 | 34664 | 291623
2000 | 300 | 5 | 41339 | 368996
2000 | 400 | 5 | 47832 | 420652
2000 | 500 | 6 | 55003 | 460946
2000 | 600 | 5 | 62438 | 495215

Sep 26 '17 05:09 hkashyap

Hi @hkashyap, no problem!

I haven't run the simulation on Windows. Visual Studio supports only OpenMP 2.0, which lacks the #pragma omp task directive. I think it should be possible to replace it with the #pragma omp section directive.
As we can see, this time OpenMP performance doesn't change as the number of partitions increases, unlike in my earlier runs on the 8-core and 4-core CPUs, where run time changed drastically between the 1-partition and 2-partition configurations. My guess is that OpenMP might be disabled. Did you use Make or CMake to build the project? I haven't changed the Make scripts. GCC requires the -fopenmp option to generate OpenMP code, and the link step has to link against the GOMP library with -lgomp. Other toolchains, e.g. Visual Studio, require different flags. I think you can debug it by printing the result of omp_get_thread_num or omp_get_num_threads somewhere inside the #pragma omp blocks.
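
Both points can be checked with a small standalone program. The sketch below is an assumption-laden illustration rather than code from this pull request: `openmp_enabled`, `parallel_team_size`, and `run_two_jobs` are hypothetical helpers, and the two squared values merely stand in for per-partition work. It relies on the standard `_OPENMP` macro, which a compiler defines only when OpenMP is switched on, and on `sections`, which is available in OpenMP 2.0.

```cpp
#include <cassert>
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// True only if the compiler was invoked with OpenMP support
// (for example -fopenmp on GCC); otherwise the pragmas are ignored.
bool openmp_enabled() {
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}

// Number of threads in a parallel region; 1 when OpenMP is disabled.
int parallel_team_size() {
    int team = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        #pragma omp single
        team = omp_get_num_threads();
    }
#endif
    assert(team >= 1);
    return team;
}

// Runs two independent jobs using OpenMP 2.0 sections instead of tasks.
// Each section writes its own variable, so no synchronization is needed.
int run_two_jobs(int a, int b) {
    int result_a = 0;
    int result_b = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        { result_a = a * a; }
        #pragma omp section
        { result_b = b * b; }
    }
    return result_a + result_b;
}

// Prints a one-line diagnostic that can be compared across build systems.
void report_openmp_status() {
    std::printf("OpenMP enabled: %s, team size: %d\n",
                openmp_enabled() ? "yes" : "no", parallel_team_size());
}
```

If `report_openmp_status()` prints "no" or a team size of 1 on a many-core node, the build most likely did not pass the OpenMP flag, which would match run times that stay flat as partitions increase.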

Sep 26 '17 07:09 mode89
