CARLsim4
replace pthreads with openmp
Changes Unknown when pulling 1843574141fc118f9ed52cd647353eed424b114a on mode89:feat/openmp into ** on UCI-CARL:master**.
Why do you want to replace pthreads with OpenMP? pthreads is a lower-level multithreading API that can hard-assign threads to cores, and that is something we want.
Create two new branches:
- feat/benchmarkOpenmp from this branch and copy benchmark4 from feat/benchmark
- feat/benchmarkpthreads from this master and copy benchmark4 from feat/benchmark
Run benchmark4 on both branches using multicore machines (1, 4, 8, 16, 32 cores) and share the comparison. cc @tingshuc
Hi, @hkashyap. I'll do benchmarking.
I did the replacement because:
- the OpenMP-based implementation has cleaner code;
- it is cross-platform;
- I couldn't find a way to break into a pthreads thread in VS Code on Linux.
Hi, @hkashyap
I ran benchmark4 on a machine with 8 logical cores. I didn't want to wait too long, so I ran simulations with 300 and 400 synapses only. Here is the run_benchmark4 script that I used.
Here is the summary:
OpenMP
Partitions | Synapses | Setup time | Run time |
---|---|---|---|
1 | 300 | 74703 | 461210 |
1 | 400 | 84604 | 572805 |
2 | 300 | 72605 | 242761 |
2 | 400 | 82895 | 301566 |
4 | 300 | 70685 | 155766 |
4 | 400 | 81047 | 191754 |
8 | 300 | 62028 | 106303 |
8 | 400 | 77644 | 131458 |
16 | 300 | 59296 | 91215 |
16 | 400 | 75889 | 114351 |
Output and record.csv files.
pthreads
Partitions | Synapses | Setup time | Run time |
---|---|---|---|
1 | 300 | 75792 | 259754 |
1 | 400 | 80685 | 323523 |
2 | 300 | 73531 | 166340 |
2 | 400 | 83039 | 206210 |
4 | 300 | 72645 | 121236 |
4 | 400 | 80364 | 152389 |
8 | 300 | 69844 | 132340 |
8 | 400 | 72541 | 170894 |
16 | 300 | 64541 | 158897 |
16 | 400 | 73202 | 196628 |
Output and record.csv files.
I've created branches feat/benchmark-openmp and feat/benchmark-pthreads.
Hi Andrew @mode89, first of all, thanks for helping out. The result is quite interesting. We'll double-check the computing results and discuss whether we should just move to OpenMP. In the meantime, please let us know the best e-mail for contacting you. We are writing the CARLsim4 paper and I think we should at least acknowledge your contribution.
Hi Ting-Shuo @tingshuc
Yeah, the results are interesting. On Saturday I tried to run some simulations on a 4-core machine and as far as I remember OpenMP's implementation outperformed pthreads. Later I will run the benchmark on that machine and share the results with you.
I've implemented building with OpenMP in CMake only. If you plan on using Make for building, then you will probably have to enable OpenMP support by passing some additional compiler/linker flags. If you use GCC and the default GNU OpenMP implementation, then the compiler flag should be `-fopenmp` and the linker flag is `-lgomp`. On other toolchains the flags will be different.
Thank you for the acknowledgment. I can be contacted via [email protected]
Hi @mode89 thank you for the benchmark comparison. We will double check them using 500 and 600 synapses and with more cores (on a cluster) when we get some time. We will definitely do an analysis.
Talking about clusters, we run these multicore simulations on clusters with many nodes, not on one node with many cores. Using OpenMP will mean that we are restricted to a single node. If we really need to use many nodes in an HPC setting, don't we need something like MPI?
Actually we need OpenMP + MPI. If this is something @mode89 would help with, we can jump into CARLsim5.
Hey guys.
I've never got my hands on MPI, but it sounds interesting and I'm keen on helping you with this.
Hirak @hkashyap yes, you are right that OpenMP is bound to a single node, but I think it can work in conjunction with MPI: MPI launches a single process per node, and each process parallelizes its jobs across all of that node's CPUs using OpenMP.
Here are the results of running benchmark4 on Intel Core i5-4210U with 4 logical cores.
Partitions | Synapses | Start time (OpenMP) | Start time (Pthreads) | Run time (OpenMP) | Run time (Pthreads) |
---|---|---|---|---|---|
1 | 100 | 23609 | 24318 | 155879 | 138548 |
1 | 200 | 27572 | 28293 | 209926 | 197093 |
1 | 300 | 31496 | 32510 | 269595 | 260106 |
1 | 400 | 35558 | 36965 | 332930 | 326105 |
1 | 500 | 40029 | 41647 | 394867 | 392823 |
1 | 600 | 44297 | 45821 | 456115 | 456416 |
2 | 100 | 23483 | 23508 | 107391 | 123676 |
2 | 200 | 26926 | 27563 | 142489 | 176900 |
2 | 300 | 30862 | 31566 | 185877 | 231452 |
2 | 400 | 34732 | 35655 | 226305 | 289984 |
2 | 500 | 39243 | 39992 | 264346 | 346974 |
2 | 600 | 43062 | 44348 | 307221 | 406263 |
4 | 100 | 22705 | 23011 | 87710 | 87466 |
4 | 200 | 26339 | 27434 | 125000 | 126202 |
4 | 300 | 30199 | 30719 | 160702 | 162898 |
4 | 400 | 33768 | 34543 | 204995 | 201963 |
4 | 500 | 37778 | 38578 | 243666 | 243010 |
4 | 600 | 42162 | 42494 | 279010 | 284144 |
8 | 100 | 22257 | 22556 | 86465 | 91126 |
8 | 200 | 25721 | 26193 | 120797 | 128545 |
8 | 300 | 29192 | 29783 | 161095 | 172993 |
8 | 400 | 32743 | 33737 | 194389 | 213762 |
8 | 500 | 36507 | 37364 | 244109 | 248595 |
8 | 600 | 40175 | 41572 | 268033 | 288087 |
16 | 100 | 21943 | 22150 | 84459 | 104359 |
16 | 200 | 25443 | 25581 | 118988 | 140627 |
16 | 300 | 29239 | 29094 | 155669 | 177737 |
16 | 400 | 32016 | 32612 | 194259 | 224579 |
16 | 500 | 35622 | 36251 | 234972 | 259011 |
16 | 600 | 39170 | 39866 | 262310 | 298025 |
Output for OpenMP and pthreads; record.csv for OpenMP and pthreads.
Hi @mode89 what do you mean by running 16 core simulations using 4 logical cores? You need 16 physical cores (on one node in case of OpenMP) to run the SNN simulations on 16 cores.
Hi @hkashyap I meant 4 logical cores of the CPU. The word "cores" in the table meant the number that is passed to the benchmark executable: it defines the number of partitions used. In the run_benchmark4 script it's referred to as "number of cores". I've replaced the word "Cores" with "Partitions" in the tables above.
My personal opinion on these numbers is that the difference is minor, and the more neurons we have, the smaller the difference is. Actually, pthread is quite low-level, and I believe a pthread-based CARLsim can be optimized to match the performance of OpenMP. I think GNU's OpenMP is even based on pthreads under the hood, because libgomp is linked against libpthread.
My main concerns were readability and maintainability. The pthread-based implementation requires more code, more variables to keep track of, and additional helper functions. Actually, from the statistics of this pull request we can see 100 lines against 800 lines. Plus, OpenMP is cross-platform and all popular compilers support it out of the box.
@mode89 now I see what's going on here. My best guess is that your CPU has 4-core-level parallelization, so you are not improving anything beyond 4 cores. OpenMP automatically assigns the work to these four cores for any number of runtimes >= 4. On the other hand, since with pthreads we manually try to hard-assign 8/16 threads over four cores, the performance goes down.
Anyway, as I already explained, we need to assign to multiple physical cores on multiple cluster nodes, which is the main focus above all. Thank you for detailing.
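For readers following the "hard assign threads to cores" point, here is a minimal, Linux-only sketch of what pinning a thread to one CPU can look like. It is not CARLsim's actual code: the helper names `firstAllowedCpu` and `pinCurrentThreadToCpu` are made up for illustration, and it uses `sched_setaffinity` on the calling thread (`pthread_setaffinity_np` is the pthreads-level counterpart). On non-Linux platforms the helpers simply report failure.

```cpp
// Hypothetical helpers (not from CARLsim): pin the calling thread to one CPU.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for cpu_set_t and the CPU_* macros on glibc
#endif
#ifdef __linux__
#include <sched.h>
#endif

// Returns the lowest CPU index the calling thread is allowed to run on,
// or -1 if it cannot be determined (including on non-Linux platforms).
int firstAllowedCpu() {
#ifdef __linux__
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &mask)) return cpu;
    }
#endif
    return -1;
}

// Restricts the calling thread to a single CPU. Returns true on success.
bool pinCurrentThreadToCpu(int cpu) {
#ifdef __linux__
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    // pid 0 means "the calling thread" for sched_setaffinity on Linux.
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
#else
    (void)cpu;
    return false;
#endif
}
```

If more threads are pinned than there are physical cores, several threads end up sharing a core, which would be consistent with the pthreads slowdown at 8 and 16 partitions described above.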
Hi @mode89, I am finalizing this pull request by comparing against pthreads. Sorry for the delay, as I had a very busy summer. I have two questions:
- Did you run the simulations on Windows? If not, I will seek help from @tingshuc
- I re-ran benchmark4 on your openmp branch on a cluster node with 60 cores and received no improvement beyond 4 cores. What may be the reason for this? Performance did improve in case of pthreads, which is still running.
Openmp:
number of cores is 1 | | | | |
---|---|---|---|---|
2000 | 100 | 5 | 32346 | 205685 |
2000 | 200 | 5 | 41421 | 294648 |
2000 | 300 | 5 | 50009 | 386893 |
2000 | 400 | 5 | 60023 | 489918 |
2000 | 500 | 5 | 67978 | 592517 |
2000 | 600 | 5 | 77117 | 684760 |
number of cores is 2 | | | | |
2000 | 100 | 5 | 31053 | 222891 |
2000 | 200 | 5 | 39513 | 330809 |
2000 | 300 | 5 | 47618 | 434174 |
2000 | 400 | 5 | 55777 | 541242 |
2000 | 500 | 5 | 64268 | 654405 |
2000 | 600 | 4 | 72789 | 763254 |
number of cores is 4 | | | | |
2000 | 100 | 5 | 29730 | 227001 |
2000 | 200 | 5 | 37374 | 329766 |
2000 | 300 | 5 | 45064 | 437332 |
2000 | 400 | 5 | 53110 | 548943 |
2000 | 500 | 5 | 60992 | 654377 |
2000 | 600 | 5 | 69426 | 765715 |
number of cores is 8 | | | | |
2000 | 100 | 5 | 29310 | 227971 |
2000 | 200 | 5 | 36186 | 331715 |
2000 | 300 | 4 | 43697 | 437471 |
2000 | 400 | 4 | 51079 | 551749 |
2000 | 500 | 4 | 58638 | 665500 |
2000 | 600 | 4 | 65833 | 782450 |
number of cores is 16 | | | | |
2000 | 100 | 4 | 28665 | 228726 |
2000 | 200 | 5 | 35377 | 329576 |
2000 | 300 | 4 | 42417 | 444320 |
2000 | 400 | 5 | 49668 | 553423 |
2000 | 500 | 4 | 56639 | 664213 |
2000 | 600 | 6 | 63710 | 785869 |
number of cores is 32 | | | | |
2000 | 100 | 6 | 28259 | 229111 |
2000 | 200 | 4 | 34707 | 331572 |
2000 | 300 | 5 | 41544 | 439790 |
2000 | 400 | 5 | 48300 | 553787 |
2000 | 500 | 5 | 55451 | 666177 |
2000 | 600 | 5 | 62638 | 780040 |
pthreads:
number of cores is 1 | | | | |
---|---|---|---|---|
2000 | 100 | 5 | 23949 | 243003 |
2000 | 200 | 5 | 30291 | 337753 |
2000 | 300 | 4 | 36714 | 528196 |
2000 | 400 | 5 | 59039 | 602546 |
2000 | 500 | 6 | 65720 | 718506 |
2000 | 600 | 5 | 75035 | 1014244 |
number of cores is 2 | | | | |
2000 | 100 | 4 | 30486 | 433953 |
2000 | 200 | 5 | 38698 | 564945 |
2000 | 300 | 5 | 46680 | 719083 |
2000 | 400 | 11 | 54868 | 874514 |
2000 | 500 | 5 | 63284 | 1029743 |
2000 | 600 | 6 | 71487 | 1042554 |
number of cores is 4 | | | | |
2000 | 100 | 5 | 29468 | 294977 |
2000 | 200 | 5 | 37193 | 386312 |
2000 | 300 | 6 | 44585 | 453801 |
2000 | 400 | 5 | 52434 | 504584 |
2000 | 500 | 5 | 60108 | 558175 |
2000 | 600 | 5 | 67862 | 642926 |
number of cores is 8 | | | | |
2000 | 100 | 5 | 28790 | 239215 |
2000 | 200 | 4 | 35738 | 330791 |
2000 | 300 | 5 | 43251 | 391847 |
2000 | 400 | 5 | 50363 | 464413 |
2000 | 500 | 5 | 57080 | 476765 |
2000 | 600 | 5 | 64221 | 543255 |
number of cores is 16 | | | | |
2000 | 100 | 5 | 27675 | 233491 |
2000 | 200 | 5 | 34664 | 291623 |
2000 | 300 | 5 | 41339 | 368996 |
2000 | 400 | 5 | 47832 | 420652 |
2000 | 500 | 6 | 55003 | 460946 |
2000 | 600 | 5 | 62438 | 495215 |
Hi @hkashyap, no problem!
- I haven't run the simulation on Windows. Visual Studio supports only OpenMP 2.0, which lacks the `#pragma omp task` directive. I think it should be possible to replace it with the `#pragma omp section` directive.
- As we can see, this time OpenMP performance doesn't change as the number of partitions increases, unlike in the previous posts, where I ran it on an 8-core CPU and a 4-core CPU and the run time changed drastically between the 1-partition and 2-partition configurations. My guess is that OpenMP might be disabled. Did you use Make or CMake to build the project? I haven't changed the Make scripts. The GCC compiler needs the option `-fopenmp` to generate OpenMP-compatible code, and the GCC linker needs to link against the GOMP library with the option `-lgomp`. Other toolchains, e.g. Visual Studio, require other flags. I think you can debug it by printing the output of the function `omp_get_thread_num` or `omp_get_num_threads` somewhere inside the `#pragma omp` blocks.
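To make both suggestions concrete, here is a small standalone sketch, separate from the CARLsim sources. The function names `openmpEnabled`, `observedThreadCount`, and `runTwoJobs` are invented for this example. It relies on the standard `_OPENMP` macro, which a compiler defines only when OpenMP is switched on, so the same file builds with or without `-fopenmp`. The second half shows the OpenMP 2.0 form of running independent jobs: each `#pragma omp section` has to sit inside a `#pragma omp sections` block.

```cpp
// Illustrative sketch only; the function names are hypothetical.
#ifdef _OPENMP
#include <omp.h>
#endif

// True if the translation unit was compiled with OpenMP enabled.
bool openmpEnabled() {
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}

// Number of threads seen inside a parallel region.
// Returns 1 when the pragmas were ignored (OpenMP disabled).
int observedThreadCount() {
    int count = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        // Only one thread writes the shared variable.
        #pragma omp single
        count = omp_get_num_threads();
    }
#endif
    return count;
}

// Two independent jobs expressed with sections instead of tasks.
// Without OpenMP the pragmas are ignored and the blocks run one after another.
void runTwoJobs(int& first, int& second) {
    #pragma omp parallel sections
    {
        #pragma omp section
        { first = 1; }
        #pragma omp section
        { second = 2; }
    }
}
```

If `openmpEnabled()` returns false, or `observedThreadCount()` returns 1 on a multi-core node, the build is most likely missing the OpenMP flag, although a thread limit such as `OMP_NUM_THREADS=1` in the environment would give the same count.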