global placement slows down when more threads are added
Description
I'm seeing global placement take a lot of time, and it appears to be related to multithreading not working well: it gets MUCH slower as the number of threads is increased.
This gets much worse when lots of flows are running in parallel on a machine and -threads 0 is used. Previous experiments showed that detailed routing processes that each used all threads did not particularly reduce the throughput (total running time) of build jobs, but in global placement there appears to be a very significant slowdown when many global placement processes all use all available threads.
To reproduce, untar and run gpl-slow-multithreaded.tar.gz.
I'm confused as to what is going on, because the running times don't make sense and they vary. I have seen wall clock times of 40 vs. 18 seconds when running with 48 threads.
Measure for 1, 2, 4, 16, 32, 48 threads:
$ echo -e "Threads\tReal\tUser"; for t in 1 2 4 16 32 48; do echo -ne "$t\t"; /usr/bin/time -f "%E\t%U" ./run-me-top-asap7-megaboom.sh -threads $t 2>&1 | tail -n 1; done
Two standalone runs; here 48 threads take 40 seconds vs. 18 seconds in my table above:
$ time ./run-me-top-asap7-megaboom.sh -threads 1
OpenROAD v2.0-20664-g359623a968
[deleted]
real 0m14,574s
user 0m14,425s
sys 0m0,143s
$ time ./run-me-top-asap7-megaboom.sh -threads 48
OpenROAD v2.0-20664-g359623a968
[deleted]
real 0m40,487s
user 24m0,317s
sys 0m5,748s
A perf profile of the slow multithreaded run is dominated by libgomp:
Overhead  Shared Object     Symbol
  75,86%  libgomp.so.1.0.0  [.] 0x0000000000020600
  12,73%  libgomp.so.1.0.0  [.] 0x00000000000207b8
   2,52%  [kernel]          [k] 0xffffffffa46330a3
   0,63%  openroad          [.] gpl::NesterovBase::getDensityGradient(gpl::GCell const*) const
   0,50%  openroad          [.] gpl::BinGrid::updateBinsGCellDensityArea(std::vector<gpl::GCellHandle, std::allocator<gpl::GCellHandle>
   0,49%  openroad          [.] gpl::NesterovBase::updateDensityForceBin() [clone ._omp_fn.0]
   0,44%  libgomp.so.1.0.0  [.] 0x000000000002060b
   0,42%  openroad          [.] gpl::FFT::updateDensity(int, int, float)
   0,41%  libgomp.so.1.0.0  [.] 0x0000000000020606
   0,33%  openroad          [.] gpl::FFT::getElectroForce(int, int) const
   0,30%  openroad          [.] gpl::NesterovBase::updateDensityForceBin() [clone ._omp_fn.1]
   0,24%  [kernel]          [k] 0xffffffffa46327f1
   0,23%  libgomp.so.1.0.0  [.] 0x000000000002060d
   0,21%  libgomp.so.1.0.0  [.] 0x0000000000020602
Suggested Solution
Figure out what is going on and use the optimal number of threads
Additional Context
No response
@gudeh I would guess it is some of the omp parallel blocks that are over small amounts of work. Please try to narrow down where it happens.
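For illustration, here is a minimal C++ sketch (not OpenROAD code) of the kind of pattern being suggested: an OpenMP loop whose per-iteration work is so cheap that libgomp's fork/join and barrier overhead dominates, which would match the profile above. A common mitigation is OpenMP's if() clause with a work-size threshold; the function name and the threshold below are made-up placeholders.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-bin update such as gpl's density/force loops.
// With only a few thousand trivially cheap iterations, spawning 48 threads
// costs far more than the loop body itself.
void updateBins(std::vector<float>& bins)
{
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(bins.size());

  // Only enter the parallel region when there is enough work to amortize the
  // threading overhead. The 10000 threshold is an arbitrary assumption and
  // would need tuning against real bin counts.
  #pragma omp parallel for if (n > 10000)
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    bins[i] *= 0.5f;  // trivially cheap body
  }
}
```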
I used your script to run nangate45/black_parrot, which is larger at 255K instances versus the design in the test case (megaboom) with 159 instances.
| Threads | Real (m:ss) | User (s) |
|---|---|---|
| 1 | 10:23.03 | 621.52 |
| 2 | 10:22.82 | 621.30 |
| 4 | 10:21.61 | 620.13 |
| 16 | 10:18.22 | 616.77 |
| 32 | 10:28.28 | 626.79 |
| 48 | 10:33.97 | 632.40 |
I believe this makes sense because my machine has 8 cores:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU @ 2.80GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 8
The gains are extremely low, though. I am going to investigate further; I have already noticed at least two OMP loops that parallelize too little work.
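One way to narrow this down, as a rough sketch (a hypothetical standalone harness, not part of OpenROAD): time the same kind of parallel region with 1 thread and with all available threads and compare. Regions that get slower as threads are added are the ones doing too little work; the loop body here is just a cheap stand-in.

```cpp
#include <cstdio>
#include <omp.h>
#include <vector>

// Time one OpenMP region with a given thread count (compile with -fopenmp).
static double timeRegion(int threads, std::vector<float>& bins)
{
  omp_set_num_threads(threads);
  const double t0 = omp_get_wtime();
  #pragma omp parallel for
  for (long i = 0; i < static_cast<long>(bins.size()); ++i) {
    bins[i] = bins[i] * 1.0001f + 0.5f;  // cheap stand-in for the real work
  }
  return omp_get_wtime() - t0;
}

int main()
{
  std::vector<float> bins(4096, 1.0f);  // deliberately small amount of work
  const double t1 = timeRegion(1, bins);
  const double tn = timeRegion(omp_get_num_procs(), bins);
  std::printf("1 thread: %.6f s, %d threads: %.6f s\n",
              t1, omp_get_num_procs(), tn);
  return 0;
}
```

On a small input like this, the many-thread version is typically no faster and often slower; that is the signature to look for in the real loops.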
I also ran nangate45/swerv with vtune, which also shows that we seem to do a poor job on multi-threading so far.
I ran our test-CI with different thread counts and made a script to fetch gpl runtimes from the logs. These are the runtimes with different thread counts for the 18 largest designs, ordered by the number of instances. I used a line plot to try to see a trend.
And here for all designs:
Considering this experiment, 16 threads seems to be the best among 1, 8, 16, and 32 threads.
@gudeh @maliberty Considering build systems where lots of builds are running in parallel, what is the best number of threads?
Would 2 threads speed things up while minimizing thrashing?
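If the answer ends up being a small fixed cap, here is a minimal sketch of that policy, assuming gpl's parallel regions are plain OpenMP (the cap of 2 is taken from the question above, not a measured optimum, and setPlacementThreads is a made-up name):

```cpp
#include <algorithm>
#include <omp.h>

// Hypothetical policy: when many independent flows share one machine, cap
// each process at a small thread count instead of using all cores
// ("-threads 0" behavior).
void setPlacementThreads(int requested)
{
  const int cap = 2;                      // assumed per-process cap from the discussion
  const int avail = omp_get_num_procs();  // logical CPUs visible to this process
  const int wanted = requested > 0 ? requested : avail;  // 0 means "use all"
  omp_set_num_threads(std::min({wanted, avail, cap}));
}
```

The same effect could of course be had from the outside by passing -threads 2 in the flow scripts.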
I don't know what the units are above, minutes?
It could be worth sorting on ascending running time :-)
Reordered by running time. Units are in minutes.
Do these tests run alone in serial or are they running all at the same time on the same computer?
My concern is that system performance degrades badly when there are too many threads from independent builds doing global placement, or when global placement is competing with e.g. detailed routing.
That's a valid concern! I am not sure how it is run; I will try to find out. For these runs, though, I ran make place only.
An experiment showing the effect of overloading the system:
echo -e "Instances\tReal\tUser"
for t in 1 2 4 8; do
echo -ne "$t\t"
/usr/bin/time -f "%E\t%U" bash -c "
for i in \$(seq 1 $t); do
./run-me-top-asap7-megaboom.sh -threads $(nproc) &
done
wait
" 2>&1 | tail -n 1
done
We can see that the running time per instance increases significantly, so overloading the system, as expected, reduces build throughput:
| Instances | Total user time (s) | User time per instance (s) |
|---|---|---|
| 1 | 12.52 | 12.52 |
| 2 | 31.87 | 15.94 |
| 4 | 98.04 | 24.51 |
| 8 | 346.02 | 43.25 |
@gudeh What is your conclusion?
Hi @oharboe, I got really puzzled after running the CI test: 2 threads showed less runtime than the other thread counts I tried previously, and 3, 4, and 5 threads gave almost the same results as 2 threads.
Lately I have been focusing on other gpl tasks, but I still want to understand what is going on.
Measure twice and cut once as they say... I guess there's some thrashing of caches or CPU resources such that in a CI setting this can be hard to understand.