
global placement slows down when more threads are added

Open oharboe opened this issue 8 months ago • 12 comments

Description

I'm seeing global placement take a lot of time, and it appears to be related to multithreading not working well: it gets MUCH slower as the number of threads is increased.

This appears to get much worse when lots of flows run in parallel on one machine and -threads 0 is used. Previous experiments showed that multiple detailed routing processes each using all threads did not particularly reduce the throughput (total running time) of build jobs, but here in global placement there is a very significant slowdown when many global placement processes all use all available threads.

To reproduce, untar gpl-slow-multithreaded.tar.gz and run the included script.

I'm confused about what is going on, because the running times don't make sense and they vary: I have seen wall clock times of 40 vs. 18 seconds when running with 48 threads.

Measurements for 1, 2, 4, 16, 32, 48 threads:

$ echo -e "Threads\tReal\tUser"; for t in 1 2 4 16 32 48; do echo -ne "$t\t"; /usr/bin/time -f "%E\t%U" ./run-me-top-asap7-megaboom.sh -threads $t 2>&1 | tail -n 1; done

Image

Two standalone runs; here 48 threads take 40 seconds vs. 18 seconds in my table above:

$ time ./run-me-top-asap7-megaboom.sh -threads 1
OpenROAD v2.0-20664-g359623a968 
[deleted]
real	0m14,574s
user	0m14,425s
sys	0m0,143s
$ time ./run-me-top-asap7-megaboom.sh -threads 48
OpenROAD v2.0-20664-g359623a968 
[deleted]
real	0m40,487s
user	24m0,317s
sys	0m5,748s
Overhead  Shared Object                            Symbol
  75,86%  libgomp.so.1.0.0                         [.] 0x0000000000020600
  12,73%  libgomp.so.1.0.0                         [.] 0x00000000000207b8
   2,52%  [kernel]                                 [k] 0xffffffffa46330a3
   0,63%  openroad                                 [.] gpl::NesterovBase::getDensityGradient(gpl::GCell const*) const
   0,50%  openroad                                 [.] gpl::BinGrid::updateBinsGCellDensityArea(std::vector<gpl::GCellHandle, std::allocator<gpl::GCellHandle>
   0,49%  openroad                                 [.] gpl::NesterovBase::updateDensityForceBin() [clone ._omp_fn.0]
   0,44%  libgomp.so.1.0.0                         [.] 0x000000000002060b
   0,42%  openroad                                 [.] gpl::FFT::updateDensity(int, int, float)
   0,41%  libgomp.so.1.0.0                         [.] 0x0000000000020606
   0,33%  openroad                                 [.] gpl::FFT::getElectroForce(int, int) const
   0,30%  openroad                                 [.] gpl::NesterovBase::updateDensityForceBin() [clone ._omp_fn.1]
   0,24%  [kernel]                                 [k] 0xffffffffa46327f1
   0,23%  libgomp.so.1.0.0                         [.] 0x000000000002060d
   0,21%  libgomp.so.1.0.0                         [.] 0x0000000000020602

Image
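The libgomp samples dominating the profile look like threads busy-waiting in the OpenMP runtime rather than doing placement work. A quick way to test that hypothesis (a sketch on my side, not verified) is to tell GNU OpenMP to sleep at barriers instead of spinning and see whether the libgomp share shrinks:

# OMP_WAIT_POLICY is standard OpenMP; GOMP_SPINCOUNT is GNU libgomp-specific.
# If the slowdown is spin-waiting, the libgomp entries should shrink here.
OMP_WAIT_POLICY=passive GOMP_SPINCOUNT=0 \
    perf record -- ./run-me-top-asap7-megaboom.sh -threads 48
perf report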

Suggested Solution

Figure out what is going on and use the optimal number of threads

Additional Context

No response

oharboe avatar Apr 16 '25 08:04 oharboe

@gudeh I would guess it is some of the omp parallel blocks that are over small amounts of work. Please try to narrow down where it happens.
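One possible way to narrow it down (a sketch; it assumes the openroad binary carries enough symbol/unwind information for DWARF call graphs):

# Record call graphs so the time spent inside libgomp can be attributed to
# the gpl:: functions that opened the parallel regions.
perf record --call-graph dwarf -- ./run-me-top-asap7-megaboom.sh -threads 48
perf report    # expand the hot libgomp entries to see which gpl:: code reaches them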

maliberty avatar Apr 16 '25 14:04 maliberty

I used your script to run nangate45/black_parrot, which is larger (255K instances) than the design from the experiment (megaboom, 159K instances).

Threads	    Real (m:s)	 User (s)
1	    10:23.03	621.52
2	    10:22.82	621.30
4	    10:21.61	620.13
16	    10:18.22	616.77
32	    10:28.28	626.79
48	    10:33.97	632.40

I believe this makes sense because my machine has 8 cores:

:~/ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) CPU @ 2.80GHz
    CPU family:           6
    Model:                85
    Thread(s) per core:   2
    Core(s) per socket:   8

The gains are extremely low, though. I am going to investigate further; I noticed at least two OMP loops that are too simple (too little work to be worth parallelizing).

gudeh avatar Apr 28 '25 15:04 gudeh

I also ran nangate45/swerv with VTune, which also shows that we seem to do a poor job on multi-threading so far:

Image

gudeh avatar Apr 28 '25 15:04 gudeh

I ran our CI tests with different thread counts and made a script to fetch the gpl runtimes from the logs. These are the runtimes for the 18 largest designs, ordered by number of instances. I used a line plot to try to see a trend. Image

And here for all designs: Image

Based on this experiment, 16 threads seems to be the best among 1, 8, 16, and 32 threads.

gudeh avatar Apr 29 '25 12:04 gudeh

@gudeh @maliberty Considering build systems where lots of builds are running in parallel, what is the best number of threads?

Would 2 threads speed things up while minimizing thrashing?

I don't know what the units above are; minutes?

It could be worth sorting by ascending running time :-)
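One way to attack the oversubscription case might be to split the cores evenly across the concurrent flows instead of giving each flow -threads 0. A rough sketch reusing the script from this issue (the flow count and the even split are assumptions, not a measured optimum):

CONCURRENT_FLOWS=8
THREADS_PER_FLOW=$(( $(nproc) / CONCURRENT_FLOWS ))
(( THREADS_PER_FLOW < 1 )) && THREADS_PER_FLOW=1
for i in $(seq 1 "$CONCURRENT_FLOWS"); do
    # each flow gets an equal share of the cores instead of all of them
    ./run-me-top-asap7-megaboom.sh -threads "$THREADS_PER_FLOW" &
done
wait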

oharboe avatar Apr 29 '25 12:04 oharboe

Reordered by running time. Units are in minutes.

Image

Image

gudeh avatar Apr 29 '25 13:04 gudeh

Do these tests run one at a time, or do they all run at the same time on the same computer?

My concern is that system performance collapses when too many threads from independent builds are doing global placement, or when global placement is competing with e.g. detailed routing.

oharboe avatar Apr 29 '25 13:04 oharboe

Do these tests run one at a time, or do they all run at the same time on the same computer?

My concern is that system performance collapses when too many threads from independent builds are doing global placement, or when global placement is competing with e.g. detailed routing.

That's a valid concern! I am not sure how it is run; I will try to find out. For these runs, though, I did make place only.

gudeh avatar Apr 29 '25 13:04 gudeh

Experiment showing the effect of overloading the system:

echo -e "Instances\tReal\tUser"
for t in 1 2 4 8; do
    echo -ne "$t\t"
    /usr/bin/time -f "%E\t%U" bash -c "
        for i in \$(seq 1 $t); do
            ./run-me-top-asap7-megaboom.sh -threads $(nproc) &
        done
        wait
    " 2>&1 | tail -n 1
done

We can see that the running time per instance increases significantly, so overloading the system, as expected, reduces build throughput:

Instances    User (s)    User per instance (s)
1            12.52       12.52
2            31.87       15.94
4            98.04       24.51
8            346.02      43.25

oharboe avatar Apr 30 '25 14:04 oharboe

@gudeh What is your conclusion?

oharboe avatar May 07 '25 10:05 oharboe

@gudeh What is your conclusion?

Hi @oharboe, I got really puzzled after running the CI test: 2 threads showed lower runtime than the other thread counts I had tried previously. 3, 4 and 5 threads gave almost the same result as 2 threads.

Lately I have been focusing on other gpl tasks, but I still want to understand what is going on.

gudeh avatar May 07 '25 13:05 gudeh

@gudeh What is your conclusion?

Hi @oharboe, I got really puzzled after running the CI test: 2 threads showed lower runtime than the other thread counts I had tried previously. 3, 4 and 5 threads gave almost the same result as 2 threads.

Lately I have been focusing on other gpl tasks, but I still want to understand what is going on.

Measure twice and cut once, as they say... I guess there is some thrashing of caches or CPU resources that makes this hard to understand in a CI setting.

oharboe avatar May 07 '25 13:05 oharboe