
global placement slows down when more threads are added

Open oharboe opened this issue 8 months ago • 12 comments

Description

I'm seeing global placement take a lot of time, and it appears to be related to multithreading not working well: it gets MUCH slower as the number of threads is increased.

This appears to get much worse when lots of flows run in parallel on one machine and -threads 0 is used. Previous experiments showed that multiple detailed routing processes each using all threads did not particularly reduce the throughput (total running time) of build jobs, but here in global placement there is a very significant slowdown when many global placement processes all use all available threads.

To reproduce, untar gpl-slow-multithreaded.tar.gz and run the included script.

I'm confused about what is going on, because the running times don't make sense and they vary: I have seen wall clock times of 40 vs. 18 seconds when running with 48 threads.

Measurements for 1, 2, 4, 16, 32, 48 threads:

$ echo -e "Threads\tReal\tUser"; for t in 1 2 4 16 32 48; do echo -ne "$t\t"; /usr/bin/time -f "%E\t%U" ./run-me-top-asap7-megaboom.sh -threads $t 2>&1 | tail -n 1; done

Image

Two standalone runs; here 48 threads take 40 seconds vs. 18 seconds in my table above:

$ time ./run-me-top-asap7-megaboom.sh -threads 1
OpenROAD v2.0-20664-g359623a968 
[deleted]
real	0m14,574s
user	0m14,425s
sys	0m0,143s
$ time ./run-me-top-asap7-megaboom.sh -threads 48
OpenROAD v2.0-20664-g359623a968 
[deleted]
real	0m40,487s
user	24m0,317s
sys	0m5,748s
Overhead  Shared Object                            Symbol
  75,86%  libgomp.so.1.0.0                         [.] 0x0000000000020600
  12,73%  libgomp.so.1.0.0                         [.] 0x00000000000207b8
   2,52%  [kernel]                                 [k] 0xffffffffa46330a3
   0,63%  openroad                                 [.] gpl::NesterovBase::getDensityGradient(gpl::GCell const*) const
   0,50%  openroad                                 [.] gpl::BinGrid::updateBinsGCellDensityArea(std::vector<gpl::GCellHandle, std::allocator<gpl::GCellHandle>
   0,49%  openroad                                 [.] gpl::NesterovBase::updateDensityForceBin() [clone ._omp_fn.0]
   0,44%  libgomp.so.1.0.0                         [.] 0x000000000002060b
   0,42%  openroad                                 [.] gpl::FFT::updateDensity(int, int, float)
   0,41%  libgomp.so.1.0.0                         [.] 0x0000000000020606
   0,33%  openroad                                 [.] gpl::FFT::getElectroForce(int, int) const
   0,30%  openroad                                 [.] gpl::NesterovBase::updateDensityForceBin() [clone ._omp_fn.1]
   0,24%  [kernel]                                 [k] 0xffffffffa46327f1
   0,23%  libgomp.so.1.0.0                         [.] 0x000000000002060d
   0,21%  libgomp.so.1.0.0                         [.] 0x0000000000020602

Image
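The libgomp samples dominating the profile look like threads busy-waiting in the OpenMP runtime rather than doing placement work. A quick way to test that hypothesis (a sketch on my side, not verified) is to tell GNU OpenMP to sleep at barriers instead of spinning and see whether the libgomp share shrinks:

# OMP_WAIT_POLICY is standard OpenMP; GOMP_SPINCOUNT is GNU libgomp-specific.
# If the slowdown is spin-waiting, the libgomp entries should shrink here.
OMP_WAIT_POLICY=passive GOMP_SPINCOUNT=0 \
    perf record -- ./run-me-top-asap7-megaboom.sh -threads 48
perf report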

Suggested Solution

Figure out what is going on and use the optimal number of threads

Additional Context

No response

oharboe avatar Apr 16 '25 08:04 oharboe

@gudeh I would guess it is some of the omp parallel blocks that are over small amounts of work. Please try to narrow down where it happens.
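One possible way to narrow it down (a sketch; it assumes the openroad binary carries enough symbol/unwind information for DWARF call graphs):

# Record call graphs so the time spent inside libgomp can be attributed to
# the gpl:: functions that opened the parallel regions.
perf record --call-graph dwarf -- ./run-me-top-asap7-megaboom.sh -threads 48
perf report    # expand the hot libgomp entries to see which gpl:: code reaches them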

maliberty avatar Apr 16 '25 14:04 maliberty

I used your script to run nangate45/black_parrot, which is larger (255K instances) than the design from the experiment (megaboom, 159K instances).

Threads	    Real (m:s)	 User (s)
1	    10:23.03	621.52
2	    10:22.82	621.30
4	    10:21.61	620.13
16	    10:18.22	616.77
32	    10:28.28	626.79
48	    10:33.97	632.40

I believe this makes sense because my machine has 8 cores:

:~/ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) CPU @ 2.80GHz
    CPU family:           6
    Model:                85
    Thread(s) per core:   2
    Core(s) per socket:   8

The gains are extremely low, though. I am going to investigate further; I noticed at least two OMP loops that are too simple (too little work to be worth parallelizing).

gudeh avatar Apr 28 '25 15:04 gudeh

I also ran nangate45/swerv with VTune, which also shows that we seem to do a poor job on multi-threading so far:

Image

gudeh avatar Apr 28 '25 15:04 gudeh

I ran our CI tests with different thread counts and made a script to fetch the gpl runtimes from the logs. These are the runtimes for the 18 largest designs, ordered by number of instances. I used a line plot to try to see a trend. Image

And here for all designs: Image

Based on this experiment, 16 threads seems to be the best among 1, 8, 16, and 32 threads.

gudeh avatar Apr 29 '25 12:04 gudeh

@gudeh @maliberty Considering build systems where lots of builds are running in parallel, what is the best number of threads?

Would 2 threads speed things up while minimizing thrashing?

I don't know what the units above are; minutes?

It could be worth sorting by ascending running time :-)
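One way to attack the oversubscription case might be to split the cores evenly across the concurrent flows instead of giving each flow -threads 0. A rough sketch reusing the script from this issue (the flow count and the even split are assumptions, not a measured optimum):

CONCURRENT_FLOWS=8
THREADS_PER_FLOW=$(( $(nproc) / CONCURRENT_FLOWS ))
(( THREADS_PER_FLOW < 1 )) && THREADS_PER_FLOW=1
for i in $(seq 1 "$CONCURRENT_FLOWS"); do
    # each flow gets an equal share of the cores instead of all of them
    ./run-me-top-asap7-megaboom.sh -threads "$THREADS_PER_FLOW" &
done
wait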

oharboe avatar Apr 29 '25 12:04 oharboe

Reordered by running time. Units are in minutes.

Image

Image

gudeh avatar Apr 29 '25 13:04 gudeh

Do these tests run one at a time, or do they all run at the same time on the same computer?

My concern is that system performance collapses when too many threads from independent builds are doing global placement, or when global placement is competing with e.g. detailed routing.

oharboe avatar Apr 29 '25 13:04 oharboe

Do these tests run one at a time, or do they all run at the same time on the same computer?

My concern is that system performance collapses when too many threads from independent builds are doing global placement, or when global placement is competing with e.g. detailed routing.

That's a valid concern! I am not sure how it is run; I will try to find out. For these runs, though, I did make place only.

gudeh avatar Apr 29 '25 13:04 gudeh

Experiment showing the effect of overloading the system:

echo -e "Instances\tReal\tUser"
for t in 1 2 4 8; do
    echo -ne "$t\t"
    /usr/bin/time -f "%E\t%U" bash -c "
        for i in \$(seq 1 $t); do
            ./run-me-top-asap7-megaboom.sh -threads $(nproc) &
        done
        wait
    " 2>&1 | tail -n 1
done

We can see that the running time per instance increases significantly, so overloading the system, as expected, reduces build throughput:

Instances    User (s)    User per instance (s)
1            12.52       12.52
2            31.87       15.94
4            98.04       24.51
8            346.02      43.25

oharboe avatar Apr 30 '25 14:04 oharboe

@gudeh What is your conclusion?

oharboe avatar May 07 '25 10:05 oharboe

@gudeh What is your conclusion?

Hi @oharboe, I got really puzzled after running the CI test: 2 threads showed lower runtime than the other thread counts I had tried previously. 3, 4 and 5 threads gave almost the same result as 2 threads.

Lately I have been focusing on other gpl tasks, but I still want to understand what is going on.

gudeh avatar May 07 '25 13:05 gudeh

@gudeh What is your conclusion?

Hi @oharboe, I got really puzzled after running the CI test: 2 threads showed lower runtime than the other thread counts I had tried previously. 3, 4 and 5 threads gave almost the same result as 2 threads.

Lately I have been focusing on other gpl tasks, but I still want to understand what is going on.

Measure twice and cut once, as they say... I guess there is some thrashing of caches or CPU resources that makes this hard to understand in a CI setting.

oharboe avatar May 07 '25 13:05 oharboe