Improve memory usage of PDN

Open oharboe opened this issue 1 year ago • 17 comments

Describe the problem

  1. unzip https://drive.google.com/file/d/1_yIOwyTIN9uo6PW_HqJT5NjnieWiBeMx/view?usp=sharing
  2. execute command below
./run-me-DigitalTop-asap7-base.sh

Uses lots of memory:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                          
2402025 oyvind    20   0   49,9g  49,4g  57216 R  99,7  78,7   5:23.24 openroad      

Runs 'forever' if it starts to use swap.
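
To put a number on the peak memory instead of watching top, the script can be run through a small wrapper. This is only a sketch: the helper name `peak_rss_gib` is made up for illustration, and it relies on Linux reporting `ru_maxrss` in KiB.

```python
import resource
import subprocess
import sys


def peak_rss_gib(cmd):
    """Run cmd to completion; return (exit code, peak resident set size in GiB).

    RUSAGE_CHILDREN covers child processes that have been waited for,
    including grandchildren such as an openroad process started by a script.
    On Linux, ru_maxrss is the largest single-process RSS, in KiB.
    """
    proc = subprocess.run(cmd)
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return proc.returncode, peak_kib / (1024 ** 2)


# Example with a trivial command; replace the list with
# ["./run-me-DigitalTop-asap7-base.sh"] to measure the PDN run.
code, gib = peak_rss_gib([sys.executable, "-c", "x = bytearray(10_000_000)"])
print(f"exit code {code}, peak RSS {gib:.3f} GiB")
```

The figure is only meaningful if the run finishes or is killed; a run that is thrashing in swap will report a resident size capped by physical memory.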

Output:

OpenROAD v2.0-11584-gdfb48568b
[...]
[didn't wait for completion, aborted]

die area, 2000um x 2000um:

image

Expected Behavior

Should complete in some reasonable amount of time. Fast PDN is very useful because it is part of iterating on the floorplan.

Environment

OpenROAD v2.0-11584-gdfb48568b

To Reproduce

See above.

oharboe avatar Dec 27 '23 22:12 oharboe

Some stack snapshots (taken by suspending and resuming the process) up to and including:

OpenROAD v2.0-11584-gdfb48568b 
This program is licensed under the BSD-3 license. See the LICENSE file for details.
Components of this program may be licensed under more restrictive licenses which must be honored.
[INFO PDN-0001] Inserting grid: top

image

image

oharboe avatar Dec 27 '23 23:12 oharboe

Some more stack snapshots (suspend/resume), taken after:

OpenROAD v2.0-11584-gdfb48568b 
This program is licensed under the BSD-3 license. See the LICENSE file for details.
Components of this program may be licensed under more restrictive licenses which must be honored.
[INFO PDN-0001] Inserting grid: top

image

image

oharboe avatar Dec 27 '23 23:12 oharboe

After ca. 6 hours, the following output:

[INFO PDN-0001] Inserting grid: ElementGrid - tile_prci_domain/tile_reset_domain/boom_tile/dcache
[INFO PDN-0001] Inserting grid: ElementGrid - tile_prci_domain/tile_reset_domain/boom_tile/frontend/bpd
[INFO PDN-0001] Inserting grid: ElementGrid - tile_prci_domain/tile_reset_domain/boom_tile/core/FpPipeline
[INFO PDN-0001] Inserting grid: ElementGrid - tile_prci_domain/tile_reset_domain/boom_tile/frontend/icache

image

oharboe avatar Dec 28 '23 06:12 oharboe

With OpenROAD v2.0-11595-g31d7e3dc5

it took under 1hr:

real	52m24.020s

so I can't reproduce your issue.

maliberty avatar Dec 28 '23 06:12 maliberty

> With OpenROAD v2.0-11595-g31d7e3dc5
>
> it took under 1hr:
>
> real	52m24.020s
>
> so I can't reproduce your issue.

What kind of machine do you have? How much L3 cache do you have?

oharboe avatar Dec 28 '23 06:12 oharboe

This is the PC I used. I will try again with OpenROAD v2.0-11595-g31d7e3dc5

Quite a large memory footprint for 16 MB of L3...

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                          
2402025 oyvind    20   0   49,9g  49,4g  57216 R  99,7  78,7   5:23.24 openroad      

The run was terminated (along with other apps), probably because the machine ran out of swap space. It happened twice, so I'm fairly sure that is the cause.

$ swapon --show
NAME      TYPE SIZE USED PRIO
/swapfile file 128G 1,3G   -2
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi       2,2Gi        60Gi        55Mi       1,2Gi        60Gi
Swap:          127Gi       1,3Gi       126Gi
$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i9-9900KF CPU @ 3.60GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            13
    CPU(s) scaling MHz:  84%
    CPU max MHz:         5000,0000
    CPU min MHz:         800,0000
    BogoMIPS:            7200,00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
                          dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_
                         tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmp
                         erf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pd
                         cm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
                          rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb
                          stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjus
                         t bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsav
                         eopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_windo
                         w hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    2 MiB (8 instances)
  L3:                    16 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:         
  Gather data sampling:  Mitigation; Microcode
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; Enhanced IBRS
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIB
                         RS SW sequence
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Mitigation; TSX disabled

oharboe avatar Dec 28 '23 06:12 oharboe

Will try on this machine...

I don't expect much better luck, as PDN is running out of memory and into swap space here as well; this machine has the same amount of physical memory as my feebler one.

$ swapon --show
NAME       TYPE SIZE  USED PRIO
/swapfile  file   2G    2G   -2
/swapfile1 file 256G 95,9G   -3
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi        61Gi       493Mi       3,4Mi       1,5Gi       1,3Gi
Swap:          257Gi        97Gi       160Gi
$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  48
  On-line CPU(s) list:   0-47
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen Threadripper 3960X 24-Core Processor
    CPU family:          23
    Model:               49
    Thread(s) per core:  2
    Core(s) per socket:  24
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU(s) scaling MHz:  86%
    CPU max MHz:         3800,0000
    CPU min MHz:         2200,0000
    BogoMIPS:            7585,79
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
                          mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc
                          rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq moni
                         tor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm
                          cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs
                          skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb ca
                         t_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 
                         cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_ll
                         c cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbno
                         invd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid
                          decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl um
                         ip rdpid overflow_recov succor smca sev sev_es
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   768 KiB (24 instances)
  L1i:                   768 KiB (24 instances)
  L2:                    12 MiB (24 instances)
  L3:                    128 MiB (8 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-47
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Mitigation; untrained return thunk; SMT enabled with STIBP protection
  Spec rstack overflow:  Mitigation; safe RET
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-e
                         IBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
$ time ./run-me-DigitalTop-asap7-base.sh 
OpenROAD v2.0-11595-g31d7e3dc5 
This program is licensed under the BSD-3 license. See the LICENSE file for details.
Components of this program may be licensed under more restrictive licenses which must be honored.

oharboe avatar Dec 28 '23 06:12 oharboe

It's fairly old:

% lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    1
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping:              4
CPU MHz:               1068.164
CPU max MHz:           3700.0000
CPU min MHz:           1000.0000
BogoMIPS:              4800.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              28160K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities

maliberty avatar Dec 28 '23 07:12 maliberty

@maliberty Relabeled issue as a feature request

oharboe avatar Dec 28 '23 07:12 oharboe

@maliberty Ca. 2x as much L3 as I have. How much RAM do you have? I believe the reason it completes in a reasonable amount of time is that you have a lot of memory...

oharboe avatar Dec 28 '23 08:12 oharboe

I'm not going to do any additional tests; I think we have pretty convincing evidence that the problem is the surprisingly large memory consumption of PDN.

oharboe avatar Dec 28 '23 08:12 oharboe

When I reduce the size of the floorplan from 2000um x 2000um to 1000um x 1000um, PDN completes in 3 minutes or so, compared to never completing. Memory consumption is well below 10 GB as far as I can see, so memory consumption seems to increase more than proportionally to area.
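
A rough check of the "more than proportionally" claim, using only the two figures reported in this thread. It assumes the two runs are otherwise comparable, treats the 49.4 g resident size from top as a lower bound for the 2000um case (the run was still growing), and treats 10 GB as an upper bound for the 1000um case, so the result is a lower bound and not a measurement.

```python
import math

# Die areas in um^2
area_small = 1000 * 1000
area_large = 2000 * 2000

# Memory figures from the thread (GB and GiB are treated as interchangeable here)
mem_small_max = 10.0   # upper bound: "well below 10 GB"
mem_large_min = 49.4   # lower bound: resident size seen in top before swapping

area_ratio = area_large / area_small          # 4x the area
mem_ratio_min = mem_large_min / mem_small_max  # at least this much more memory

# If memory ~ area**k, then k >= log(mem_ratio_min) / log(area_ratio)
exponent_min = math.log(mem_ratio_min) / math.log(area_ratio)

print(f"area ratio: {area_ratio:.0f}x")
print(f"memory ratio: at least {mem_ratio_min:.2f}x")
print(f"implied exponent k: at least {exponent_min:.2f}")
```

Any exponent above 1 means memory grows faster than area; the true value is likely higher, since both bounds are conservative.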

oharboe avatar Dec 28 '23 10:12 oharboe

@oharboe how were some of these LEFs generated? Some of them are hundreds of MB, with 4M+ shapes for the power pins (exposing all the vias internally does not give the tools much abstraction). Historically this has caused issues with memory use. I'm not sure why switching from 2000 -> 1000 speeds things up by that much. I'm not sure there is much PDN can really do (other than some basic filtering on the shapes) when the abstract views are very detailed. I can take a look to see if there is anything else that could help, but I would take a look at the LEFs to see why they are so large.
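
One way to see which pins carry the bulk of the shapes is to count RECT and POLYGON statements per pin. The following is a minimal sketch, not a full LEF parser: the function name `count_pin_shapes` is made up for illustration, and it assumes one statement per line, which is how LEF writers typically format their output.

```python
from collections import Counter


def count_pin_shapes(lines):
    """Count RECT/POLYGON statements per (macro, pin) in LEF text.

    `lines` is any iterable of strings, e.g. an open file, so a
    multi-hundred-MB LEF is streamed and never loaded into memory at once.
    Shapes outside a PIN block (OBS, via definitions) are ignored.
    """
    counts = Counter()
    macro = pin = None
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        key = tokens[0]
        if key == "MACRO" and len(tokens) > 1:
            macro, pin = tokens[1], None
        elif key == "PIN" and macro is not None and len(tokens) > 1:
            pin = tokens[1]
        elif key == "END" and len(tokens) > 1:
            # "END <pin>" closes the pin; "END <macro>" closes the macro.
            if pin is not None and tokens[1] == pin:
                pin = None
            elif tokens[1] == macro:
                macro = None
        elif key in ("RECT", "POLYGON") and pin is not None:
            counts[(macro, pin)] += 1
    return counts


# Usage sketch (the path is a placeholder):
# with open("some_macro.lef") as f:
#     for (macro, pin), n in count_pin_shapes(f).most_common(10):
#         print(f"{macro}/{pin}: {n} shapes")
```

A power pin with millions of shapes in this listing would point at the abstract exposing the whole internal grid.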

gadfort avatar Dec 30 '23 21:12 gadfort

> I can take a look to see if there is anything else that could help, but I would take a look at the LEFs to see why they are so large.

Which LEFs are you interested in?

They are from this project https://github.com/The-OpenROAD-Project/megaboom

oharboe avatar Dec 31 '23 12:12 oharboe

@oharboe the BoomNonBlockingDCache LEF is 300 MB, for example. It looks like it is exposing the entire power grid and probably doesn't have power pins in the actual subblock.

gadfort avatar Dec 31 '23 16:12 gadfort

> @oharboe BoomNonBlockingDCache the LEF is 300MB for example, it looks like it is exposing the entire power grid and probably doesn't have power pins in the actual subblock.

I had to create the abstract just after the floorplan as I was working on the floorplan higher up.

Could that explain the unusual characteristics or would you expect the same after place, cts and route?

oharboe avatar Dec 31 '23 17:12 oharboe

@oharboe no, either the power grid specification didn't note the layer to add pins to or the LEF writing is doing something wrong. The later stages should not matter I think for the LEF generation.

gadfort avatar Dec 31 '23 17:12 gadfort

@oharboe this seems to have stalled out on the size of your LEF macros. Is there something here to pursue, as this doesn't seem to be a PDN issue?

maliberty avatar Mar 05 '24 23:03 maliberty

> @oharboe this seems to have stalled out on the size of your LEF macros. Is there something here to pursue, as this doesn't seem to be a PDN issue?

I think this stopped being a problem for me after I started mocking memories.

oharboe avatar Mar 06 '24 04:03 oharboe