OpenROAD
Improve memory usage of PDN
Describe the problem
- unzip https://drive.google.com/file/d/1_yIOwyTIN9uo6PW_HqJT5NjnieWiBeMx/view?usp=sharing
- execute command below
./run-me-DigitalTop-asap7-base.sh
Uses lots of memory:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2402025 oyvind 20 0 49,9g 49,4g 57216 R 99,7 78,7 5:23.24 openroad
Runs 'forever' if it starts to use swap.
Output:
OpenROAD v2.0-11584-gdfb48568b
[...]
[didn't wait for completion, aborted]
die area, 2000um x 2000um:
Expected Behavior
Should complete in some reasonable amount of time. Fast PDN is very useful because it is part of iterating on the floorplan.
Environment
OpenROAD v2.0-11584-gdfb48568b
To Reproduce
See above.
Some suspend resume stacks up to and including:
OpenROAD v2.0-11584-gdfb48568b
This program is licensed under the BSD-3 license. See the LICENSE file for details.
Components of this program may be licensed under more restrictive licenses which must be honored.
[INFO PDN-0001] Inserting grid: top
Some suspend resume snapshots after:
OpenROAD v2.0-11584-gdfb48568b
This program is licensed under the BSD-3 license. See the LICENSE file for details.
Components of this program may be licensed under more restrictive licenses which must be honored.
[INFO PDN-0001] Inserting grid: top
After ca. 6 hours, the following output:
[INFO PDN-0001] Inserting grid: ElementGrid - tile_prci_domain/tile_reset_domain/boom_tile/dcache
[INFO PDN-0001] Inserting grid: ElementGrid - tile_prci_domain/tile_reset_domain/boom_tile/frontend/bpd
[INFO PDN-0001] Inserting grid: ElementGrid - tile_prci_domain/tile_reset_domain/boom_tile/core/FpPipeline
[INFO PDN-0001] Inserting grid: ElementGrid - tile_prci_domain/tile_reset_domain/boom_tile/frontend/icache
With OpenROAD v2.0-11595-g31d7e3dc5
it took under 1hr:
real 52m24.020s
so I can't reproduce your issue.
What kind of machine do you have? How much L3 cache do you have?
This is the PC I used. I will try again with OpenROAD v2.0-11595-g31d7e3dc5
Quite a large memory footprint for a 16 MiB L3 cache...
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2402025 oyvind 20 0 49,9g 49,4g 57216 R 99,7 78,7 5:23.24 openroad
The run got terminated (as well as other apps) as the machine ran out of swap space, probably. It happened twice, so I'm pretty sure that was what happened.
$ swapon --show
NAME TYPE SIZE USED PRIO
/swapfile file 128G 1,3G -2
$ free -h
total used free shared buff/cache available
Mem: 62Gi 2,2Gi 60Gi 55Mi 1,2Gi 60Gi
Swap: 127Gi 1,3Gi 126Gi
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i9-9900KF CPU @ 3.60GHz
CPU family: 6
Model: 158
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 13
CPU(s) scaling MHz: 84%
CPU max MHz: 5000,0000
CPU min MHz: 800,0000
BogoMIPS: 7200,00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 2 MiB (8 instances)
L3: 16 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerabilities:
Gather data sampling: Mitigation; Microcode
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Mitigation; Enhanced IBRS
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Mitigation; Microcode
Tsx async abort: Mitigation; TSX disabled
Will try on this machine...
I don't expect much better luck as PDN is running out of memory and into swap space here as well, same amount of physical memory as on my feebler machine.
$ swapon --show
NAME TYPE SIZE USED PRIO
/swapfile file 2G 2G -2
/swapfile1 file 256G 95,9G -3
$ free -h
total used free shared buff/cache available
Mem: 62Gi 61Gi 493Mi 3,4Mi 1,5Gi 1,3Gi
Swap: 257Gi 97Gi 160Gi
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper 3960X 24-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU(s) scaling MHz: 86%
CPU max MHz: 3800,0000
CPU min MHz: 2200,0000
BogoMIPS: 7585,79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 768 KiB (24 instances)
L1i: 768 KiB (24 instances)
L2: 12 MiB (24 instances)
L3: 128 MiB (8 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-47
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Spec rstack overflow: Mitigation; safe RET
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
$ time ./run-me-DigitalTop-asap7-base.sh
OpenROAD v2.0-11595-g31d7e3dc5
This program is licensed under the BSD-3 license. See the LICENSE file for details.
Components of this program may be licensed under more restrictive licenses which must be honored.
It's fairly old:
% lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 1
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping: 4
CPU MHz: 1068.164
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 28160K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
@maliberty Relabeled issue as a feature request
@maliberty Ca. 2x as much L3 as I have. How much RAM do you have? I believe the reason it completes in a reasonable amount of time is that you have a lot of memory...
I'm not going to do any additional tests, I think we have pretty convincing evidence that the problem is the surprisingly large memory consumption of PDN.
When I reduce the size of the floorplan from 2000um x 2000um to 1000um x 1000um, PDN completes in 3 minutes or so, compared to never completing. Memory consumption is well below 10 GByte as far as I can see, so memory consumption seems to increase more than proportionally to area.
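A rough back-of-the-envelope check of that scaling claim, using only figures quoted in this thread. Both memory numbers are bounds rather than measured peaks: 49.4g is the resident size from the `top` snapshot of a run that later went into swap, and 10 is the stated ceiling for the smaller floorplan. The variable names are made up for illustration, and the snippet ignores the small GiB vs GByte difference.

```python
# Figures quoted in this thread; both memory values are bounds, not measured peaks.
side_large_um = 2000
side_small_um = 1000
mem_large_lower_bound = 49.4  # resident size from the top snapshot (run kept growing into swap)
mem_small_upper_bound = 10.0  # "well below 10 GByte" for the 1000um x 1000um floorplan

area_ratio = (side_large_um / side_small_um) ** 2
memory_ratio_lower_bound = mem_large_lower_bound / mem_small_upper_bound

print(f"area ratio: {area_ratio:.1f}x")
print(f"memory ratio: at least {memory_ratio_lower_bound:.2f}x")
print("more than proportional to area:", memory_ratio_lower_bound > area_ratio)
```

Since the large run was still growing when sampled and the small run stayed below its bound, the real ratio is presumably larger than this lower bound.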
@oharboe how were some of these LEFs generated? Some of them are hundreds of MB, with 4M+ shapes for the power pins (exposing all the vias internally does not give the tools much abstraction). This kind of detail has historically caused issues with memory use. I'm not sure why switching from 2000->1000 speeds things up by that much. I'm not sure there is much PDN can really do (other than some basic filtering on the shapes) when the abstract views are this detailed. I can take a look to see if there is anything else that could help, but I would take a look at the LEFs to see why they are so large.
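One way to see which macros carry that many pin shapes is to count the geometry statements per pin in each LEF. The following is a minimal sketch, not an OpenROAD API and not a full LEF parser: it assumes one statement per line, counts only `RECT` and `POLYGON` lines inside `PIN` blocks, and ignores `OBS` geometry. The function name `count_pin_shapes` and the sample macro are hypothetical.

```python
from collections import Counter


def count_pin_shapes(lines):
    """Count RECT/POLYGON statements per (macro, pin) in an iterable of LEF lines.

    Simplified: assumes one LEF statement per line and skips OBS geometry.
    """
    counts = Counter()
    macro = pin = None
    for raw in lines:
        tokens = raw.split()
        if not tokens:
            continue
        key = tokens[0].upper()
        if key == "MACRO" and len(tokens) > 1:
            macro, pin = tokens[1], None
        elif key == "PIN" and macro and len(tokens) > 1:
            pin = tokens[1]
        elif key == "END" and len(tokens) > 1:
            # "END <name>" closes a pin or a macro; a bare "END" closes PORT/OBS.
            if pin is not None and tokens[1] == pin:
                pin = None
            elif tokens[1] == macro:
                macro = None
        elif key in ("RECT", "POLYGON") and macro and pin:
            counts[(macro, pin)] += 1
    return counts


# Hypothetical miniature LEF, only to exercise the counter.
sample = """\
MACRO demo_macro
  PIN VDD
    USE POWER ;
    PORT
      LAYER M5 ;
        RECT 0 0 1 1 ;
        RECT 2 0 3 1 ;
    END
  END VDD
  PIN A
    PORT
      LAYER M2 ;
        RECT 0 0 0.1 0.1 ;
    END
  END A
  OBS
    LAYER M1 ;
      RECT 0 0 5 5 ;
  END
END demo_macro
"""

counts = count_pin_shapes(sample.splitlines())
for (macro_name, pin_name), n in counts.most_common():
    print(macro_name, pin_name, n)
```

For a real file, passing the open file object (`with open(path) as f: count_pin_shapes(f)`) streams it line by line, which matters when the LEF is hundreds of MB.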
Which LEFs are you interested in?
They are from this project https://github.com/The-OpenROAD-Project/megaboom
@oharboe BoomNonBlockingDCache
The LEF is 300 MB, for example. It looks like it is exposing the entire power grid and probably doesn't have power pins in the actual subblock.
I had to create the abstract just after the floorplan as I was working on the floorplan higher up.
Could that explain the unusual characteristics or would you expect the same after place, cts and route?
@oharboe no, either the power grid specification didn't note the layer to add pins to, or the LEF writing is doing something wrong. I don't think the later stages should matter for the LEF generation.
@oharboe this seems to have stalled out on the size of your LEF macros. Is there something here to pursue, as this doesn't seem to be a PDN issue?
I think this stopped being a problem for me after I started mocking memories.