Elevated memory usage in def-use
We are seeing severely elevated memory usage in def-use.
We start like this:
PassRepeated invoking DoSimplifyDefUse
ProcessDefUse invoking P4::ComputeWriteSet
heap after P4::ComputeWriteSet: in use 758MB, max 837MB
ProcessDefUse invoking P4::(anonymous namespace)::FindUninitialized
heap after P4::(anonymous namespace)::FindUninitialized: in use 758MB, max 837MB
and end like this:
heap after SimplifyDefUse: in use 9.2GB, max 9.3GB
Note that the heap size query itself triggers a garbage collection, so in reality the peak memory usage could be well beyond 16 GB.
I do not yet know the culprit; still investigating.
Tagging @ChrisDodd
The usual suspect:
Allocated a total of 2.7GB memory
allocated 737MB in 5576175 calls from:
4 p4c 0x0000000102ff5234 _ZNK2P411Definitions15joinDefinitionsEPKS0_ + 440 0x102ff5234
...
Another example P4 program that shows quite large CPU and memory use during the SimplifyDefUse pass: https://forum.p4.org/t/error-during-p4-program-compilation-too-many-heap-sections-increase-maxhincr-or-max-heap-sects/1231/7
Thanks @jafingerhut, this is a great small example. At the first use-def invocation we have:
heap after RemoveUnused: in use 45MB, max 55MB
ProcessDefUse invoking P4::ComputeWriteSet
heap after P4::ComputeWriteSet: in use 2.1GB, max 2.1GB
ProcessDefUse invoking P4::(anonymous namespace)::FindUninitialized
heap after P4::(anonymous namespace)::FindUninitialized: in use 2.1GB, max 2.1GB
ProcessDefUse invoking RemoveUnused
I killed the process after it reached ~6 GB of used memory during the second use-def.
The code just contains a decent number of small actions, plus switches and conditions invoked according to the input. In this case it looks pretty similar to the large apps I saw, though it is a bit different (my cases often have a bunch of temporaries created by side-effect ordering).
And the memory allocation pattern is standard:
Allocated a total of 1.8GB memory
allocated 1.3GB in 2096398 calls from:
??: P4::Definitions::joinDefinitions() [100977f30]
??: P4::ProgramPoints::merge() [1009779b0]
??: absl::lts_20240116::container_internal::raw_hash_set<>::raw_hash_set() [100984c2c]
??: absl::lts_20240116::container_internal::raw_hash_set<>::resize() [100984fbc]
allocated 80MB in 24992 calls from:
??: absl::lts_20240116::container_internal::raw_hash_set<>::insert<>() [100977894]
??: absl::lts_20240116::container_internal::raw_hash_set<>::find_or_prepare_insert<>() [100986820]
??: absl::lts_20240116::container_internal::raw_hash_set<>::prepare_insert() [100986910]
??: absl::lts_20240116::container_internal::raw_hash_set<>::resize() [100984fbc]
I ran it on a system with 16 GBytes of RAM, and I saw the memory usage of p4c increase to between 11 and 12 GBytes, and I think it took around 20 minutes to finish, most of that time being spent in the longer of the two SimplifyDefUse passes that I noticed took significant time.
The second one runs after the inliner, so the input code size increases quite a lot. The memory consumption of SimplifyDefUse is definitely superlinear in the number of statements, as it tracks all "last" writes for each particular program point...
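To illustrate why tracking all "last" writes per program point can grow superlinearly, here is a toy model. It is not p4c code: the types, the `joinWriterSets` helper and the assumption that one snapshot is retained per statement are all simplifications made up for this sketch. It simulates `n` consecutive statements of the form `if (c_i) x = ...;` and counts how many writer entries end up stored in total.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model (hypothetical, not the p4c data structures): a snapshot maps
// each variable to the set of program points that may have written it last.
using ProgramPoint = int;
using WriterSets = std::map<std::string, std::set<ProgramPoint>>;

// Join two snapshots at a control-flow merge by taking the union of the
// writer sets of every variable.
WriterSets joinWriterSets(const WriterSets &a, const WriterSets &b) {
    WriterSets result = a;
    for (const auto &entry : b)
        result[entry.first].insert(entry.second.begin(), entry.second.end());
    return result;
}

// Simulate n statements `if (c_i) x = ...;` and return the total number of
// writer entries stored across all per-statement snapshots.
std::size_t totalStoredEntries(int n) {
    assert(n >= 0);
    WriterSets current{{"x", {0}}};     // point 0: the initial write to x
    std::vector<WriterSets> snapshots;  // assumption: one snapshot kept per statement
    for (ProgramPoint p = 1; p <= n; ++p) {
        // In the taken branch the new write at point p replaces earlier ones.
        WriterSets thenBranch{{"x", {p}}};
        // The untaken branch leaves `current` unchanged, so merge both.
        current = joinWriterSets(current, thenBranch);
        snapshots.push_back(current);
    }
    std::size_t total = 0;
    for (const auto &snap : snapshots)
        for (const auto &entry : snap) total += entry.second.size();
    return total;
}
```

In this model the writer set after statement `p` holds `p + 1` entries, so the total is `n * (n + 3) / 2`: doubling the number of statements roughly quadruples the stored entries. Real programs have many variables and more complicated control flow, so this only shows the shape of the growth, not actual p4c numbers.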
We have disabled the second SimplifyDefUse downstream; we simply cannot "afford" it. And even the first one is problematic :)