p4c
p4c copied to clipboard
Reduce Visitor overhead
Visitors are central objects in P4C architecture. However, the current implementation imposes lots of overheads and some inefficiencies.
I would name some of those that draw my attention
- They create lots of GC/malloc traffic. One of main sources of this is tracking visited nodes. For example, on P4CParserUnroll.switch_20160512 testcase we are having 50k calls to
Inspector::init_apply
and 138k calls toTransform::init_apply
. On the real-world sources I had cases 10x larger. Each of such calls creates a new map to track visited nodes (just visited plus few bits forInspector
andChangeTracker
forModifier
andTransform
). For some reasons these maps are wrapped into shared_ptr's, though I do not see where they are copied.ForwardChildren
takes a reference to it.
What does matter is that majority of these maps are small. For that 50k calls to Inspector
we are having 10k maps of size 1 (essentially node with no children), another 10k maps of size 2 (so, single children and nothing else) and approx 45k (so 90% of total) of maps of size <= 16. Rehashing and corresponding memory allocation take quite some visible amount of time in overall profile. Note that std::unordered_map
is quite malloc-heavy due to restrictions and guarantees required by C++ standard. Situation with Transforms
is similar – 85% of nodes are contained in small maps (< 16 entries).
We definitely need to optimize for small map case, e.g. having space pre-allocated and implementing map adapter on top of plain vector (aka flat_map
). The linear scan over few vector elements won't be much slower than hash table lookup as evidence shows us.
The amount of map lookups is also quite large. We'd probably switch from std::unordered_map
into some more performant solution.
-
Transforms are cloning always. Regardless whether node is changed or not. Likely this should be changed to some lazy implementation. However, it seems that some passes rely heavily on the way how nodes are currently traversed and what are the implementation details there.
-
Visitors are written in recursive manner. This generates huge call stacks and results in lots of virtual function calls for almost every action.
-
Visitors use textbook double-dispatch. So, every
apply
essentially costs 2 virtual function calls for each node.
Now this could be solved with static dispatch. With pre-defined typeids we can just switch over them, eliminating one virtual function call. The second virtual function call could be eliminating via CRTP.
That said, these are not something easy or isolated changes as Visitors are everywhere :) I will be contributing patches for some of these issues. Spoiler: for some downstream code we saw 4x reduction in the compile time via removing of various overheads here and there.
For some reasons these maps are wrapped into shared_ptr's, though I do not see where they are copied.
ForwardChildren
takes a reference to it.
Could this be because of the SplitFlowVisitor
that creates copies (clones) of visitors? That one is a nasty one :-(.
What does matter is that majority of these maps are small. For that 50k calls to
Inspector
we are having 10k maps of size 1 (essentially node with no children), another 10k maps of size 2 (so, single children and nothing else) and approx 45k (so 90% of total) of maps of size <= 16. Rehashing and corresponding memory allocation take quite some visible amount of time in overall profile. Note thatstd::unordered_map
is quite malloc-heavy due to restrictions and guarantees required by C++ standard. Situation withTransforms
is similar – 85% of nodes are contained in small maps (< 16 entries).
Interesting, seems like we would be able to improve it a lot.
We definitely need to optimize for small map case, e.g. having space pre-allocated and implementing map adapter on top of plain vector (aka
flat_map
). The linear scan over few vector elements won't be much slower than hash table lookup as evidence shows us.The amount of map lookups is also quite large. We'd probably switch from
std::unordered_map
into some more performant solution.
Definitely, some open addressing/closed hashing scheme (without overflow buckets) would likely work better for the "large" case based on my experience.
2. Transforms are cloning always. Regardless whether node is changed or not. Likely this should be changed to some lazy implementation. However, it seems that some passes rely heavily on the way how nodes are currently traversed and what are the implementation details there.
I was thinking about this too. The big problem I think is the children forwarding and seeing the right children in postorder. Which seems to be reasonable requirement, but makes the general case quite hard to implement without cloning.
We could possibly have a lightweight transform for preorder-only cases, or we would need some other way of passing the children to the {pre,post}order
functions than in the clone of the current node. I was thinking about the latter also for different reason -- the current Transform is a point to work with when one wants to lower IR into nodes that do not neatly fit into the abstract classes defined in the P4 IR (i.e. when one lowers statements and expression into something else). The visitor can't really handle this case and it is necessary to wrap the values into some artificial wrappers to meet the require invariants.
So far I did not found anything that would not require major extension of the IR hirearchies' functionality though (something like a fold from functional programming -- i.e. a postorder
function taking not only the original node, but also all the processed children -- would work, but it would be quite tricky and require uglier boilerplate then the visitors).
3. Visitors are written in recursive manner. This generates huge call stacks and results in lots of virtual function calls for almost every action. 4. Visitors use textbook double-dispatch. So, every `apply` essentially costs 2 virtual function calls for each node.
Is this really a problem though? The only alternative would be RTTI-based dispatch, that would have just one huge switch (or something like that), but would that be significantly faster than two virtual calls? And virtual call would be needed to cast to the right type anyway (and to visit children).
Now this could be solved with static dispatch. With pre-defined typeids we can just switch over them, eliminating one virtual function call. The second virtual function call could be eliminating via CRTP.
That said, these are not something easy or isolated changes as Visitors are everywhere :) I will be contributing patches for some of these issues. Spoiler: for some downstream code we saw 4x reduction in the compile time via removing of various overheads here and there.
Nice, thank you. I wonder what you have. I would definitely be interested in improvements around the maps and transform. I am little skeptical about the double-dispatch elimination but I will wait and see what parts you have in your codebase.
The big problem I think is the children forwarding and seeing the right children in postorder. Which seems to be reasonable requirement, but makes the general case quite hard to implement without cloning.
I would say even more. Some of current passes exploits that even childrens are "finalized" (e.g. when processing second children they expect that first one is already transformed and put into the proper place into parent node). But in such case indeed it would be better to have clones.
Is this really a problem though? The only alternative would be RTTI-based dispatch, that would have just one huge switch (or something like that), but would that be significantly faster than two virtual calls? And virtual call would be needed to cast to the right type anyway (and to visit children).
The current round-trip requires 4 function calls, two of them might be inlined and two of them are virtual. CRTP approach would require one virtual call for typeid (actually we can make it non-virtual storing the typeid in the member, but this is a separate story). Casting is static from base to immediate derived both for Visitor and for Node, no virtual bases involved, so here there will be at least 1 virtual call less. Plus empty cases could be potentially inlined into that switch, simplifying it a bit more.
Children are separate and orthogonal. I would say that recursive manner produce gigantic call stacks and essentially hide the flow, making overall reasoning and profiling quite hard.
That said, there are lots of possible small improvements here and there. Some of them likely could contribute ~ 0.5%, so the effect could only be visible combined.
@ChrisDodd
That said, these are not something easy or isolated changes as Visitors are everywhere :)
One thing to start with this is to introduce a new visitor class in visitor.h
and selectively replace the derived existing visitors. This way you can isolate the fallout. Could even be a copy of the existing Transform
with modifications.
That said, these are not something easy or isolated changes as Visitors are everywhere :)
One thing to start with this is to introduce a new visitor class in
visitor.h
and selectively replace the derived existing visitors. This way you can isolate the fallout. Could even be a copy of the existingTransform
with modifications.
Thanks for the tip! Sure, this is the way forward ;)
So, I tried to understand the logic behind shared_ptr
for visited
trackers inside Visitors. And it seems it does nothing wrt flow clones, etc. shared_ptrs
were introduced in https://github.com/p4lang/p4c/pull/3168 as way to "better track lifetime". This raises some questions as it is likely I'm missing something:
-
Why
shared_ptr
was used and notunique_ptr
? Or just plaindelete
on the pointer? Note that whileSplitFlowVisitor
pretend to do some "cloning" (though by default these are just shallow clones – we just copyVisitor
pointer), the overall logic is not correct wrt the liveness ofvisited
: we explicitlyreset
it after apply so that if someone else already owned this shared pointer, then the behavior is undefined per standard. Looks like the object is entirely owned byVisitor
: its lifecycle is controlled byinit_apply
andapply_visitor
(the latter does not make sense to me, it should be better released inend_apply
, currently we do explicitctx
check to ensure that it's the "last"/top-levelapply_visitor
. -
Inspector
/Modifier
/Transform
are not quite deeply copied. Instead pointers to them fly here and there. So maybe we can be more explicit on lifetime of different things? Make visitors move-only things (as they are stateful) and enforce proper semantics when cloning is necessary?
Here is rationale for all this: In the majority of cases these visited
trackers are small. They contain less than 16 nodes and often it's only 1-2. Therefore, as stated above, I'd like to introduce small map optimization: let visitor have pre-allocated space to, say, 16 records and switch to larger hashmaps only when necessary. This way we will get rid of these extra memory allocations at all significantly reduce GC traffic (see the numbers above).
But maybe I'm missing something and there is some backstory here?
@vlstill @ChrisDodd @grg @fruffy do you happen to know?
I personally do not know much about the choices made for the visitors, so your guess is as good as mine.
When does a visitor need to be copied, for example?
When does a visitor need to be copied, for example?
First case when we're adding Visitor
to PassManager
via const reference. The second case is the default implementation of ControlFlowVisitor::flow_clone()
. In both cases Visitor::clone()
is called.
However, the particular clone
implementations seem to show different behaviours:
-
DoLocalCopyPropagation
invokes default copy constructor. -
TypeInference
only copies itself, not base classes
No particular clone
is defined for Inspector
/ Modifier
/ Transform
. I do not know what's downstream :)
IMO all this is very misleading and could lead to subtle issues everywhere. We certainly need to stop accepting Visitor
by const reference in PassManager
, it only makes sense for stateless Inspector
's and similar visitors (though it seems to be unused in the upstream codebase).
Lots of points here
For some reasons these maps are wrapped into shared_ptr's, though I do not see where they are copied.
This comes from flow_clone/flow_merge which clone the Visitor (pointing at the same tracker), though the way thing are structured, the first copy of the visitor (which creates the tracker) should always also be the last (alive), so perhaps that could be leveraged to avoid the shared_ptr overhead if that is significant
Transforms are cloning always. Regardless whether node is changed or not.
Yes, this is a sore point. The original C code this is based on would do that clone on the stack (creating a copy as a local var in apply_visitor
) and only copy that to the heap at the end if any changes were made. Unfortunately there's no good way of doing that in C++. Maybe a union
of all possible IR types and a placement-new clone? This has a problem if a transform preorder/postorder method tries to stash away a pointer to this clone, which should not be done.
When we were first moving to using GC, I assumed that the garbage collector's malloc would be very efficient (as it just needs to bump a pointer), so the benefit of trying to use a stack clone would be minimal.
Visitors use textbook double-dispatch. So, every apply essentially costs 2 virtual function calls for each node.
Yes. When this code was written, it was assumed that by the time C++17 came out, the language would have mulitple dispatch and implementations would catch up with implementing it efficiently. Never happened. Classic problem with C++ -- lots of things never get implemented efficiently as "noone uses them", but the only reason noone uses them is that they are inefficient so everyone comes up with hacks to work around the problem, rather than just writing clean code that expresses what it is doing.
Visitors are written in recursive manner. This generates huge call stacks and results in lots of virtual function calls for almost every action.
Yes. Referring back to the original C code that inspired this code, it used a manually managed stack (fixed size array) of Context
structures (could be a local var in apply_visitor
, or allocated elsewhere for systems with limited stack space such as Windows drivers) and would only make a recursive call when that fixed array filled up. Doing the same in C++ should be possible, but non-trivial.
What does matter is that majority of these maps are small. ... Note that
std::unordered_map
is quite malloc-heavy
Yes this is another place that hvec_map should be an improvement.
When we were first moving to using GC, I assumed that the garbage collector's malloc would be very efficient (as it just needs to bump a pointer), so the benefit of trying to use a stack clone would be minimal.
Sadly, it is not. BDWGC is not the fastest malloc implementation itself. Also, there are additional overheads as it essentially memset(ptr, size, 0)
all allocated / freed memory (so remove possible dangling pointers)...
IMO all this is very misleading and could lead to subtle issues everywhere. We certainly need to stop accepting Visitor by const reference in PassManager
Intuitively I would say a PassManager should own any visitor passed to it. I can't think of a sensible scenario where a visitor is copied into a pass manager and still used independently.
Intuitively I would say a PassManager should own any visitor passed to it. I can't think of a sensible scenario where a visitor is copied into a pass manager and still used independently.
There are cases when an Inspector
is used in a pipeline and later the collected results are reused. See e.g. https://github.com/p4lang/p4c/blob/main/frontends/p4/frontend.cpp#L238
There are cases when an
Inspector
is used in a pipeline and later the collected results are reused. See e.g. https://github.com/p4lang/p4c/blob/main/frontends/p4/frontend.cpp#L238
Ah, I definitely have used this pattern before. Although I question whether doing that within a PassManager is a good idea. The VisitorRef implementation seems a little muddy, especially considering the unclear semantics of clone
as you pointed out.
Ah, I definitely have used this pattern before. Although I question whether doing that within a PassManager is a good idea. The VisitorRef implementation seems a little muddy, especially considering the unclear semantics of
clone
as you pointed out.
And actually clone
there is never used. At least upstream.