Poor performance of SDF Element Pointer Construction
Environment
- OS Version: Ubuntu 22.04
- Source
Description
- Expected behavior: Destroying
sdf::Elementshould be relatively fast. - Actual behavior:
sdf::Elementtakes forever to destroy.
Steps to reproduce
Here is a minimal program:
#include <sdf/Model.hh>
#include <sdf/Root.hh>
void deserialize()
{
std::string sdf = "<?xml version=\"1.0\" ?><sdf version='1.6'></sdf>";
sdf::Root root;
sdf::Errors errors = root.LoadSdfString(sdf);
if (!root.Model())
{
return;
}
sdf::Model _model = *root.Model();
return;
}
int main(int argc, char **argv)
{
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
deserialize();
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << "Whole function call took: " <<
std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()
<< "[µs]" << std::endl;
}
On my 8 core intel i7-11850H @ 2.50GHz with 64GB of ram I get the following slow speed:
Whole function call took: 108759[µs]
I.E: It takes 0.1s to parse, validate and destroy what is effectively an empty xml file. We use this function extensively in Gazebo. For instance, when we serialize and deserialize model components. In fact this exact set of steps runs everytime a model is deserialized, hence when we try to run 3k_worlds.sdf where we load 3k shapes, we end up spending 300s just to deserialize the initial scene to the user. In fact when we run perf on the simulated world during, we see a large amount of time being spent inside the sdf library. Additionally, if we were to exit early (as done in this hack) instead of trying to deserialize we see a massive speedup (22s to load vs 300s, I suspect there is additional overhead caused by this bug inside renderutils as well). There are other places where we create and destroy, sdf elements within gazebo fairly frequently, I think its important we address this issue as it can lead to a much better user experience for end-users.
Output
I wonder, do you have a flame graph for this test as well? It would be nice to figure out which function calls are the bottleneck (your comment implies that it's the destruction?)
Sure. Here it is: https://drive.google.com/file/d/1TS4VsLKVfTsSNSTF0d-q88Ndq2tMXG_t/view?usp=drive_link
I seem to be missing some frame information, not sure exactly why but, the destructor does take a lot of time.
I'm not really an expert on sdformat but it seems that the bulk of the time spent in the destruction of XML Elements is actually spent in... Construction of XML Elements? It would be interesting to setup a breakpoint and step through the code to understand what is the path that leads to construction within the destructor
I've been unable to find a way to reproduce the destructor issues. It is possible that the flamegraph is not 100% accurate. However, based on GDB, I've found that the construction itself is very expensive as we reparse the specification every single time. By cacheing the initial parse of the spec as done in #1479 , we can significantly improve performance of repeated construction.
From reading through the SDFormat parsing code, I see that there is a critical loop in the readXml function. The readXml function is an internal function called by the readDoc functions and called by itself recursively. An ElementPtr _sdf argument is passed to readXml, and a for loop iterates over the child elements of _sdf. Each child element with a matching SDF description is then cloned with Element::Clone and then passed recursively to readXml.
So Element::Clone is worth some scrutiny, and as implied by the name, it includes some dynamic memory allocation. There are a few places where we could call std::vector::reserve before calling vector::push_back in a loop (see https://github.com/gazebosim/sdformat/pull/1480), which would help a small amount. Ultimately, I think a bigger issue is that each Element stores a copy of its schema in private member variables like elementDescriptions and attributes rather than holding a reference to a canonical representation of the schema. Changing this would be a bigger endeavor but could reduce the amount of memory allocations needed during Element::Clone.
Hey Steve, I agree that Element::Clone() also needs some scrutiny. I also noticed that we reparse attributes after cloning. I've got a draft PR up for that.
Hey Steve, as I agree that
Element::Clone()also needs some scrutiny. I also noticed that we reparse attributes after cloning. I've got a draft PR up for that.
yeah, thanks I just noticed your PR (https://github.com/gazebosim/sdformat/pull/1479) after posting my comment, and because I didn't look into the calls to Param::SetParentElement in Element::Clone, I didn't notice that Param::SetParentElement calls Param::Reparse, which you are tinkering with in https://github.com/gazebosim/sdformat/pull/1479. Note also that the Param::Reparse method was added in https://github.com/gazebosim/sdformat/pull/589.
Ran into a long start up time issue with the Jetty demo world and narrowed it down to model sdf deserialization: https://github.com/gazebosim/jetty_demo/issues/3
This issue pretty much sums up the problem.
My recommendation is let's see where we are instantiating sdf::Root and eleminate unessecary instantiations. I had found some places for instance here last year: gazebosim/gz-sim#2596
If you search for ModelSdf, it's only used by systems that run on the server. For some reason I thought https://github.com/gazebosim/gz-sim/pull/2596 was skipping all serialization of models. Instead, it's only skipping the ones that have relative_to tags. We should really skip serializing models altogether.
In https://github.com/gazebosim/gz-sim/pull/2596#issuecomment-2334581609, we discussed whether this is considered a critical bug so that it can be merged during the code freeze. If it's really hampering our ability to create the demo world, I'd say let's go ahead and make this change.
We should really skip serializing models altogether.
I tested this and the jetty demo world load time reduced from 55s to 42s on my (old) machine. Most importantly, hitting the Play button now gives an immediate response (previously it was taking >30 seconds to resume from paused state). I think it's worth making this change, while we continue to look for other perf issues
here's a draft PR for testing: https://github.com/gazebosim/gz-sim/pull/3070