gz-sim
gz-sim copied to clipboard
Basic async primary-secondary stepping
First step towards https://github.com/ignitionrobotics/ign-gazebo/issues/468.
TODOs:
- Add a field in the message the primary send to secondaries that indicates how many steps the secondary can move forward. If that number is 1, the behavior will be exactly the same as before the PR.
- Real time is only being updated by the primary, secondaries are only asynchronously updating the simulated time/iteration number.
Notes on how the case where performers are interacting will be handled (not implemented in this PR), based on a previous discussion with @mjcarroll:
- Performers are far apart -> secondaries can make steps asynchronously
- Performers are seeing each other -> need perfect lockstep. We could also first run the preUpdate/Update phase, sync secondaries state and then run the postUpdate phase. I'm not sure if that's worth it or not.
- Performers interacting each other -> physics should be reallocated to same secondary. We would need to: run preUpdate phase, sync secondaries state, run Update phase (physics), sync secondaries state, run postUpdate phase (all in a perfect lockstep fashion).
I'm not sure if we want to be able to handle a combination of this cases when having more than two secondaries. e.g.: supposing one performer per secondary, A is interacting with B and C is not interacting with neither A or B, should C run asynchronously? In that case, we would need some extra information in the message that the primary sends to the secondaries.
Results when running the distributed_levels
simulation in my computer:
- standalone simulation is completed with RTF of about 800~1200%
- distributed, before this PR, RTF was about 30~40%
- distributed, after this PR, RTF of about 40~80%
@mjcarroll does the approach of this PR seem correct?
It would probably be good to test what's the benefit of asynchronous primary-secondary stepping in a more realistic simulation (not only physics), any recommended world to test that?
It would probably be good to test what's the benefit of asynchronous primary-secondary stepping in a more realistic simulation (not only physics), any recommended world to test that?
As far as platforms, some of the SubT vehicles are probably ideal (longer list here: https://github.com/osrf/subt/wiki/Robots)
- Lidar + Camera: https://app.ignitionrobotics.org/openrobotics/fuel/models/X1%20Config%204
- Really sensor heavy (lidar + 4 kinects): https://app.ignitionrobotics.org/OpenRobotics/fuel/models/EXPLORER_X1_SENSOR_CONFIG_2
I think it would be reasonable to just place 4 of these on a plane 20 meters apart and use that as the "no interaction" case to see what the speedup looks like.
Actual worlds are available in the SubT repo: https://github.com/osrf/subt/tree/master/subt_ign/worlds, but I don't think it's really necessary to use one of those in this stage of development.
force-pushed to solve conflicts
I have been doing some tests today, it seems to me that allowing the secondaries to make steps asynchronously is not of much benefit and the current performance bottleneck is the primary.
Really sensor heavy (lidar + 4 kinects): https://app.ignitionrobotics.org/OpenRobotics/fuel/models/EXPLORER_X1_SENSOR_CONFIG_2
I used two of those models replacing the box and the sphere in the distributed levels simulation, and the secondaries where always "far ahead" and the primary was always going "quite behind", so there was almost not speedup when letting the secondaries make steps asynchronously. The simulation was allowing the secondaries to move ahead 1000 steps, and the primary had always about 500 secondaries iterations cached and was never waiting for a secondary message.
I did some profiling, it seems that SceneBroadcaster::PostUpdate
step is taking much longer (100x) when the simulation is distributed than when it's standalone.
I'm not sure what's the reason, but I assume that that behaves differently when entity/components are being updated from a serialized message than when they are being updated locally.
@mjcarroll I have a bunch of questions:
- are this results expected?
- does it worth exploring why the
SceneBroadcaster::PostUpdate
is much slower in the distributed case and optimizing it?
If we are not going to make the primary go faster, in the only case where there might be a benefit of allowing the secondaries to make steps asynchronously is if the secondaries are really "heavy loaded", but I'm not sure if this is going to be the case in a subT world. Maybe, the world I'm using for the tests is still too simple.
Last commit has the modified "distributed_levels" example I'm using for testing
@mjcarroll I have created some scripts that allows to run N robots in M secondaries, and it records stats using ign_imgui
.
Typical usage:
#standalone
cd ~/my_ws/src/ign-gazebo/examples/scripts/distributed_levels
# Run with 8 robots
# Will generate an `stats_standalone_r8.csv` that can be visualized with `ign_imgui -i <PATH_TO_CSV>`
./standalone.sh -r 8
# distributed, machines running a secondary
cd ~/my_ws/src/ign-gazebo/examples/scripts/distributed_levels
./secondary.sh -r 8 # run with 8 robots
# distributed, machines running the primary
cd ~/my_ws/src/ign-gazebo/examples/scripts/distributed_levels
# Run with 8 robots, 2 secondaries (4 robots each secondary).
# Will generate an `stats_primary_r8_s2.csv` that can be visualized with `ign_imgui -i <PATH_TO_CSV>`
./primary.sh -r 8 -s 2
My ign_imgui
fork is here, and allows recording the histogram and stats to a CSV file and loading them.
@mjcarroll Do you think this is good enough to start testing or should I add something else? The scripts are easy to run and the results are recorded automatically, so testing should be more or less easy.
Example results in my machine:
standalone 2 robots, average RTF 183%
distributed 2 robots, 2 secondaries, average RTF 118%
Build instructions
- Install ignition dome from source like specified here.
-
python3 -m pip install empy
- Checkout
ivanpauno/secondary-async-stepping-and-primary-acknowledges
inign-gazebo
repo. - Clone https://github.com/ivanpauno/ign_imgui in the workspace
src
directory. -
colcon build --merge-install --packages-up-to ignition-gazebo4 ign-imgui
PS: This need empy
installed, I didn't realize that there was already a template to generate the worlds in the distributed_levels folder :man_facepalming:, if not I would have edited that one.
I think that you may have posted the same image twice in your histogram analysis.
I think that you may have posted the same image twice in your histogram analysis.
Ups, corrected
What's the status of this PR, are there plans to continue working on it?
This was part of some of our work. I don't think there are plans to pursue it as part of the current project, but I will land this PR.
Looks like this will also need a linting pass. I'm not sure why the Focal version isn't failing.
I think this is ready for review as it is, and we should be able to merge as it extends the features of the base distributed system implementation.
I also created a design document summarizing the efforts here and the plan for distributed simulation that could help us get in the same page about the previous conversations and possible limitations.
PTAL @nkoenig @chapulina @mjcarroll. Let me know if you think I should add someone else to the discussion.
Codecov Report
Merging #481 (8cd42af) into ign-gazebo6 (0996fe0) will decrease coverage by
0.84%
. The diff coverage is90.90%
.
:exclamation: Current head 8cd42af differs from pull request most recent head 895f043. Consider uploading reports for the commit 895f043 to get more accurate results
@@ Coverage Diff @@
## ign-gazebo6 #481 +/- ##
===============================================
- Coverage 63.00% 62.16% -0.85%
===============================================
Files 301 278 -23
Lines 24262 23425 -837
===============================================
- Hits 15287 14561 -726
+ Misses 8975 8864 -111
Impacted Files | Coverage Δ | |
---|---|---|
src/network/NetworkManagerSecondary.cc | 89.58% <88.88%> (+8.33%) |
:arrow_up: |
src/network/NetworkManagerPrimary.cc | 83.06% <94.73%> (+4.16%) |
:arrow_up: |
...stems/joint_state_publisher/JointStatePublisher.cc | 81.13% <0.00%> (-2.61%) |
:arrow_down: |
...int_position_controller/JointPositionController.cc | 71.85% <0.00%> (-1.53%) |
:arrow_down: |
src/EntityComponentManager.cc | 88.26% <0.00%> (-1.11%) |
:arrow_down: |
...rc/systems/odometry_publisher/OdometryPublisher.cc | 85.35% <0.00%> (-0.72%) |
:arrow_down: |
src/gui/plugins/scene_manager/GzSceneManager.cc | 19.17% <0.00%> (-0.58%) |
:arrow_down: |
src/Link.cc | 94.48% <0.00%> (-0.29%) |
:arrow_down: |
... and 41 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 0996fe0...895f043. Read the comment docs.
I took some time testing this in my personal PC and using EC2 AWS instances. The main purpose was checking that the PR works using multiple computers, but I went ahead and gathered some RTF metrics and organized them in the attached PDF: Distributed simulation RTF testing.pdf
Can I ask you to take a another look to the code? :pray: I think it's ready to merge. @chapulina @nkoenig @mjcarroll
I took some time testing with 5 instances in the AWS environment I prepared, I'm attaching the results.
I also added some extra graphic tables so it's easier to see the comparison between implementations before and after the PR, and between the PC vs AWS cloud environment. PTAL @mjcarroll Distributed simulation RTF testing_5instances.pdf
For reference, I'm attaching here a guide with the steps I followed to setup this PR and test it using AWS instances. Setting up AWS cloud environment for testing Ignition’s distributed simulation.pdf
I changed the testing script to delete older results by default with befd5fc. Otherwise just addressed old comments, I think this is ready to merge. PTAL @mjcarroll
This has been open for quite some time, I don't think there is any intention to pick it up in the near future. I'm going to close it.