Basic async primary-secondary stepping

Open ivanpauno opened this issue 3 years ago • 16 comments

First step towards https://github.com/ignitionrobotics/ign-gazebo/issues/468.

TODOs:

Add a field in the message the primary send to secondaries that indicates how many steps the secondary can move forward. If that number is 1, the behavior will be exactly the same as before the PR.
Real time is only being updated by the primary, secondaries are only asynchronously updating the simulated time/iteration number.

Notes on how the case where performers are interacting will be handled (not implemented in this PR), based on a previous discussion with @mjcarroll:

Performers are far apart -> secondaries can make steps asynchronously
Performers are seeing each other -> need perfect lockstep. We could also first run the preUpdate/Update phase, sync secondaries state and then run the postUpdate phase. I'm not sure if that's worth it or not.
Performers interacting each other -> physics should be reallocated to same secondary. We would need to: run preUpdate phase, sync secondaries state, run Update phase (physics), sync secondaries state, run postUpdate phase (all in a perfect lockstep fashion).

I'm not sure if we want to be able to handle a combination of this cases when having more than two secondaries. e.g.: supposing one performer per secondary, A is interacting with B and C is not interacting with neither A or B, should C run asynchronously? In that case, we would need some extra information in the message that the primary sends to the secondaries.

Results when running the distributed_levels simulation in my computer:

standalone simulation is completed with RTF of about 800~1200%
distributed, before this PR, RTF was about 30~40%
distributed, after this PR, RTF of about 40~80%

@mjcarroll does the approach of this PR seem correct?

It would probably be good to test what's the benefit of asynchronous primary-secondary stepping in a more realistic simulation (not only physics), any recommended world to test that?

Dec 10 '20 21:12 ivanpauno

It would probably be good to test what's the benefit of asynchronous primary-secondary stepping in a more realistic simulation (not only physics), any recommended world to test that?

As far as platforms, some of the SubT vehicles are probably ideal (longer list here: https://github.com/osrf/subt/wiki/Robots)

Lidar + Camera: https://app.ignitionrobotics.org/openrobotics/fuel/models/X1%20Config%204
Really sensor heavy (lidar + 4 kinects): https://app.ignitionrobotics.org/OpenRobotics/fuel/models/EXPLORER_X1_SENSOR_CONFIG_2

I think it would be reasonable to just place 4 of these on a plane 20 meters apart and use that as the "no interaction" case to see what the speedup looks like.

Actual worlds are available in the SubT repo: https://github.com/osrf/subt/tree/master/subt_ign/worlds, but I don't think it's really necessary to use one of those in this stage of development.

Dec 10 '20 21:12 mjcarroll

force-pushed to solve conflicts

Dec 11 '20 13:12 ivanpauno

I have been doing some tests today, it seems to me that allowing the secondaries to make steps asynchronously is not of much benefit and the current performance bottleneck is the primary.

Really sensor heavy (lidar + 4 kinects): https://app.ignitionrobotics.org/OpenRobotics/fuel/models/EXPLORER_X1_SENSOR_CONFIG_2

I used two of those models replacing the box and the sphere in the distributed levels simulation, and the secondaries where always "far ahead" and the primary was always going "quite behind", so there was almost not speedup when letting the secondaries make steps asynchronously. The simulation was allowing the secondaries to move ahead 1000 steps, and the primary had always about 500 secondaries iterations cached and was never waiting for a secondary message.

I did some profiling, it seems that SceneBroadcaster::PostUpdate step is taking much longer (100x) when the simulation is distributed than when it's standalone. I'm not sure what's the reason, but I assume that that behaves differently when entity/components are being updated from a serialized message than when they are being updated locally.

@mjcarroll I have a bunch of questions:

are this results expected?
does it worth exploring why the SceneBroadcaster::PostUpdate is much slower in the distributed case and optimizing it?

If we are not going to make the primary go faster, in the only case where there might be a benefit of allowing the secondaries to make steps asynchronously is if the secondaries are really "heavy loaded", but I'm not sure if this is going to be the case in a subT world. Maybe, the world I'm using for the tests is still too simple.

Dec 11 '20 21:12 ivanpauno

Last commit has the modified "distributed_levels" example I'm using for testing

Dec 11 '20 21:12 ivanpauno

@mjcarroll I have created some scripts that allows to run N robots in M secondaries, and it records stats using ign_imgui.

Typical usage:

#standalone
cd ~/my_ws/src/ign-gazebo/examples/scripts/distributed_levels
# Run with 8 robots
# Will generate an `stats_standalone_r8.csv` that can be visualized with `ign_imgui -i <PATH_TO_CSV>`
./standalone.sh -r 8

# distributed, machines running a secondary
cd ~/my_ws/src/ign-gazebo/examples/scripts/distributed_levels
./secondary.sh -r 8  # run with 8 robots

# distributed, machines running the primary
cd ~/my_ws/src/ign-gazebo/examples/scripts/distributed_levels
# Run with 8 robots, 2 secondaries (4 robots each secondary).
# Will generate an `stats_primary_r8_s2.csv` that can be visualized with `ign_imgui -i <PATH_TO_CSV>`
./primary.sh -r 8 -s 2

My ign_imgui fork is here, and allows recording the histogram and stats to a CSV file and loading them.

@mjcarroll Do you think this is good enough to start testing or should I add something else? The scripts are easy to run and the results are recorded automatically, so testing should be more or less easy.

Example results in my machine:

standalone 2 robots, average RTF 183%

Screenshot from 2020-12-17 18-37-41

distributed 2 robots, 2 secondaries, average RTF 118%

Screenshot from 2020-12-17 18-37-03

Build instructions

Install ignition dome from source like specified here.
python3 -m pip install empy
Checkout ivanpauno/secondary-async-stepping-and-primary-acknowledges in ign-gazebo repo.
Clone https://github.com/ivanpauno/ign_imgui in the workspace src directory.
colcon build --merge-install --packages-up-to ignition-gazebo4 ign-imgui

PS: This need empy installed, I didn't realize that there was already a template to generate the worlds in the distributed_levels folder :man_facepalming:, if not I would have edited that one.

Dec 17 '20 21:12 ivanpauno

I think that you may have posted the same image twice in your histogram analysis.

Dec 18 '20 14:12 mjcarroll

I think that you may have posted the same image twice in your histogram analysis.

Ups, corrected

Dec 18 '20 14:12 ivanpauno

What's the status of this PR, are there plans to continue working on it?

May 04 '21 00:05 chapulina

This was part of some of our work. I don't think there are plans to pursue it as part of the current project, but I will land this PR.

May 04 '21 00:05 mjcarroll

Looks like this will also need a linting pass. I'm not sure why the Focal version isn't failing.

Jun 14 '21 20:06 mjcarroll

I think this is ready for review as it is, and we should be able to merge as it extends the features of the base distributed system implementation.

I also created a design document summarizing the efforts here and the plan for distributed simulation that could help us get in the same page about the previous conversations and possible limitations.

PTAL @nkoenig @chapulina @mjcarroll. Let me know if you think I should add someone else to the discussion.

Jun 18 '21 18:06 Blast545

Codecov Report

Merging #481 (8cd42af) into ign-gazebo6 (0996fe0) will decrease coverage by 0.84%. The diff coverage is 90.90%.

:exclamation: Current head 8cd42af differs from pull request most recent head 895f043. Consider uploading reports for the commit 895f043 to get more accurate results

@@               Coverage Diff               @@
##           ign-gazebo6     #481      +/-   ##
===============================================
- Coverage        63.00%   62.16%   -0.85%     
===============================================
  Files              301      278      -23     
  Lines            24262    23425     -837     
===============================================
- Hits             15287    14561     -726     
+ Misses            8975     8864     -111

Impacted Files	Coverage Δ
src/network/NetworkManagerSecondary.cc	`89.58% <88.88%> (+8.33%)`	:arrow_up:
src/network/NetworkManagerPrimary.cc	`83.06% <94.73%> (+4.16%)`	:arrow_up:
...stems/joint_state_publisher/JointStatePublisher.cc	`81.13% <0.00%> (-2.61%)`	:arrow_down:
...int_position_controller/JointPositionController.cc	`71.85% <0.00%> (-1.53%)`	:arrow_down:
src/EntityComponentManager.cc	`88.26% <0.00%> (-1.11%)`	:arrow_down:
...rc/systems/odometry_publisher/OdometryPublisher.cc	`85.35% <0.00%> (-0.72%)`	:arrow_down:
src/gui/plugins/scene_manager/GzSceneManager.cc	`19.17% <0.00%> (-0.58%)`	:arrow_down:
src/Link.cc	`94.48% <0.00%> (-0.29%)`	:arrow_down:
... and 41 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 0996fe0...895f043. Read the comment docs.

Jun 18 '21 18:06 codecov[bot]

I took some time testing this in my personal PC and using EC2 AWS instances. The main purpose was checking that the PR works using multiple computers, but I went ahead and gathered some RTF metrics and organized them in the attached PDF: Distributed simulation RTF testing.pdf

Can I ask you to take a another look to the code? :pray: I think it's ready to merge. @chapulina @nkoenig @mjcarroll

Jul 15 '21 21:07 Blast545

I took some time testing with 5 instances in the AWS environment I prepared, I'm attaching the results.

I also added some extra graphic tables so it's easier to see the comparison between implementations before and after the PR, and between the PC vs AWS cloud environment. PTAL @mjcarroll Distributed simulation RTF testing_5instances.pdf

Jul 23 '21 20:07 Blast545

For reference, I'm attaching here a guide with the steps I followed to setup this PR and test it using AWS instances. Setting up AWS cloud environment for testing Ignition’s distributed simulation.pdf

Jul 29 '21 21:07 Blast545

I changed the testing script to delete older results by default with befd5fc. Otherwise just addressed old comments, I think this is ready to merge. PTAL @mjcarroll

Aug 09 '21 17:08 Blast545

This has been open for quite some time, I don't think there is any intention to pick it up in the near future. I'm going to close it.

Apr 21 '23 11:04 mjcarroll

gz-sim gz-sim copied to clipboard

Basic async primary-secondary stepping

Codecov Report

gz-sim
gz-sim copied to clipboard