drake Idea: modeling long-running asynchronous computation in the system's framework

Related to #10383 -- but that proposal is about doing more efficient computation within one timestep (e.g. we want to run ICP refinement in parallel from many different initial guesses).

Another concept that we need is the idea that some computation might want to run for a potentially variable duration in terms of simulation time. For example, if we dispatch a high-level planner while continuing to execute our low-level control... we would like to start executing that plan as soon as the planner returns, but that return time might be highly variable (due to computational load, but also do to the fundamental randomness/variability in the run-time of many of our planners).

Here is one somewhat concrete proposal to start that discussion: We offer an AsynchronousSystem which runs in one of two modes:

Dispatches a separate thread, performs the Calc method in that thread (e.g. with a copy of the Context that is snapshotted at the time of the dispatch), and upon completion schedules an UnrestrictedUpdateEvent to write the results. The result of simulating the system in this mode will not be deterministic/repeatable, but it is the mode we most likely want for the real systems.
Dispatch the same thread, but also immediately draw a completion_time from a random distribution (using all of the random number machinery in the framework). The system would immediately schedule an UnrestrictedUpdateEvent at that completion time, and upon hitting that event, it would block until the computation thread joins, then write the results. This mode would be deterministic -- given the random seed passed through the System's framework -- and would permit analysis of the implications of having variable duration computational elements in the diagram.

Since it is likely that we'd either want ALL AsynchronousSystems in the Diagram to be in either mode 1 or mode 2, we might want to offer a variable in the Context alongside the accuracy parameter to toggle this behavior.

This is perhaps the most specific conception I've had in my head about how we might do rigorous analysis/testing of systems with e.g. long-running planners in the loop. I think it can be an important concept.

cc @jwnimmer-tri , @sammy-tri , @sherm1, @edrumwri

Mar 30 '19 13:03 RussTedrake

I like this. Do you have a specific reason for wanting it to be a System rather than DrakeVisualizer like utilities?

Mar 31 '19 03:03 edrumwri

The goal is to rigorously analyze a diagram with subsystems that have this property/specification (esp planners).

Mar 31 '19 11:03 RussTedrake

Just to connect ideas: @ggould-tri has a ThreadedSystem in anzu that successfully uses a "mailbox" approach to synchronization between the main sim thread and the long-running computation thread. It's a likely candidate for porting to Drake.

Jul 13 '22 11:07 RussTedrake

Synopsis: A system constructed with a list of port properties and a function whose inputs and return values match those port properties. The system constructs those ports and starts a thread that repeatedly runs the function on the inputs and sets the output to its return values.

This might not be the most ergonomic approach (a builder or two-phase initializer might be more pleasant), but it definitely works and is useful in practice. And of course a std::function can be anything in practice -- a function, a bound method, a lambda, a whole functor object, etc.

Skipping the pages-long doxygen, the public synopsis is:

class ThreadSystem : public drake::systems::LeafSystem<double> {
 public:
  DRAKE_NO_COPY_NO_MOVE_NO_ASSIGN(ThreadSystem);

  using DiscreteValuesRunnable =
      std::function<std::vector<Eigen::VectorXd>(
          const std::vector<Eigen::VectorXd>&,
          const std::vector<std::unique_ptr<drake::AbstractValue>>&)>;

  using DiscretePortInfo = std::tuple<std::string, int>;
  using AbstractPortInfo = std::tuple<std::string, drake::AbstractValue*>;

  ThreadSystem(const std::vector<DiscretePortInfo>& discrete_input_ports_info,
               const std::vector<AbstractPortInfo>& abstract_input_ports_info,
               const std::vector<DiscretePortInfo>& discrete_output_ports_info,
               DiscreteValuesRunnable runnable);

Jul 13 '22 13:07 ggould-tri

Hi, I'm commenting to this issue since I have a similar feature request or more problem to tackle. I'm working on an MPC that takes around 0.1s to solve the control problem and come sup with an optimal input. When implementing this in drake using the drake Simulation the simulator waits for the result which is nice since the input is optimal, however this is not how it works in reality: In reality the system would have evolved during the 0.1s and therefore the input will be applied at a different state that it is calculated for. And this is exactly what I expect an awesome control simulation lib (which drake is without any doubts) to model as well.

For now, I use the DelayBlock to simply delay the control output by 0.1s, however since the solving time is not fixed to 0.1s and this is a very explicit work around, maybe an async computation would implement this implicitly.

Jul 04 '23 09:07 FranekStark

@FranekStark. Yes -- that's exactly the use case I have in mind. BTW -- if the MPC is using MathematicalProgram, be sure you're using e.g. UpdateCoefficients instead of constructing the MathematicalProgram on every timestep. It's much more efficient.

Jul 04 '23 19:07 RussTedrake

@RussTedrake 's HardwareStation concept enabled us to use the system and simulate framework both to simulate a MultibodyPlant, but also to send commands and receive states from a real robot.

Not being able to launch systems in separate threads/processes is becoming a blocker from using drake in both hardware execution and simulation. As other people pointed out in this issue, an MPC loop that takes 0.1 seconds, would block all other systems by 0.1 seconds as well, which may block publishing commands at a given rate to the real robot which drops the realtime rate and may cause weird other behaviors in the simulator trying to catch up. I like the idea of a ThreadSystem or something thats similar to a threaded discrete system with a set publish period where the Calc function has to be faster than the publish period.

Considering GIL and the python bindings, a multi-processed system may be more advantageous than multi-threaded?

Dec 06 '23 18:12 wrangel-bdai

@wrangel-bdai Thanks for bumping this up! BTW, I think threads launched in C++ are not limited by GIL.

Dec 12 '23 21:12 pangtao22

Unfortunately it's a little more complicated than that -- any use of the python interpreter within the process is limited by the GIL, regardless of where the thread it's running on came from. So a C++ thread that calls into the methods of a python-defined System also takes the GIL and blocks any other python for as long as it's running the methods of that System.

Because pydrake programs can be very polyglot in practice, python ends up running (and the GIL locked) in surprising places.

Dec 12 '23 21:12 ggould-tri

Dispatch the same thread, but also immediately draw a completion_time from a random distribution (using all of the random number machinery in the framework). The system would immediately schedule an UnrestrictedUpdateEvent at that completion time, and upon hitting that event, it would block until the computation thread joins, then write the results.

In the past, I've also used a different model when we wanted to simulate planners deterministically. Often, instead of the time spent to make a plan being an independent random variable, it's really a function of the planner inputs -- or at least, influenced by them. The planner can internally count things like "number of collision checks" or "number of nodes visited" or etc. and have a model for the wall clock time the real hardware would need to calculate that particular plan (possibly with some known variance). So, the time of the "plan finished" update cannot be scheduled until the plan has actually finished -- and so, knows how many collision checks it used. (This can still admit cpu concurrency if we can place a lower bound on the necessary time, or if the planner itself internally parallelizes.)

There's also a variant of this idea where only the plan input influences the update time (e.g., some distance metric between start and goal), but that's much less powerful, and often not much of an improvement over a naive input-independent distribution.

Jan 02 '24 21:01 jwnimmer-tri

I like the idea of a ThreadSystem or something thats similar to a threaded discrete system with a set publish period where the Calc function has to be faster than the publish period.

For the record -- this is already possible by writing your own LeafSystem subclass that implements the dynamics you want, using async the way you want.

As an example, we have the RgbdSensorAsync system (docs, code) that performs rendering asynchronously. That particular example is fairly complicated, due to its interface with the heavyweight RenderEngine infrastructure. A simpler system that only need to sample-and-hold some configuration posture(s) and output a goal position should be a lot simpler.

The request in this issue is primarily about (1) making those kinds of systems easy to write, and (2) offering a diagram-wide flag to toggle "block for determinism" vs non-blocking. I don't anticipate the need for any new system framework API in support of this, so it should (I hope) be possible for users to create their own systems to meet their needs in the meantime, just without the sugar.

Jan 02 '24 21:01 jwnimmer-tri

FYI There is another approach available here -- if the goal is never to block, then the "target plan" or "target posture" or whatever can be computed by an always-running planner that constantly publishes the target information, either as a message (LCM, ROS, etc) or into a thread-safe queue or mailbox, which the high-rate Diagram can treat either like a message source system (e.g., LcmSubscriberSystem), or fixed input port (port.FixValue), or etc. That means you'd have two things to manage (the low-rate planner process/thread, and the high-rate controller process/thread) but means you don't need to force the threading inside of the diagram, which might be an easier onramp for some users.

Jan 02 '24 21:01 jwnimmer-tri

drake drake copied to clipboard

Idea: modeling long-running asynchronous computation in the system's framework

drake
drake copied to clipboard