
[AOT] Serialize compute graph in python and load in C++ runtime

ailzhang opened this issue 2 years ago • 4 comments

Copied from #4615 :P : This would significantly reduce the effort of porting a demo written in Taichi to non-Python environments. In essence, the AOT module will save not only the Taichi kernels, but also the host logic invoking them. [Will add more context as we start adding this feature]

ailzhang avatar Apr 14 '22 07:04 ailzhang

Update 04/27 (work in progress): We've converged on an API style for the simplest case.

# One way to produce a compute graph automatically.
@ti.aot.graph
def run_sim():
    kernel1()
    kernel2()

# Or, more explicitly:
g = ti.aot.Graph()
func = g.add_func('func')
func.append(kernel, ...)

mod = ti.aot.Module()
mod.add_graph(run_sim)

// In C++. Note the serialization format is an implementation detail we
// haven't covered yet, but it's pretty flexible.
graph = aot_module->load_graph("...");
graph->run(host_ctx);

Note that it's easy to support static control flow, but we're spending some time exploring how we should support dynamic control flow. Should it be in-graph or out-of-graph? (Concepts borrowed from the "Dynamic Control Flow in Large-Scale Machine Learning" paper.)

  • in-graph: encodes control-flow decisions as operations in the dataflow graph.
  • out-of-graph: implements control-flow decisions in the client process, using control-flow primitives of a host language like Python.

  1. The out-of-graph approach is more like the traditional tracing approach.
  • Loops are unrolled and branches are selected based on Python runtime values.
  • You cannot prevent users from using device-to-host syncs for control flow, which might hurt performance a lot.
  2. For the in-graph approach, we need to decide whether the control-flow support is explicit or implicit:
  • explicit: more like the JAX style
  • implicit: more like the TorchScript style

We're currently building a toy prototype using the explicit in-graph approach, just to get a sense of usability and coverage; the sketch below contrasts the two styles.
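To make the distinction concrete, here's a toy, pure-Python illustration of the two styles (no real Taichi API is involved; the kernel names and the cond tuple encoding are invented for this sketch):

# Out-of-graph (tracing): Python control flow runs at trace time, so only
# the launches actually taken are recorded; the `if` itself never enters
# the graph.
def trace_out_of_graph(use_fast_path):
    ops = ["kernel1"]
    if use_fast_path:              # resolved now, in the host process
        ops.append("kernel2")
    else:
        ops.append("kernel3")
    return ops

# In-graph: the branch is stored as a node and resolved when the graph is
# executed, so no host round-trip is needed at run time.
def build_in_graph():
    return ["kernel1", ("cond", "pred", "kernel2", "kernel3")]

print(trace_out_of_graph(True))  # ['kernel1', 'kernel2']
print(build_in_graph())          # ['kernel1', ('cond', 'pred', 'kernel2', 'kernel3')]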

ailzhang avatar Apr 26 '22 16:04 ailzhang

Update 05/16 (work in progress):

We're considering adding support for static compute graph construction and its execution.

Goals:

  • The graph can be serialized, then deployed and run in an environment without a Python runtime.
  • It'd be super helpful if we could run the graph in the Python frontend as if it were deployed, to improve the AOT debugging experience.
  • Normal Taichi Python frontend users can opt in to graph execution mode as well. We've noticed pretty heavy overhead from Python->C++ communication, especially for small kernels on a powerful GPU. Launching a graph instead of individual small kernels from Python can dramatically reduce the overhead for these light kernels.

Non-goals:

  • Graph mode won't be as flexible as the current Python JIT (dynamic graph), as it won't allow host execution (like returning a value from a Taichi kernel to Python and conditioning on it); see the sketch below.
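As a concrete example of what this rules out, here's a minimal JIT-only pattern (the field and kernel below are made up for illustration):

import taichi as ti
ti.init(arch=ti.cpu)

v = ti.field(ti.f32, shape=1024)

# The kernel returns a value to Python and the host branches on it. A
# static graph cannot capture this `if`, which is why host execution is
# out of scope for graph mode.
@ti.kernel
def peek() -> ti.f32:
    return v[0]        # forces a device->host sync when called

if peek() > 1.0:       # host-side control flow on a device-computed value
    print("host decided to apply damping")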

Key terminology (credits to @bobcao3):

  • Each graph is composed of a series of nodes. There are currently three types of nodes: Dispatch, Sequential, and Conditional. When a graph is invoked, we evaluate the graph by evaluating all the individual nodes sequentially. Running a node doesn't return any value to Python.
  • Dispatch is a basic node that executes a Taichi kernel with a specific set of arguments.
  • A Sequential node can be considered a list of nodes, where each node is evaluated sequentially.
  • Conditional: to be added.
  • You can view a graph as a container with a root Sequential node; it also manages the alloc/dealloc of the nodes inside the graph.
  • Symbolic arguments: users are required to create symbolic arguments and define the data flow when building the graph. When you invoke the graph, you're required to pass a runtime value for each corresponding argument. Currently only scalar/vector/matrix values or ndarrays are supported as runtime values. (A toy structural sketch follows this list.)
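To make the node terminology concrete, here's a toy, pure-Python model of the hierarchy (the names mirror the terms above, but this is an illustration, not the real implementation):

class Dispatch:
    """Leaf node: executes one kernel with a fixed set of (symbolic) args."""
    def __init__(self, kernel, *args):
        self.kernel, self.args = kernel, args

    def run(self, bindings):
        # Resolve symbolic arguments to the runtime values bound at invocation.
        self.kernel(*(bindings.get(a, a) for a in self.args))

class Sequential:
    """A list of nodes evaluated in order; running it returns nothing."""
    def __init__(self):
        self.nodes = []

    def append(self, node):
        self.nodes.append(node)

    def run(self, bindings):
        for node in self.nodes:
            node.run(bindings)

class Graph:
    """Container owning a root Sequential node (and, in the real thing,
    the alloc/dealloc of every node inside it)."""
    def __init__(self, name):
        self.name, self.root = name, Sequential()

    def run(self, bindings):
        self.root.run(bindings)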

Proposed APIs and a typical workflow:

** Note these are not finalized; please feel free to comment if you have any suggestions!

The code below is truncated to make it easier to understand; please see the "Prototype" section for a full example.

  • Build a compute graph

g_update = ti.graph.Graph('update')
substep = g_update.create_sequential()

float2 = ti.types.vector(2, float)
# Symbolic arguments declare the data flow; runtime values are bound later,
# when the graph is invoked.
sym_grid_v = ti.graph.Arg('grid_v', dtype=float2)
...
substep.emplace(substep_reset_grid, sym_grid_v, sym_grid_m)  # TODO: consider using append instead of emplace, just a bit more checking in Python.
substep.emplace(substep_p2g, sym_x, sym_v, sym_C, sym_J, sym_grid_v,
                sym_grid_m)

# Record 500 substeps so the whole simulation step runs in one graph launch.
for i in range(500):
    g_update.append(substep)

  • Compile

g_update.compile()

  • Execute in the Python frontend

# Allocate runtime ndarrays matching the symbolic arguments above.
x = ti.Vector.ndarray(2, ti.f32, shape=(n_particles))
v = ti.Vector.ndarray(2, ti.f32, shape=(n_particles))
C = ti.Matrix.ndarray(2, 2, ti.f32, shape=(n_particles))
J = ti.ndarray(ti.f32, shape=(n_particles))
grid_v = ti.Vector.ndarray(2, ti.f32, shape=(n_grid, n_grid))
grid_m = ti.ndarray(ti.f32, shape=(n_grid, n_grid))

# Bind runtime values to the symbolic argument names and launch the graph.
g_update.run({'x': x, 'v': v, 'C': C, 'J': J, 'grid_v': grid_v, 'grid_m': grid_m})
  • Serialize to prepare for running in C++

mod = ti.aot.Module(ti.vulkan)
mod.add_graph(g_update)
mod.save('shaders', '')
  • Run in C++

std::unique_ptr<taichi::lang::aot::Module> module =
    taichi::lang::aot::Module::load(taichi::Arch::vulkan, mod_params);
auto g_update = module->load_graph("update");

// C++ version; we will change this to the C API.
std::unordered_map<std::string, taichi::lang::aot::IValue> args;

// Bind device allocations (buffer, size in bytes, shape) to the symbolic
// argument names.
args.insert({"grid_v", taichi::lang::aot::IValue(devalloc_grid_v, N_GRID * N_GRID * 2 * sizeof(float), {N_GRID, N_GRID, 2})});
args.insert({"grid_m", taichi::lang::aot::IValue(devalloc_grid_m, N_GRID * N_GRID * sizeof(float), {N_GRID, N_GRID})});

g_update->run(args);

Preliminary result

In terms of reducing Python launch overhead, we noticed a 3x speedup (15 fps -> 45 fps) when running the mpm88 example (500 substeps) on an RTX 3090 after switching to graph execution mode.
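For context, here's a rough way such an fps comparison can be measured (a sketch only; render_frame stands in for whichever launch strategy is being timed and is not part of the proposed API):

import time
import taichi as ti

def fps(render_frame, frames=100):
    ti.sync()                          # flush pending GPU work first
    start = time.perf_counter()
    for _ in range(frames):
        render_frame()
    ti.sync()                          # wait for all submitted work
    return frames / (time.perf_counter() - start)

# e.g. one graph launch per frame:
#   fps(lambda: g_update.run({'x': x, 'v': v, 'C': C, 'J': J,
#                             'grid_v': grid_v, 'grid_m': grid_m}))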

Q&A:

  • How about control flow? Since graph execution mode is mainly designed to maximize performance, dynamic control flow (which involves host execution) is temporarily out of scope. But we might add support for ti.cond(field_val, true_clause, false_clause) on supported hardware; a hypothetical sketch follows this Q&A.

  • What do you support as graph arguments? Currently scalars and ndarrays are supported. Support for Taichi fields will be added after https://github.com/taichi-dev/taichi/blob/master/docs/rfcs/20220413-aot-for-all-snode.md is implemented.
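Purely as an illustration of how such a ti.cond node might read (nothing here is implemented; the kernel names apply_damping and skip_damping are made up, and only the signature comes from the sentence above):

# Hypothetical sketch: the predicate lives on device, so the branch can be
# resolved at graph run time without a device->host sync.
sym_flag = ti.graph.Arg('flag', dtype=ti.i32)
g_update.append(ti.cond(sym_flag, apply_damping, skip_damping))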

Prototype:

Check out a proof-of-concept implementation and an mpm88 example in C++.

ailzhang avatar May 16 '22 04:05 ailzhang

Thanks for summarizing this, it looks great!

substep = g_update.create_sequential()

Have we discussed whether a Sequential node is attached to a specific graph instance, or can be a general node? In the latter case, we would write substep = ti.graph.create_sequential(). I feel like making it a general node is more intuitive, but there could be factors I've forgotten to consider.

substep.emplace(substep_reset_grid, sym_grid_v, sym_grid_m)

I still feel like emplace is too C++-flavored. If we view the graph as declarative, I think it could be substep.call(kernel, args...).

k-ye avatar May 17 '22 02:05 k-ye

FYI, you can find a few examples of loading serialized mpm88/sph/stable-fluid demos and launching them in C++ at https://github.com/ailzhang/taichi-aot-demo. We're in the process of adjusting & finalizing the public-facing API; will send more updates soon!

ailzhang avatar Jun 14 '22 03:06 ailzhang