Introduce Training C++ APIs

Open baijumeswani opened this issue 2 years ago • 2 comments

This pull request introduces C++ APIs for on-device training scenarios. It also updates the existing sample trainer to use the new C++ interface.

Usage of the API

#include "onnxruntime_training_c_api.h"

auto checkpoint_state = Ort::LoadCheckpoint(...);
auto session = Ort::TrainingSession(...);

// training loop
...
    auto outputs = session.TrainStep(...);
    session.OptimizerStep();
    session.ResetGrad();
...

session.EvalStep(...);

Ort::SaveCheckpoint(...);
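
To make the three-call training loop above concrete, here is a minimal, self-contained sketch. It is not the onnxruntime API: `ToySession`, `Train`, and `weight()` are hypothetical names, the model is a one-parameter linear fit, and the sketch assumes that gradients accumulate across `TrainStep` calls until they are reset, which is what the separate `ResetGrad` call in the snippet suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a training session; NOT the onnxruntime API.
// Model: y = w * x, loss = mean squared error over the batch.
class ToySession {
 public:
  explicit ToySession(double learning_rate) : lr_(learning_rate) {}

  // Forward + backward pass: returns the batch loss and ACCUMULATES dLoss/dw.
  double TrainStep(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && !x.empty());
    double loss = 0.0;
    double grad = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
      const double err = w_ * x[i] - y[i];
      loss += err * err;
      grad += 2.0 * err * x[i];
    }
    const double n = static_cast<double>(x.size());
    grad_ += grad / n;
    return loss / n;
  }

  // Plain SGD update using whatever gradient has been accumulated so far.
  void OptimizerStep() { w_ -= lr_ * grad_; }

  // Clears the accumulated gradient so the next TrainStep starts from zero.
  void ResetGrad() { grad_ = 0.0; }

  // Forward pass only: the gradient buffer is not touched.
  double EvalStep(const std::vector<double>& x, const std::vector<double>& y) const {
    assert(x.size() == y.size() && !x.empty());
    double loss = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
      const double err = w_ * x[i] - y[i];
      loss += err * err;
    }
    return loss / static_cast<double>(x.size());
  }

  double weight() const { return w_; }

 private:
  double w_ = 0.0;     // the single trainable parameter
  double grad_ = 0.0;  // accumulated gradient of the loss w.r.t. w_
  double lr_;          // learning rate
};

// Same call order as the snippet: TrainStep, OptimizerStep, ResetGrad.
double Train(ToySession& session, const std::vector<double>& x,
             const std::vector<double>& y, int steps) {
  double loss = 0.0;
  for (int i = 0; i < steps; ++i) {
    loss = session.TrainStep(x, y);
    session.OptimizerStep();
    session.ResetGrad();
  }
  return loss;
}
```

Calling `Train` on data generated by y = 2x drives `weight()` toward 2. Skipping `ResetGrad` in this toy would make every update reuse stale gradients, which is one plausible reason for keeping the reset as its own call.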

baijumeswani avatar Sep 16 '22 17:09 baijumeswani

This pull request introduces 5 alerts when merging 7fb6107fc2f38e8a2d82e9f1bb3ee3b2e1d685ea into 14365b67a0de0acc1e6423477d7363bd5a98f0d7 - view on LGTM.com

new alerts:

  • 5 for Uncontrolled data used in path expression

lgtm-com[bot] avatar Sep 19 '22 19:09 lgtm-com[bot]

Hi @baijumeswani, this is very interesting, thanks for adding a C++ API. Looking at the trainer, I see that I need to provide an evaluation graph, a training graph, and an optimizer graph as inputs. How should those graphs be generated (i.e. what is the intended use of the trainer)? Is it possible to generate those graphs using onnx/onnxruntime directly, without relying on other ML frameworks? Could you provide any pointer to documentation/test/header file that could help me on that? Thanks in advance!

Mattia

mlupetti avatar Sep 21 '22 08:09 mlupetti

Hi @mlupetti. Sorry for the late reply. The trainer requires 4 files to perform training:

  1. The checkpoint file.
  2. The training ONNX model.
  3. The eval ONNX model (optional input).
  4. The optimizer ONNX model (optional input).

These files can be generated using the offline Python tooling. The documentation for generating these files can be found here.

The main purpose behind these APIs is to provide a way to perform training through onnxruntime; in particular one of the main goals is to perform training on the edge. Let me know if you have more questions.
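
As an illustration of the four inputs listed above (two required, two optional), the following sketch groups them in one struct and checks them before training starts. Everything here is hypothetical: `TrainingArtifacts`, `Validate`, `CanEvaluate`, and `CanRunOptimizer` are illustrative names, not part of onnxruntime, and the sketch only inspects path strings without touching the file system.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical container for the artifacts produced by the offline step.
struct TrainingArtifacts {
  std::string checkpoint_path;                      // required
  std::string training_model_path;                  // required
  std::optional<std::string> eval_model_path;       // optional
  std::optional<std::string> optimizer_model_path;  // optional
};

// Returns a list of problems; an empty list means the set is usable.
std::vector<std::string> Validate(const TrainingArtifacts& a) {
  std::vector<std::string> problems;
  if (a.checkpoint_path.empty()) {
    problems.push_back("checkpoint file is required");
  }
  if (a.training_model_path.empty()) {
    problems.push_back("training model is required");
  }
  if (a.eval_model_path && a.eval_model_path->empty()) {
    problems.push_back("eval model path, if given, must not be empty");
  }
  if (a.optimizer_model_path && a.optimizer_model_path->empty()) {
    problems.push_back("optimizer model path, if given, must not be empty");
  }
  return problems;
}

// Without an eval model there is nothing to run an evaluation step on.
bool CanEvaluate(const TrainingArtifacts& a) {
  return a.eval_model_path.has_value();
}

// Without an optimizer model there is nothing to run an optimizer step on.
bool CanRunOptimizer(const TrainingArtifacts& a) {
  return a.optimizer_model_path.has_value();
}
```

Modelling the optional inputs with `std::optional` (C++17) keeps "not provided" distinct from "provided but empty", so a caller can decide up front whether evaluation or optimizer steps are available.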

baijumeswani avatar Sep 26 '22 16:09 baijumeswani

Thanks for your answer, that clarifies a lot of things. I'm trying to train a model only from C++ using onnxruntime, and that's why this branch was really interesting.

I'm able to build a graph in C++, similarly to what onnx_helper.py does (that's easy, it's just a matter of setting up the right protos). The blocker for me is that the offline tools rely on the GradientGraphBuilder, which is not part of the API.

Are there any plans to expose a fully fledged C++ training API? I mean one with which one can build a graph, add a loss and an optimizer, and run a training session. I see the classes are all there already.

Also, there are two different training sessions in the codebase now, one here and the other here. Is one of the two getting deprecated at some point?

mlupetti avatar Sep 28 '22 06:09 mlupetti

The goal of this two-step process is to make deployment easy for on-device training scenarios: users are expected to generate the files in an offline step on a server and then deploy these pre-generated files to a device for the actual training.

What you're looking for is a complete C++ training solution where the offline Python step can also be performed in C++. This has not been planned yet. I will speak internally to see if we can/should plan this. Tagging @askhade @kshama-msft for awareness.

The training session defined here is no longer under active development and will be deprecated.

baijumeswani avatar Sep 28 '22 17:09 baijumeswani

@mlupetti: To generate the training artifacts, you need to build the training Python package from source with the --enable_training_on_device option. A sample build command for Linux looks something like this:

[image: sample build command]

I think a CPU-only package should also work. Try skipping the "cuda" options.

askhade avatar Sep 29 '22 03:09 askhade

FYI: https://github.com/microsoft/onnxruntime/pull/13215. Perhaps you could hold off on this?

yuslepukhin avatar Oct 04 '22 20:10 yuslepukhin

OK, I will wait for #13215 to merge and update this PR accordingly.

baijumeswani avatar Oct 04 '22 23:10 baijumeswani

I had a look at the ONNX offline tooling, and it is very similar to what I wrote in C++: the code is basically a collection of utilities that write ONNX proto messages and check that they fit together. The only missing piece I'd need is the GradientGraphBuilder in the training API. With that, I would be able to create all the artifacts needed by the online training pipeline. Would it be possible to add it to the training API? I can contribute if needed.
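
For readers unfamiliar with what a gradient graph builder contributes, the sketch below shows the general idea on a toy graph format. It is not onnxruntime's GradientGraphBuilder and uses no ONNX protos: `Node`, `Graph`, `BuildGradientGraph`, `Run`, and the `_grad`/`tmpN` naming scheme are all hypothetical. It assumes the forward graph is topologically sorted, supports only `Add` and `Mul` on scalars, and assumes every node contributes to the loss.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy graph: each node has an op, input names, and one output.
struct Node {
  std::string op;
  std::vector<std::string> inputs;
  std::string output;
};
using Graph = std::vector<Node>;
using Values = std::map<std::string, double>;

std::string GradName(const std::string& name) { return name + "_grad"; }

// Walks the forward graph in reverse and emits the nodes that compute
// "<name>_grad" for every tensor. The caller seeds the loss gradient.
Graph BuildGradientGraph(const Graph& forward) {
  Graph backward;
  std::set<std::string> seen;  // tensors that already have a gradient
  int tmp = 0;
  // First contribution becomes the gradient; later ones are summed in.
  auto accumulate = [&](const std::string& target, const std::string& part) {
    const std::string g = GradName(target);
    if (seen.insert(target).second) {
      backward.push_back(Node{"Identity", {part}, g});
    } else {
      backward.push_back(Node{"Add", {g, part}, g});  // toy: overwrites g
    }
  };
  for (auto it = forward.rbegin(); it != forward.rend(); ++it) {
    assert(it->inputs.size() == 2);
    const std::string gout = GradName(it->output);
    if (it->op == "Add") {
      // d(a + b)/da = 1 and d(a + b)/db = 1
      accumulate(it->inputs[0], gout);
      accumulate(it->inputs[1], gout);
    } else if (it->op == "Mul") {
      // d(a * b)/da = b and d(a * b)/db = a
      for (int i = 0; i < 2; ++i) {
        const std::string t = "tmp" + std::to_string(tmp++);
        backward.push_back(Node{"Mul", {gout, it->inputs[1 - i]}, t});
        accumulate(it->inputs[i], t);
      }
    } else {
      assert(false && "unsupported op");
    }
  }
  return backward;
}

// Tiny interpreter so the generated gradient graph can be checked.
void Run(const Graph& graph, Values& v) {
  for (const Node& n : graph) {
    double r = 0.0;
    if (n.op == "Identity") {
      r = v.at(n.inputs[0]);
    } else if (n.op == "Add") {
      r = v.at(n.inputs[0]) + v.at(n.inputs[1]);
    } else if (n.op == "Mul") {
      r = v.at(n.inputs[0]) * v.at(n.inputs[1]);
    } else {
      assert(false && "unsupported op");
    }
    v[n.output] = r;
  }
}
```

To use it, run the forward graph, set the gradient of the loss tensor to 1 in the value map, and then run the graph returned by `BuildGradientGraph`. A real builder additionally has to handle tensors, many more operators, and nodes that do not feed the loss.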

mlupetti avatar Oct 11 '22 12:10 mlupetti