[Training] IR version incompatibility in artifact generation for on-device training
Describe the issue
Trying to execute the example notebook provided in on_device_training/desktop/python/mnist.ipynb results in an IR version incompatibility error: the optimizer only supports IR version <= 9, while the generated artifacts use version 10.
To reproduce
- Install the on-device training dependencies for the offline stage as instructed here
- Install the additional dependencies needed to execute the notebook (I initially added ipykernel ipywidgets torch torchvision matplotlib netron evaluate to requirements.txt, then installed them one by one after each ImportError to check that wasn't the problem)
- Execute the notebook up to the first cell of section "3 - Initialize Module and Optimizer"; no errors should be raised
- Execute the first cell of the section, which should raise the following error:

```python
# Create checkpoint state.
state = CheckpointState.load_checkpoint("data/checkpoint")

# Create module.
model = Module("data/training_model.onnx", state, "data/eval_model.onnx")

# Create optimizer.
optimizer = Optimizer("data/optimizer_model.onnx", model)
```
```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[18], line 8
      5 model = Module("data/training_model.onnx", state, "data/eval_model.onnx")
      7 # Create optimizer.
----> 8 optimizer = Optimizer("data/optimizer_model.onnx", model)

File venv/lib/python3.12/site-packages/onnxruntime/training/api/optimizer.py:24, in Optimizer.__init__(self, optimizer_uri, module)
     23 def __init__(self, optimizer_uri: str | os.PathLike, module: Module):
---> 24     self._optimizer = C.Optimizer(
     25         os.fspath(optimizer_uri), module._state._state, module._device, module._session_options
     26     )

RuntimeError: /onnxruntime_src/orttraining/orttraining/training_api/optimizer.cc:273 void onnxruntime::training::api::Optimizer::Initialize(const onnxruntime::training::api::ModelIdentifiers&, const std::vector<std::shared_ptr<onnxruntime::IExecutionProvider> >&, gsl::span<OrtCustomOpDomain* const>) [ONNXRuntimeError] : 1 : FAIL : Load model from data/optimizer_model.onnx failed:/onnxruntime_src/onnxruntime/core/graph/model.cc:179 onnxruntime::Model::Model(onnx::ModelProto&&, const onnxruntime::PathString&, const onnxruntime::IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&, const onnxruntime::ModelOptions&) Unsupported model IR version: 10, max supported IR version: 9
```
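For context, the failure is a hard version gate inside ONNX Runtime rather than a problem with the training artifacts themselves: newer onnx releases (1.16 and later, to my understanding) serialize models with IR version 10, while onnxruntime-training 1.17.x only accepts IR version 9 or lower. A minimal sketch of that check, with the limit taken from the error message above rather than from the actual ONNX Runtime source:

```python
# Sketch of the version gate behind the error above (not ONNX Runtime's
# actual code). MAX_SUPPORTED_IR_VERSION comes from the error message.
MAX_SUPPORTED_IR_VERSION = 9  # onnxruntime-training 1.17.x


def check_ir_version(model_ir_version: int) -> None:
    """Mimic the load-time rejection of too-new ONNX IR versions."""
    if model_ir_version > MAX_SUPPORTED_IR_VERSION:
        raise RuntimeError(
            f"Unsupported model IR version: {model_ir_version}, "
            f"max supported IR version: {MAX_SUPPORTED_IR_VERSION}"
        )


check_ir_version(9)  # artifacts generated with an older onnx load fine

try:
    check_ir_version(10)  # artifacts generated with a newer onnx hit the gate
except RuntimeError as e:
    print(e)
```

This is why the same notebook fails identically on ROCm, CPU, and Colab: the artifact-generation step, not the execution provider, determines the IR version.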
Urgency
I need to develop on top of this for a project due next month.
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.17.3
PyTorch Version
2.3.0+cu121
Execution Provider
ROCm
Execution Provider Library Version
ROCm 6.0.2
I suspect an incompatibility with system packages or other Python package versions could be to blame, since I'm running EndeavourOS (a rolling-release, Arch-based distro) with Python 3.12.3.
I tried downgrading onnx to 1.14.1, but I got a build error from absl complaining that my compiler didn't support C++14 (which is weird, since it should, but I gave up at that point).
Just checked: I also get the same error in Google Colab following the same steps, but running on CPU with Python 3.10.12.
@tomaz-suller what version of ONNX are you using? If you haven't already, could you try with onnx==1.15.0? Also, what version of onnxruntime-training are you using?
It does work with onnx==1.15.0 in Colab. I'm using onnx-training-cpu==1.17.3.
Edit: locally, I still get the absl build error about C++14 that I mentioned when trying to downgrade, but then the issue isn't with ONNX anymore.
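The working Colab setup described above can be reproduced by pinning onnx to a release that still emits IR version 9 artifacts before the artifact-generation step. A minimal sketch, assuming pip and the versions mentioned in this thread:

```shell
# Pin onnx below 1.16 so the generated artifacts use IR version 9
# (versions taken from this thread; adjust onnxruntime-training to your setup).
pip install "onnx==1.15.0" "onnxruntime-training==1.17.3"
```

The pin matters only where the artifacts are generated; the runtime side just needs a matching onnxruntime-training build.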
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Hello, this problem still exists with onnx==1.19. How should it be solved?