[Training] IR version incompatibility in artifact generation for on-device training
Describe the issue
Trying to execute the example notebook provided in on_device_training/desktop/python/mnist.ipynb results in an IR version incompatibility error: the optimizer only supports IR version <= 9, while the generated artifacts use version 10.
To reproduce
- Install the on-device training dependencies for the offline stage as instructed here
- Install the additional dependencies needed to execute the notebook (I initially added ipykernel ipywidgets torch torchvision matplotlib netron evaluate to requirements.txt, then installed them one by one after each ImportError to check that wasn't the problem)
- Execute the notebook up to the first cell of section "3 - Initialize Module and Optimizer"; no errors should be raised
- Execute the first cell of the section, which should raise the following error:

```python
# Create checkpoint state.
state = CheckpointState.load_checkpoint("data/checkpoint")

# Create module.
model = Module("data/training_model.onnx", state, "data/eval_model.onnx")

# Create optimizer.
optimizer = Optimizer("data/optimizer_model.onnx", model)
```
```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[18], line 8
      5 model = Module("data/training_model.onnx", state, "data/eval_model.onnx")
      7 # Create optimizer.
----> 8 optimizer = Optimizer("data/optimizer_model.onnx", model)

File venv/lib/python3.12/site-packages/onnxruntime/training/api/optimizer.py:24, in Optimizer.__init__(self, optimizer_uri, module)
     23 def __init__(self, optimizer_uri: str | os.PathLike, module: Module):
---> 24     self._optimizer = C.Optimizer(
     25         os.fspath(optimizer_uri), module._state._state, module._device, module._session_options
     26     )

RuntimeError: /onnxruntime_src/orttraining/orttraining/training_api/optimizer.cc:273 void onnxruntime::training::api::Optimizer::Initialize(const onnxruntime::training::api::ModelIdentifiers&, const std::vector<std::shared_ptr<onnxruntime::IExecutionProvider> >&, gsl::span<OrtCustomOpDomain* const>) [ONNXRuntimeError] : 1 : FAIL : Load model from data/optimizer_model.onnx failed:/onnxruntime_src/onnxruntime/core/graph/model.cc:179 onnxruntime::Model::Model(onnx::ModelProto&&, const onnxruntime::PathString&, const onnxruntime::IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&, const onnxruntime::ModelOptions&) Unsupported model IR version: 10, max supported IR version: 9
```
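For context, the failure is a hard version gate inside ONNX Runtime rather than a problem with the training artifacts themselves: newer onnx releases (1.16 and later, to my understanding) serialize models with IR version 10, while onnxruntime-training 1.17.x only accepts IR version 9 or lower. A minimal sketch of that check, with the limit taken from the error message above rather than from the actual ONNX Runtime source:

```python
# Sketch of the version gate behind the error above (not ONNX Runtime's
# actual code). MAX_SUPPORTED_IR_VERSION comes from the error message.
MAX_SUPPORTED_IR_VERSION = 9  # onnxruntime-training 1.17.x


def check_ir_version(model_ir_version: int) -> None:
    """Mimic the load-time rejection of too-new ONNX IR versions."""
    if model_ir_version > MAX_SUPPORTED_IR_VERSION:
        raise RuntimeError(
            f"Unsupported model IR version: {model_ir_version}, "
            f"max supported IR version: {MAX_SUPPORTED_IR_VERSION}"
        )


check_ir_version(9)  # artifacts generated with an older onnx load fine

try:
    check_ir_version(10)  # artifacts generated with a newer onnx hit the gate
except RuntimeError as e:
    print(e)
```

This is why the same notebook fails identically on ROCm, CPU, and Colab: the artifact-generation step, not the execution provider, determines the IR version.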
Urgency
I need to develop on top of this for a project due next month.
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.17.3
PyTorch Version
2.3.0+cu121
Execution Provider
ROCm
Execution Provider Library Version
ROCm 6.0.2
I suspect an incompatibility with system packages or other Python package versions could be to blame, since I'm running EndeavourOS (a rolling-release, Arch-based distro) with Python 3.12.3.
I tried downgrading onnx to 1.14.1, but I got a build error from absl complaining that my compiler didn't support C++14 (which is weird, since it should, but I gave up at that point).
Just checked: I also get the same error in Google Colab following the same steps, but running on CPU with Python 3.10.12.
@tomaz-suller what version of ONNX are you using? If you haven't already, could you try with onnx==1.15.0? Also, what version of onnxruntime-training are you using?
It does work with onnx==1.15.0 in Colab. I'm using onnx-training-cpu==1.17.3.
Edit: locally, I still get the absl build error about C++14 that I mentioned when trying to downgrade, but then the issue isn't with ONNX anymore.
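The working Colab setup described above can be reproduced by pinning onnx to a release that still emits IR version 9 artifacts before the artifact-generation step. A minimal sketch, assuming pip and the versions mentioned in this thread:

```shell
# Pin onnx below 1.16 so the generated artifacts use IR version 9
# (versions taken from this thread; adjust onnxruntime-training to your setup).
pip install "onnx==1.15.0" "onnxruntime-training==1.17.3"
```

The pin matters only where the artifacts are generated; the runtime side just needs a matching onnxruntime-training build.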
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Hello, this problem still exists with onnx==1.19. How should it be solved?