Merged ONNX decoder next steps
Feature request
PR https://github.com/huggingface/optimum/pull/647 has been merged, adding support for exporting the decoder without/with past key values as a single merged ONNX file, along with inference support in ORTModelForCausalLM.
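For context, here is a minimal usage sketch of the merged decoder from Python. The checkpoint name is illustrative, and depending on the optimum version the export argument may be `from_transformers=True` rather than `export=True`:

```python
# Minimal sketch: export a decoder-only model to ONNX and generate with
# ORTModelForCausalLM. use_cache=True keeps the past-key-values path so the
# merged without/with-past decoder can reuse the KV cache during generation.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "gpt2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True, use_cache=True)

inputs = tokenizer("Merged ONNX decoders can", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```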
Some key steps still remain:
- [x] Support IO Binding https://github.com/huggingface/optimum/pull/797 (a usage sketch follows this list)
- [ ] Find a way to hide the prints `2023-02-10 16:29:24.868007832 [W:onnxruntime:, graph.cc:3487 CleanUnusedInitializersAndNodeArgs] Removing initializer '/transformer/h.4/attn/Constant_18_output_0'. It is not used by any node and should be removed from the model.`, tracked in https://github.com/microsoft/onnxruntime/issues/14694
- [ ] Fix the generation of dummy past key values for `bloom` that is currently ugly
- [x] Investigate why `codegen` does not support `-with-past` in `tasks.py`
- [x] Support the merge for Seq2Seq model
- [x] Support ONNX Runtime optimizations along with merged models https://github.com/huggingface/optimum/pull/807
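For the IO Binding and log-noise items above, a hedged sketch of how both are typically configured through `ORTModelForCausalLM`. The checkpoint is illustrative, and whether lowering the session log level actually hides these particular warnings is an assumption, not verified against the linked onnxruntime issue:

```python
# Sketch: lower ONNX Runtime log verbosity (may hide the
# CleanUnusedInitializersAndNodeArgs warnings) and enable IO Binding on the
# CUDA execution provider. Older optimum versions may require
# from_transformers=True instead of export=True.
import onnxruntime as ort
from optimum.onnxruntime import ORTModelForCausalLM

session_options = ort.SessionOptions()
session_options.log_severity_level = 3  # 0=verbose ... 3=error; silences warnings

model = ORTModelForCausalLM.from_pretrained(
    "gpt2",
    export=True,
    use_cache=True,
    provider="CUDAExecutionProvider",
    use_io_binding=True,          # keep inputs/outputs on the GPU between steps
    session_options=session_options,
)
```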
Motivation
Reduce memory usage
Your contribution
/
Hi @un-certainty, yes, if you are using CUDAExecutionProvider, using IO Binding is probably helpful. I don't have a proper benchmark at hand, though.
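For reference, a rough timing sketch one could start from (not a proper benchmark: it lacks warmup, GPU synchronization, and varied batch/sequence sizes; checkpoint and prompt are illustrative):

```python
# Rough comparison of generation latency with and without IO Binding on
# CUDAExecutionProvider. Re-exporting the model inside the loop is slow but
# keeps the sketch self-contained.
import time
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Benchmarking IO Binding", return_tensors="pt").to("cuda")

for use_io_binding in (False, True):
    model = ORTModelForCausalLM.from_pretrained(
        model_id,
        export=True,
        use_cache=True,
        provider="CUDAExecutionProvider",
        use_io_binding=use_io_binding,
    )
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=64)
    print(f"use_io_binding={use_io_binding}: {time.perf_counter() - start:.3f}s")
```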
Also, I wonder: if the caches are preserved on the GPU, could that potentially cause a memory explosion? When QPS is high and sequences are long, there will be a lot of intermediate tensors. I'm not sure if this could lead to OOM.
I would say it could, yes.
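As a back-of-the-envelope check, the KV cache grows linearly with batch size and sequence length, so its footprint can be estimated directly. The configuration below is an illustrative GPT-2-like setup in fp32, not a measurement of any specific deployment:

```python
# KV cache size: 2 (K and V) * layers * batch * heads * seq_len * head_dim
# * bytes per element.
def kv_cache_bytes(layers, batch, heads, seq_len, head_dim, bytes_per_elem=4):
    return 2 * layers * batch * heads * seq_len * head_dim * bytes_per_elem

size = kv_cache_bytes(layers=12, batch=32, heads=12, seq_len=1024, head_dim=64)
print(f"{size / 1024**3:.2f} GiB")  # ~2.25 GiB for this illustrative config
```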