Merged ONNX decoder next steps
Feature request
PR https://github.com/huggingface/optimum/pull/647 has been merged, adding support for exporting the decoder without/with past key values as a single merged ONNX file, along with inference support in ORTModelForCausalLM.
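For context, here is a minimal usage sketch of the merged decoder from Python. The checkpoint name is illustrative, and depending on the optimum version the export argument may be `from_transformers=True` rather than `export=True`:

```python
# Minimal sketch: export a decoder-only model to ONNX and generate with
# ORTModelForCausalLM. use_cache=True keeps the past-key-values path so the
# merged without/with-past decoder can reuse the KV cache during generation.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "gpt2"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True, use_cache=True)

inputs = tokenizer("Merged ONNX decoders can", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```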
Some key steps still remain:
- [x] Support IO Binding https://github.com/huggingface/optimum/pull/797 (a usage sketch follows this list)
- [ ] Find a way to hide the prints `2023-02-10 16:29:24.868007832 [W:onnxruntime:, graph.cc:3487 CleanUnusedInitializersAndNodeArgs] Removing initializer '/transformer/h.4/attn/Constant_18_output_0'. It is not used by any node and should be removed from the model.`, tracked in https://github.com/microsoft/onnxruntime/issues/14694
- [ ] Fix the generation of dummy past key values for `bloom` that is currently ugly
- [x] Investigate why `codegen` does not support `-with-past` in `tasks.py`
- [x] Support the merge for Seq2Seq model
- [x] Support ONNX Runtime optimizations along with merged models https://github.com/huggingface/optimum/pull/807
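For the IO Binding and log-noise items above, a hedged sketch of how both are typically configured through `ORTModelForCausalLM`. The checkpoint is illustrative, and whether lowering the session log level actually hides these particular warnings is an assumption, not verified against the linked onnxruntime issue:

```python
# Sketch: lower ONNX Runtime log verbosity (may hide the
# CleanUnusedInitializersAndNodeArgs warnings) and enable IO Binding on the
# CUDA execution provider. Older optimum versions may require
# from_transformers=True instead of export=True.
import onnxruntime as ort
from optimum.onnxruntime import ORTModelForCausalLM

session_options = ort.SessionOptions()
session_options.log_severity_level = 3  # 0=verbose ... 3=error; silences warnings

model = ORTModelForCausalLM.from_pretrained(
    "gpt2",
    export=True,
    use_cache=True,
    provider="CUDAExecutionProvider",
    use_io_binding=True,          # keep inputs/outputs on the GPU between steps
    session_options=session_options,
)
```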
Motivation
Reduce memory usage
Your contribution
/
Hi @un-certainty, yes, if you are using CUDAExecutionProvider, using IO Binding is probably helpful. I don't have a proper benchmark at hand, though.
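For reference, a rough timing sketch one could start from (not a proper benchmark: it lacks warmup, GPU synchronization, and varied batch/sequence sizes; checkpoint and prompt are illustrative):

```python
# Rough comparison of generation latency with and without IO Binding on
# CUDAExecutionProvider. Re-exporting the model inside the loop is slow but
# keeps the sketch self-contained.
import time
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Benchmarking IO Binding", return_tensors="pt").to("cuda")

for use_io_binding in (False, True):
    model = ORTModelForCausalLM.from_pretrained(
        model_id,
        export=True,
        use_cache=True,
        provider="CUDAExecutionProvider",
        use_io_binding=use_io_binding,
    )
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=64)
    print(f"use_io_binding={use_io_binding}: {time.perf_counter() - start:.3f}s")
```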
Also, I wonder: if the caches are preserved on the GPU, could that potentially cause a memory explosion? When QPS is high and sequences are long, there will be a lot of intermediate tensors. I'm not sure if this could lead to OOM.
I would say it could, yes.
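As a back-of-the-envelope check, the KV cache grows linearly with batch size and sequence length, so its footprint can be estimated directly. The configuration below is an illustrative GPT-2-like setup in fp32, not a measurement of any specific deployment:

```python
# KV cache size: 2 (K and V) * layers * batch * heads * seq_len * head_dim
# * bytes per element.
def kv_cache_bytes(layers, batch, heads, seq_len, head_dim, bytes_per_elem=4):
    return 2 * layers * batch * heads * seq_len * head_dim * bytes_per_elem

size = kv_cache_bytes(layers=12, batch=32, heads=12, seq_len=1024, head_dim=64)
print(f"{size / 1024**3:.2f} GiB")  # ~2.25 GiB for this illustrative config
```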