
Merged ONNX decoder next steps

Open · fxmarty opened this issue 2 years ago · 1 comment

Feature request

The PR https://github.com/huggingface/optimum/pull/647 was merged, adding support for exporting the decoder without and with past key values as a single merged ONNX file, along with inference support in ORTModelForCausalLM (see the sketch after the task list below).

Some key steps remain:

  • [x] Support IO Binding https://github.com/huggingface/optimum/pull/797
  • [ ] Find a way to hide prints such as `2023-02-10 16:29:24.868007832 [W:onnxruntime:, graph.cc:3487 CleanUnusedInitializersAndNodeArgs] Removing initializer '/transformer/h.4/attn/Constant_18_output_0'. It is not used by any node and should be removed from the model.`, tracked in https://github.com/microsoft/onnxruntime/issues/14694 (a possible workaround is sketched after this list)
  • [ ] Fix the generation of dummy past key values for bloom, which is currently hacky
  • [x] Investigate why codegen does not support `-with-past` in `tasks.py`
  • [x] Support the merge for Seq2Seq models
  • [x] Support ONNX Runtime optimizations along with merged models https://github.com/huggingface/optimum/pull/807

Motivation

Reduce memory usage: a single merged decoder avoids loading two near-identical copies of the weights (one for the without-past decoder, one for the with-past decoder).

Your contribution

/

fxmarty · Feb 15 '23 15:02

Hi @un-certainty, yes, if you are using CUDAExecutionProvider, IO Binding is probably helpful. I don't have a proper benchmark at hand, though; a rough way to measure it yourself is sketched below.

> Also, I wonder: if the caches are preserved on GPU, will it potentially cause a memory explosion? When QPS is high and sequences are long, there will be many intermediate tensors. I'm not sure if this could lead to OOM.

I would say it could, yes.

fxmarty · Feb 23 '23 13:02