fxmarty
Yes, this was fixed in https://github.com/huggingface/optimum/pull/1780, which is not yet in a release. Please downgrade to onnx 1.15 or use optimum from source.
Hi @MrRace, if you don't want to reimplement the inference code from scratch, I advise you to use https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort#optimum.onnxruntime.ORTModelForSpeechSeq2Seq. An example is available there. By default, only `encoder_model.onnx` and `decoder_model_merged.onnx`...
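For reference, here is a minimal sketch of what I have in mind, assuming a Whisper checkpoint (the model id and the dummy audio below are placeholders, substitute your own):

```python
import numpy as np
from transformers import AutoProcessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_id = "openai/whisper-tiny"  # placeholder checkpoint

# Export to ONNX on the fly and run generation through ONNX Runtime.
processor = AutoProcessor.from_pretrained(model_id)
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)

# `audio` stands in for a real 16 kHz mono waveform.
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

generated_ids = model.generate(inputs.input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```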
@anilmartha thank you for the report; this is unexpected. I did not add a full CI for bf16 but should probably add one covering the most used models. It appears...
Hi @clarinevong, I cannot reproduce the issue on Linux; this is likely a PyTorch x Windows bug. I would recommend opening a bug report in the PyTorch repo (although the...
Thank you for giving it a try on Linux! I still cannot reproduce, using Python 3.10.14 and

```
optimum==1.18.1
torch==2.2.2+cu118
transformers==4.39.3
onnx==1.15.0
onnxruntime==1.17.1
```

Could you share your `pip freeze`?
Hi @OE-LUCIFER, can you give me a reference for HelpingAI? I cannot find it in Transformers.
@kazssym let me know if you'd like a review.
Hi @kapilsingh93, thank you. I can reproduce (only on a CUDA device, though); this is not expected, sorry for the issue. Let me fix it shortly.
@kapilsingh93 Interestingly, downgrading to torch 2.0.1 fixes the issue... It may be a torch regression. I hit the issue even with `torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=False)`, and only on a CUDA device.
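For context, this is roughly how I constrained the SDPA backend while reproducing; a sketch only, the shapes and dtype below are arbitrary:

```python
import torch
import torch.nn.functional as F

# Arbitrary attention inputs, just to exercise kernel selection on CUDA.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Force the math backend only, ruling out flash / memory-efficient kernels
# as the source of the discrepancy.
with torch.backends.cuda.sdp_kernel(
    enable_flash=False, enable_math=True, enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v)
```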
@kapilsingh93 It would help debugging if you could confirm whether using torch 2.0.1 brings back equal performance.