29 comments of Kiran R

@JoeREISys I ran the same script in Colab and I'm getting the following results, so maybe it's a device issue. ``` Downloading: 100% 1.43k/1.43k [00:00

Thank you! @tobigue was able to export `mbart` to ONNX, so he might be able to help.

Cool! I also had some issues with `1.7.0`: while using `onnxruntime==1.7.0` for quantizing, it created extra models; here's the [issue](https://github.com/microsoft/onnxruntime/issues/6888). Applying `optimize_model=False` fixed it.
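In case it helps, this is roughly the call that worked for me in that onnxruntime version (file names are placeholders, not the exact fastT5 code):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# rough sketch: skip onnxruntime's own graph optimization during
# quantization so no extra optimized model files get written alongside
# the quantized one. file names below are placeholders.
quantize_dynamic(
    model_input="t5-encoder.onnx",
    model_output="t5-encoder-quantized.onnx",
    weight_type=QuantType.QUInt8,
    optimize_model=False,  # this is what avoided the extra models for me
)
```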

Constant folding replaces some of the operations that have all-constant inputs; it's not clear why it creates the embedding twice in BART. In T5 I did not face any issue with...
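A toy illustration of the kind of thing that gets folded (the module and file name are made up for the example):

```python
import torch

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(4, 3))

    def forward(self, x):
        # weight.t() only has a constant input (the parameter), so with
        # do_constant_folding=True the exporter can precompute the
        # transposed weight and store it as an initializer instead of
        # keeping a Transpose node in the exported graph.
        return x @ self.weight.t()

torch.onnx.export(Toy(), torch.randn(2, 3), "toy.onnx", do_constant_folding=True)
```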

Also I noticed that in the notebook:

```python
input_names = [x.name for x in self.decoder.get_inputs()]
inputs = [
    input_ids.cpu().numpy(),
    attention_mask.cpu().numpy(),
] + [tensor.cpu().numpy() for tensor in flat_past_key_values]
decoder_inputs...
```

> It only happens for the init_decoder and I saw that in fastT5 you do not do constant folding for the init decoder (only for encoder and decoder). https://github.com/Ki6an/fastT5/blob/master/fastT5/onnx_exporter.py#L196

You're right...
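For reference, it's just the `do_constant_folding` flag of `torch.onnx.export` that is switched off for that one export; a minimal sketch (the stand-in module, inputs, and file name are placeholders, not the real fastT5 code):

```python
import torch

# sketch only: export the init decoder without constant folding, while the
# encoder/decoder exports keep it enabled. `InitDecoderStandIn` is a dummy
# stand-in, not the real init decoder.
class InitDecoderStandIn(torch.nn.Module):
    def forward(self, input_ids, encoder_hidden_states):
        return encoder_hidden_states.sum(dim=-1) + input_ids.float()

torch.onnx.export(
    InitDecoderStandIn(),
    (torch.ones(1, 4, dtype=torch.long), torch.randn(1, 4, 8)),
    "init-decoder.onnx",
    opset_version=12,
    do_constant_folding=False,  # off only for the init decoder
    input_names=["input_ids", "encoder_hidden_states"],
    output_names=["logits"],
)
```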

For GPU you can use the [`onnxruntime-gpu`](https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#contents) library, but it does not support quantization, so you won't have the advantage of reduced model size during inference. [Here's](https://github.com/microsoft/onnxruntime/blob/dfe42e185c6c6de68177db8ecf307645ce831aec/onnxruntime/python/tools/transformers/notebooks/PyTorch_Bert-Squad_OnnxRuntime_GPU.ipynb) an example implementation...
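The basic idea is just to create the session with the CUDA execution provider; a rough sketch (the model path is a placeholder):

```python
import onnxruntime as ort

# rough sketch: run an exported model on GPU via onnxruntime-gpu.
# onnxruntime falls back to CPU if CUDA isn't available.
session = ort.InferenceSession(
    "t5-encoder.onnx",  # placeholder path to an exported model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # confirm CUDAExecutionProvider is active
```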

Are you using it on GPU?

Sorry, the library does not support GPU yet, but the issue looks similar to https://github.com/microsoft/onnxruntime/issues/3113. Are you facing the same issue on CPU?

It looks like the issue is in onnxruntime itself; I suggest you create an issue there.