
ERROR:tf2onnx.tfonnx:Tensorflow op [sequential_1/bidirectional_1/forward_lstm_1/CudnnRNNV3: CudnnRNNV3] is not supported

Open · josephgiting opened this issue on Oct 29, 2025 · 7 comments

Opening this issue following the discussion at https://github.com/keras-team/keras/issues/21533#issuecomment-3455553930

Exporting the following model using:

keras==3.12.0, tf2onnx==1.16.1, onnx==1.19.1, protobuf==3.20.3, tensorflow==2.19.0

  # Imports added for completeness; hyperparameters (sam_sz, num_features_in,
  # conv_lay, filter, ker_size, act, pool_size, lstm_lay, act2) are assumed
  # to be defined earlier.
  from keras.models import Sequential
  from keras.layers import Input, Conv1D, MaxPooling1D, Bidirectional, LSTM, Dropout, Dense
  from keras.regularizers import l2

  model = Sequential()
  model.add(Input(shape=(sam_sz, num_features_in)))  # calculated input shape
  for _ in range(conv_lay):
    model.add(Conv1D(filters=filter, kernel_size=ker_size, activation=act, padding='same', data_format='channels_last'))
    model.add(MaxPooling1D(pool_size=pool_size, data_format='channels_last'))
  for _ in range(lstm_lay):
    # Intermediate Bidirectional LSTM layers return full sequences
    model.add(Bidirectional(LSTM(100, return_sequences=True, kernel_regularizer=l2(0.0001))))
    model.add(Dropout(0.3))
  # Final Bidirectional LSTM returns only the last output
  model.add(Bidirectional(LSTM(100, return_sequences=False, kernel_regularizer=l2(0.0001))))
  model.add(Dropout(0.3))
  model.add(Dense(units=3, activation=act2, kernel_regularizer=l2(0.0001)))

with model.export(output_path, format="onnx"), I observed the following errors:

WARNING:tf2onnx.shape_inference:Cannot infer shape for sequential_1/bidirectional_1/forward_lstm_1/CudnnRNNV3: sequential_1/bidirectional_1/forward_lstm_1/CudnnRNNV3:3,sequential_1/bidirectional_1/forward_lstm_1/CudnnRNNV3:4
WARNING:tf2onnx.shape_inference:Cannot infer shape for sequential_1/bidirectional_1/backward_lstm_1/CudnnRNNV3: sequential_1/bidirectional_1/backward_lstm_1/CudnnRNNV3:3,sequential_1/bidirectional_1/backward_lstm_1/CudnnRNNV3:4
ERROR:tf2onnx.tfonnx:Tensorflow op [sequential_1/bidirectional_1/forward_lstm_1/CudnnRNNV3: CudnnRNNV3] is not supported
ERROR:tf2onnx.tfonnx:Tensorflow op [sequential_1/bidirectional_1/backward_lstm_1/CudnnRNNV3: CudnnRNNV3] is not supported
ERROR:tf2onnx.tfonnx:Unsupported ops: Counter({'CudnnRNNV3': 2})

Would appreciate your attention.

josephgiting avatar Oct 29 '25 10:10 josephgiting

Hi @josephgiting, thanks for reporting this. I've tested your code with Keras 3.12.0 and I'm not able to reproduce the error you've mentioned. Please refer to this gist and let me know if I missed anything.

dhantule avatar Oct 29 '25 11:10 dhantule

@dhantule Many thanks for looking into this issue so quickly. Using your link, I reran it on Google Colab with the T4 GPU runtime type, and you can observe the issue there (along with the pip freeze output): https://colab.research.google.com/gist/josephgiting/bde42a3999e9cd764984b5aec156a41e/-21799.ipynb

NOTE: This issue is not reproducible with CPU runtime type.

Please let me know if I can assist further. Kind Regards,

josephgiting avatar Oct 29 '25 12:10 josephgiting

Hi @josephgiting, thanks for letting me know, we'll look into this.

dhantule avatar Oct 29 '25 16:10 dhantule

To me this looks like a case where tf2onnx does not support the ops we'd need for RNN model export. I'm not sure this is something we could fix on the Keras side, but it would be great to add support for it in tf2onnx.

Here's a relevant issue -> https://github.com/onnx/tensorflow-onnx/issues/2359

You might be able to work around the issue by passing use_cudnn=False to your RNN layers for now. Long term, we'd probably want to add support in tf2onnx unless there's a particular reason not to.

Also tagging @james77777778 who added onnx support, in case he has any ideas how we could work around this during export so things work while we are missing coverage for these ops.

mattdangerw avatar Oct 29 '25 17:10 mattdangerw

Thanks @mattdangerw

I have updated my gist with use_cudnn=False and it works.

Loading the ONNX model with onnxruntime works too.
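For reference, a minimal sketch of such an onnxruntime check (the model path and input shape below are placeholders; onnxruntime is assumed to be installed):

```python
import numpy as np

def verify_onnx(path, sample):
    """Load an ONNX model and run one forward pass; returns the outputs."""
    import onnxruntime as ort  # imported lazily; assumed installed
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    return sess.run(None, {input_name: sample.astype(np.float32)})

# Example (assuming the export wrote "model.onnx" with a (1, 100, 8) input):
# outputs = verify_onnx("model.onnx", np.zeros((1, 100, 8)))
```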

josephgiting avatar Oct 29 '25 18:10 josephgiting

Thanks! I think we can leave this open, as the bug is still valid, to track it while we wait for support from tf2onnx.

mattdangerw avatar Oct 30 '25 04:10 mattdangerw

Experiments show that setting use_cudnn=False significantly increases training time, which likely indicates that cuDNN (GPU acceleration) is not being used, effectively the same as running on the CPU. That would explain why the issue doesn't appear with the CPU runtime.

If that's the case, the relevant team may need to investigate and fix this issue.

josephgiting avatar Oct 31 '25 10:10 josephgiting