QAT model saving bug: KeyError: '__inference_depthwise_conv2d_layer_call_fn_126'

peri044 opened this issue 4 years ago · 19 comments

Describe the bug
Please download the scripts to reproduce from: https://drive.google.com/drive/folders/15cajAZ9sAZ2Uyix8sDVSYku6QCqDCec7?usp=sharing

Command to run: python sample_qat.py

I have a simple model with an input layer and a DepthwiseConv2D layer. I quantize this model by adding quantize_and_dequantize nodes at the input of the DepthwiseConv2D layer (commented in the code). When I save the model and load it back, I see the following error:

  File "/home/dperi/Downloads/py3/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 544, in <lambda>
    "function": lambda: self._recreate_function(proto.function),
  File "/home/dperi/Downloads/py3/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 586, in _recreate_function
    proto, self._concrete_functions), setattr
  File "/home/dperi/Downloads/py3/lib/python3.6/site-packages/tensorflow/python/saved_model/function_deserialization.py", line 295, in recreate_function
    concrete_function_objects.append(concrete_functions[concrete_function_name])
KeyError: '__inference_depthwise_conv2d_layer_call_and_return_conditional_losses_117'
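
For reference, a minimal sketch of the kind of setup that triggers this, using tfmot's stock quantize_model rather than the manual quantize_and_dequantize placement in the Drive scripts (layer sizes and paths are placeholders):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

inputs = tf.keras.Input(shape=(32, 32, 3))
outputs = tf.keras.layers.DepthwiseConv2D(kernel_size=3)(inputs)
model = tf.keras.Model(inputs, outputs)

# quantize_model wraps the layer and inserts quantize/dequantize ops
# around its inputs and kernel.
q_model = tfmot.quantization.keras.quantize_model(model)

q_model.save('export_dir')  # saving succeeds
with tfmot.quantization.keras.quantize_scope():
    reloaded = tf.keras.models.load_model('export_dir')  # KeyError raised here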

System information

TensorFlow version (installed from source or binary): 2.5 (Tried with 2.6 as well)

TensorFlow Model Optimization version (installed from source or binary):

SavedModel loading fails specifically for depthwise convolution; it works fine for regular convolution.

peri044 avatar Oct 22 '21 06:10 peri044

Hi @Xhark, I hit the same bug when quantizing MobileNet v2.

System information

TensorFlow version (installed from binary): 2.5.0 => TensorFlow Model Optimization version (installed from binary): 0.6.0

TensorFlow version (installed from binary): 2.5.1 => TensorFlow Model Optimization version (installed from binary): 0.7.0

TensorFlow version (installed from binary): 2.4.0 => TensorFlow Model Optimization version (installed from binary): 0.7.0

Python version: 3.8.12

Jia-HongHenryLee avatar Oct 25 '21 16:10 Jia-HongHenryLee

Hi @Xhark and @peri044 ,

I solved my problem by using the following environment:

System information
TensorFlow version (installed from binary): tf-nightly-gpu 2.5.0.dev20201202 (https://www.cnpython.com/pypi/tf-nightly-gpu/download)
TensorFlow Model Optimization version (installed from binary): 0.6.0
Python version: 3.8.12

Jia-HongHenryLee avatar Oct 26 '21 05:10 Jia-HongHenryLee

Hi peri044@ and Jia-HongHenryLee@

I'm looking into it now, but there are a couple of workarounds. First, it seems to save correctly if you use

model.save('export_dir', save_format='h5')

I think this is caused by incorrect shape handling for the depthwise kernel quantization parameters, which results in functions not being traced/merged correctly.

Thanks for reporting this.

daverim avatar Nov 01 '21 05:11 daverim

Thank you @daverim for addressing this. Can you let me know when this will be resolved, or if there's an active PR for it? I haven't tried the h5 format, since I'm using the SavedModel format to pass the model through TF2ONNX (with custom utilities) for processing.

peri044 avatar Nov 07 '21 20:11 peri044

Hello @daverim, can you please suggest some pointers on how to fix this locally (using the saved_model format)? Which files/functions should I look at? Thanks!

peri044 avatar Nov 15 '21 17:11 peri044

Hey @peri044. If your ultimate goal is to convert the model into TFLite format, you can pass a ConcreteFunction around; TFLiteConverter.from_concrete_functions works just fine for me.
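
In case it helps, a minimal sketch of that route (q_model and the input signature are placeholders to adapt to your model):

import tensorflow as tf

# The input signature must match your model's input; [1, 224, 224, 3] is
# just an example.
@tf.function(input_signature=[tf.TensorSpec([1, 224, 224, 3], tf.float32)])
def serve(x):
  return q_model(x)

converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [serve.get_concrete_function()])
with open('model.tflite', 'wb') as f:
  f.write(converter.convert())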

ChanZou avatar Nov 15 '21 19:11 ChanZou

Hello @ChanZou, my ultimate goal is to use the saved_model format (if it works) and pass it through TF2ONNX to convert the model into an ONNX graph. TF2ONNX currently accepts the saved_model format for graphs.

peri044 avatar Nov 15 '21 20:11 peri044

Hello @daverim, any suggestions on how to resolve this would be appreciated. Thanks!

peri044 avatar Jan 06 '22 00:01 peri044

Hi, sorry for the delay.

I just tested your sample code and it seems to be resolved now; there are some warnings about untraced functions.

Using: tf==2.8.0-dev20210930, tensorflow_model_optimization==0.7.0

Please try and see if it works for you. Thanks, David

daverim avatar Jan 06 '22 02:01 daverim

Thanks @daverim. That works now.

peri044 avatar Jan 26 '22 00:01 peri044

@daverim I encountered the same error log for SeparableConv2D using TF 2.8.0 (no error with DepthwiseConv2D in that TF version):

...
Traceback (most recent call last):
  File "/home/PycharmProjects/tensorrt_qat/examples/mobilenet/run_qat_workflow.py", line 156, in <module>
    main(verbose=True)
  File "/home/PycharmProjects/tensorrt_qat/examples/mobilenet/run_qat_workflow.py", line 142, in main
    tf.keras.models.save_model(q_model, os.path.join(qat_save_finetuned_weights, "saved_model"))
  File "/home/PycharmProjects/tensorrt_qat/venv38_tf2.8_newPR/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/PycharmProjects/tensorrt_qat/venv38_tf2.8_newPR/lib/python3.8/site-packages/tensorflow/python/saved_model/save.py", line 403, in map_resources
    raise ValueError(
ValueError: Unable to save function b'__inference_block2_sepconv1_layer_call_fn_670910' because it captures graph tensor Tensor("xception/quant_block2_sepconv1/LastValueQuant_1/QuantizeAndDequantizeV4:0", shape=(3, 3, 64, 1), dtype=float32) from a parent function which cannot be converted to a constant with `tf.get_static_value`.

Do you have any idea what caused the error in DepthwiseConv2D and if the same fix would work for SeparableConv2D? Thank you!

gcunhase avatar Mar 31 '22 07:03 gcunhase

The best way to avoid this issue is to disable layer tracing when creating the SavedModel, but you'll have to manually define the serving_default function (this is the default signature name used by TF2ONNX).

# Wrap the model call in a tf.function so an explicit signature can be exported.
@tf.function
def predict(*args, **kwargs):
  return model(*args, **kwargs)

# save_spec() returns the argument TensorSpecs the model was called with.
arg_spec, kwarg_spec = model.save_spec()
model.save(path, save_traces=False, signatures={
  "serving_default": predict.get_concrete_function(*arg_spec, **kwarg_spec)
})
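
If the save succeeds, the exported serving_default signature can then be consumed by TF2ONNX as usual; for example (paths are placeholders):

python -m tf2onnx.convert --saved-model export_dir --output model.onnx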

k-w-w avatar May 18 '22 22:05 k-w-w

Hi @k-w-w, thank you for your feedback! This specific issue (for DepthwiseConv2D) has been solved, as mentioned in the comment from Jan 26 above, but the same issue persists for SeparableConv2D here.

I tried your suggestion, but it did not solve my issue, since the problem is not with tf2onnx but with saving the TF model itself. Do you have any additional suggestions? Thank you!

gcunhase avatar May 19 '22 01:05 gcunhase

@gcunhase Are you getting the same error even with save_traces=False?

k-w-w avatar May 19 '22 03:05 k-w-w

@k-w-w yes

gcunhase avatar May 19 '22 03:05 gcunhase

@gcunhase can you paste the error trace?

k-w-w avatar May 19 '22 17:05 k-w-w

@k-w-w:

...
Traceback (most recent call last):
  File "/home/nvidia/PycharmProjects/nvbugs/internal_filed/tf_key_inference_bug/TF_bug_separableconv2d/sample.py", line 24, in <module>
    model.save(model_save_path)
  File "/home/nvidia/PycharmProjects/nvbugs/venv38_trt_regression/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/nvidia/PycharmProjects/nvbugs/venv38_trt_regression/lib/python3.8/site-packages/tensorflow/python/saved_model/save.py", line 403, in map_resources
    raise ValueError(
ValueError: Unable to save function b'__inference_separable_conv2d_layer_call_fn_961' because it captures graph tensor Tensor("model/quant_separable_conv2d/LastValueQuant_1/QuantizeAndDequantizeV4:0", shape=(3, 3, 3, 1), dtype=float32) from a parent function which cannot be converted to a constant with `tf.get_static_value`.
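
For context, a minimal sketch of the kind of script that produces this trace, assuming tfmot's default scheme is applied to a SeparableConv2D model (layer sizes and paths are placeholders):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

inputs = tf.keras.Input(shape=(32, 32, 3))
outputs = tf.keras.layers.SeparableConv2D(8, kernel_size=3)(inputs)
model = tfmot.quantization.keras.quantize_model(tf.keras.Model(inputs, outputs))

model.save('saved_model')  # ValueError raised here on TF 2.8 / tfmot 0.7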

gcunhase avatar May 20 '22 04:05 gcunhase

This bug also has reproducible code, so we can move our discussion there if you agree.

gcunhase avatar May 20 '22 04:05 gcunhase

This bug can be closed for DepthwiseConv2D. For Conv2DTranspose and SeparableConv2D, please move the discussion here. Thank you!

gcunhase avatar Jul 21 '22 16:07 gcunhase