optimum icon indicating copy to clipboard operation
optimum copied to clipboard

[WIP] Support open-clip onnx export

Open isaac-chung opened this issue 2 years ago • 24 comments

What does this PR do?

Fixes # (issue) https://github.com/huggingface/optimum/issues/1450

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you make sure to update the documentation with your changes?
  • [ ] Did you write any new necessary tests?

isaac-chung avatar Oct 18 '23 13:10 isaac-chung

@fxmarty here's the WIP. Currently when I run the export command, I get

(v1) isaacchung@Isaacs-MBP optimum % optimum-cli export onnx -m laion/CLIP-ViT-B-32-laion2B-s34B-b79K --framework pt clip_onnx
Automatic task detection to zero-shot-image-classification.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["bos_token_id"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["eos_token_id"]` will be overriden.
/Users/isaacchung/work/transformers/src/transformers/models/clip/feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
  warnings.warn(
Using the export variant default. Available variants are:
        - default: The default ONNX variant.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["bos_token_id"]` will be overriden.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["eos_token_id"]` will be overriden.
Using framework PyTorch: 2.0.0
Traceback (most recent call last):
  File "/Users/isaacchung/virtualenv/v1/bin/optimum-cli", line 8, in <module>
    sys.exit(main())
  File "/Users/isaacchung/work/optimum/optimum/commands/optimum_cli.py", line 163, in main
    service.run()
  File "/Users/isaacchung/work/optimum/optimum/commands/export/onnx.py", line 239, in run
    main_export(
  File "/Users/isaacchung/work/optimum/optimum/exporters/onnx/__main__.py", line 505, in main_export
    _, onnx_outputs = export_models(
  File "/Users/isaacchung/work/optimum/optimum/exporters/onnx/convert.py", line 752, in export_models
    export(
  File "/Users/isaacchung/work/optimum/optimum/exporters/onnx/convert.py", line 855, in export
    export_output = export_pytorch(
  File "/Users/isaacchung/work/optimum/optimum/exporters/onnx/convert.py", line 542, in export_pytorch
    dummy_inputs = config.generate_dummy_inputs(framework="pt", **input_shapes)
  File "/Users/isaacchung/work/optimum/optimum/exporters/onnx/base.py", line 454, in generate_dummy_inputs
    dummy_inputs_generators = self._create_dummy_input_generator_classes(**kwargs)
  File "/Users/isaacchung/work/optimum/optimum/exporters/onnx/base.py", line 222, in _create_dummy_input_generator_classes
    first_inputs_gen = self.DUMMY_INPUT_GENERATOR_CLASSES[0](self.task, self._normalized_config, **kwargs)
  File "/Users/isaacchung/work/optimum/optimum/utils/input_generators.py", line 343, in __init__
    self.vocab_size = normalized_config.vocab_size
  File "/Users/isaacchung/work/optimum/optimum/utils/normalized_config.py", line 109, in __getattr__
    return super().__getattr__(attr_name)
  File "/Users/isaacchung/work/optimum/optimum/utils/normalized_config.py", line 69, in __getattr__
    raise AttributeError(f'Could not find the attribute named "{leaf_attr_name}" in the normalized config.')
AttributeError: Could not find the attribute named "vocab_size" in the normalized config.

I'm stepping away for a bit this evening but I will continue afterwards.

isaac-chung avatar Oct 18 '23 14:10 isaac-chung

@isaac-chung Apology for the convoluted error. Here is a patch for this one: https://github.com/isaac-chung/optimum/pull/2

There is then an other error about the naming of the model inputs. One approach you could take is similar to https://github.com/huggingface/optimum/blob/2f7d396643178ee2adf7868d0e013bb23a75e678/optimum/exporters/onnx/model_configs.py#L801 & https://github.com/huggingface/optimum/blob/2f7d396643178ee2adf7868d0e013bb23a75e678/optimum/exporters/onnx/base.py#L429 so that we keep the same naming as transformers.

Hopefully then input / outputs of the open_clip model is the same as transformers as well and we don't run into more issues.

fxmarty avatar Oct 18 '23 16:10 fxmarty

@fxmarty amazing, thank you! 🙌 I've merged your patch into this branch, and will continue on reconciling the inputs / output naming differences.

In terms of tests, should we use one of these LAION models (the smallest)? Or something smaller?

isaac-chung avatar Oct 18 '23 16:10 isaac-chung

Yes, it appears there are no small open clip models on the Hub. Probably the most reasonable option (if you don't want to go into the hassle of addind a tiny testing model) is to just add a nightly test (@slow) for open clip with a somewhat "small" model.

fxmarty avatar Oct 18 '23 17:10 fxmarty

Yep the @slow option sounds good. Perhaps the exercise of adding a tiny testing model can be separate.

For input mapping, the key matching was straightforward. However, it seems like there's some shape mismatch. Looks like CLIP tokenizers pad inputs to len=77 (see colab notebook), but inputs["input_ids"] shape has (batch_size, 16):

...
  File "/Users/isaacchung/work/optimum/optimum/exporters/onnx/model_patcher.py", line 118, in patched_forward
    outputs = self.orig_forward(*args, **kwargs)
  File "/Users/isaacchung/virtualenv/v1/lib/python3.9/site-packages/open_clip/model.py", line 248, in forward
    text_features = self.encode_text(text, normalize=True) if text is not None else None
  File "/Users/isaacchung/virtualenv/v1/lib/python3.9/site-packages/open_clip/model.py", line 233, in encode_text
    x = x + self.positional_embedding.to(cast_dtype)
RuntimeError: The size of tensor a (16) must match the size of tensor b (77) at non-singleton dimension 1

It does seem like maybe_save_preprocessors was run and the preprocessors were saved, but maybe not initialized properly? Have you seen this kind of mismatch before?

tokenizer: CLIPTokenizerFast(name_or_path='laion/CLIP-ViT-B-32-laion2B-s34B-b79K', vocab_size=49408, model_max_length=77, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
        49406: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
        49407: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}

isaac-chung avatar Oct 18 '23 18:10 isaac-chung

@isaac-chung Thank you. It appears that the tokenize function from open_clip and the open_clip.tokenizer.HFTokenizer do pad by default to tokenizer.tokenizer.model_max_length, while Transformers CLIP do not. It is unclear to me why there is this difference.

Something you could try in the OpenCLIPOnnxConfig is first to remove the dynamic axis for the sequence length by redefining the inputs property:

    @property
    def inputs(self) -> Dict[str, Dict[int, str]]:
        return {
            "input_ids": {0: "text_batch_size"},
            "pixel_values": {0: "image_batch_size", 1: "num_channels", 2: "height", 3: "width"},
            "attention_mask": {0: "text_batch_size"},
        }

As for the shape itself used during the export, it should be held in this kwargs: https://github.com/huggingface/optimum/blob/15b8d1eed4d83c5004d3b60f6b6f13744b358f01/optimum/exporters/onnx/base.py#L451

What I may suggest is to override the shape from the openclip onnx config:

    def generate_dummy_inputs(self, framework: str = "pt", **kwargs):
        # override sequence_length shape here in the kwargs
        return super().generate_dummy_inputs(framework, **kwargs)

You will need to access the tokenizer though, that hopefully you can access under the self._preprocessors attribute.

fxmarty avatar Oct 19 '23 13:10 fxmarty

Thanks for the tips so far, @fxmarty ! I am wondering if we could draw some inspiration from pix2StructOnnxConfig, esp. def overwrite_shape_and_generate_input(...), where they also require padding inputs to the max length.

# pix2struct takes inputs whose so-called sequence length is **static** to max_patches, so we do NOT use
# the passed sequence_length that behaves as a dynamic shape.

Probably needs to be massaged to fit into our OpenCLIPOnnxConfig. Will investigate.

isaac-chung avatar Oct 20 '23 08:10 isaac-chung

@isaac-chung Yes that is a good option as well! It is just that overwrite_shape_and_generate_input is not defined at the top level OnnxConfig, so you would need to define it there (and basically just generate inputs, no override of shapes at the tope level), and redefine it in OpenClipOnnxConfig to override the relevant shapes.

fxmarty avatar Oct 20 '23 08:10 fxmarty

I see what you mean now, thanks again! Will continue with the generate_dummy_inputs approach. Seems more straightforward 💪

isaac-chung avatar Oct 20 '23 08:10 isaac-chung

Seeing a different error, so I'll count this as progress :D It seems like this issue has been resolved, but specifying --opset 18 and installing torch nightly yielded the following

 File "/Users/isaacchung/work/optimum/optimum/exporters/onnx/convert.py", line 883, in export
    config.fix_dynamic_axes(output, device=device, input_shapes=input_shapes, dtype=dtype)
  File "/Users/isaacchung/work/optimum/optimum/exporters/onnx/base.py", line 302, in fix_dynamic_axes
    session = InferenceSession(model_path.as_posix(), providers=providers, sess_options=session_options)
  File "/Users/isaacchung/virtualenv/v1/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/Users/isaacchung/virtualenv/v1/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 463, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for ArgMax(13) node with name '/ArgMax'

This looks similar to this issue but doesn't seem to apply to the types here (read that we only deal with fp16).

isaac-chung avatar Oct 20 '23 17:10 isaac-chung

Hi @isaac-chung , could you check where is the argmax in the modeling code? What do you mean by we only deal with fp16?

fxmarty avatar Oct 23 '23 07:10 fxmarty

The argmax is within model.encode_text. In this line, self.text_pool_type is arg_max.

What do you mean by we only deal with fp16?

I misread things, please disregard.

isaac-chung avatar Oct 23 '23 08:10 isaac-chung

Thank you! Can you check the dtype of text here https://github.com/mlfoundations/open_clip/blob/7b8dd2cbaf9cca13ca5b1defa6a321a145eb166c/src/open_clip/transformer.py#L558?

A solution would be to use the ModelPatcher (see e.g. https://github.com/huggingface/optimum/blob/15b8d1eed4d83c5004d3b60f6b6f13744b358f01/optimum/exporters/onnx/model_patcher.py#L638-L651 & https://github.com/huggingface/optimum/blob/15b8d1eed4d83c5004d3b60f6b6f13744b358f01/optimum/exporters/onnx/model_configs.py#L1415-L1418) to cast the arg text in text_global_pool to a dtype accepted by ONNX Runtime: https://github.com/microsoft/onnxruntime/blob/main/docs/OperatorKernels.md

Something like:

if is_open_clip_available():
    import open_clip

class OpenClipModelPatcher(...):
    def __enter__(...):
        ...
        open_clip.transformer.text_global_pool = text_global_pool_patched

    def __exit__(...):

fxmarty avatar Oct 23 '23 09:10 fxmarty

text is of type torch.Tensor.

And great, thank you! Let me take a look and report back.

isaac-chung avatar Oct 23 '23 09:10 isaac-chung

I meant dtype!

fxmarty avatar Oct 23 '23 10:10 fxmarty

Whoops, let me see. I got int64. Maybe we should cast it to something like int8?

In [5]: text.dtype
Out[5]: torch.int64

It is odd that the main branch does not have the same method.

isaac-chung avatar Oct 23 '23 10:10 isaac-chung

@isaac-chung According to https://github.com/microsoft/onnxruntime/blob/main/docs/OperatorKernels.md you could try to add a cast to torch.int32 before the operation.

fxmarty avatar Oct 23 '23 13:10 fxmarty

Turns out the newest commit that introduced open_clip.transformer.text_global_pool is not in v2.22.0 yet. So I installed from source.

With the newest commit here, the same NOT IMPLEMENTED error still persists. Wondering if my implementation is missing something glaringly obvious 🤔

isaac-chung avatar Oct 23 '23 15:10 isaac-chung

Just a heads up - I anticipate a busier week this week, so hoping to continue the investigation afterwards. If any new ideas pop to mind, feel free to continue the discussion and I can catch back up.

isaac-chung avatar Oct 30 '23 09:10 isaac-chung

It is possible that the discrepancy is due to the task being auto-detected as "zero-shot-classification" instead of "feature-extraction". Currently only the image/text embeds are in the output, where as for ZS-classification we need the logits (I was reading this). The scale more-or-less matches the magnitude of logit_scale.

I haven't been able to confirm this yet - will start looking into adding the feature-extraction task as well and how to add the logic for generating those logits from features in the model config. It seems like this might require the model patcher when the task is ZS-classification?

isaac-chung avatar Nov 07 '23 22:11 isaac-chung

Sorry for the long pause. Printing out the validation step ref outputs vs. onnx outputs, at a glance the values look very close.

Validating ONNX model clip_onnx/model.onnx...
ref_outputs_dict={'image_features': tensor([[ 0.0082, -0.0139, -0.0986,  ...,  0.0043,  0.0170, -0.0243],
        [ 0.0125, -0.0246, -0.0875,  ...,  0.0078,  0.0198, -0.0267]]), 'text_features': tensor([[-0.0114, -0.0058, -0.0060,  ..., -0.0212, -0.0546, -0.0145],
        [-0.0216, -0.0326, -0.0414,  ...,  0.0304,  0.0093,  0.0012]]), 'logit_scale': tensor(100.0000)}
onnx_outputs=[array([[ 0.00823098, -0.01392082, -0.09864607, ...,  0.00433102,
         0.01700564, -0.02425753],
       [ 0.01247843, -0.02459271, -0.08752848, ...,  0.00782712,
         0.01979438, -0.02666864]], dtype=float32), array([[-0.01138339, -0.00584862, -0.0059775 , ..., -0.02118963,
        -0.05462093, -0.01450986],
       [-0.02160461, -0.03263872, -0.04141301, ...,  0.03040083,
         0.00926695,  0.00119737]], dtype=float32), array(100.00001, dtype=float32)]

Validation would only pass if ATOL_FOR_VALIDATION = 1.

isaac-chung avatar Jan 16 '24 14:01 isaac-chung

Thanks all very much, looks like lot of work done to support CLIP indeed. I am looking to try export ONNX. I am using the L14 and here's the cli output. Will utilize and report back if I see any issues.

optimum-cli export onnx -m laion/CLIP-ViT-L-14-laion2B-s32B-b82K --framework pt clip_onnx .../python310_env/lib/python3.10/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True. warnings.warn( Automatic task detection to zero-shot-image-classification. .../python310_env/lib/python3.10/site-packages/transformers/models/clip/feature_extraction_clip.py:28: FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead. warnings.warn( Using the export variant default. Available variants are: - default: The default ONNX variant.

***** Exporting submodel 1/1: CLIPModel ***** Using framework PyTorch: 2.2.1 .../python310_env/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:279: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len): .../python310_env/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:319: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim): .../python310_env/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:86: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! if input_shape[-1] > 1 or self.sliding_window is not None: .../python310_env/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py:162: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! if past_key_values_length > 0: .../python310_env/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:287: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len): .../python310_env/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:296: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! if attention_mask.size() != (bsz, 1, tgt_len, src_len): .../python310_env/lib/python3.10/site-packages/torch/onnx/symbolic_opset9.py:5859: UserWarning: Exporting aten::index operator of advanced indexing in opset 14 is achieved by combination of multiple ONNX operators, including Reshape, Transpose, Concat, and Gather. If indices include negative values, the exported graph will produce incorrect results. warnings.warn( Post-processing the exported models... Deduplicating shared (tied) weights...

Validating ONNX model clip_onnx/model.onnx... -[✓] ONNX model output names match reference model (image_embeds, text_embeds, logits_per_text, logits_per_image) - Validating ONNX Model output "logits_per_image": -[✓] (2, 2) matches (2, 2) -[✓] all values close (atol: 1e-05) - Validating ONNX Model output "logits_per_text": -[✓] (2, 2) matches (2, 2) -[✓] all values close (atol: 1e-05) - Validating ONNX Model output "text_embeds": -[✓] (2, 768) matches (2, 768) -[✓] all values close (atol: 1e-05) - Validating ONNX Model output "image_embeds": -[✓] (2, 768) matches (2, 768) -[✓] all values close (atol: 1e-05) The ONNX export succeeded and the exported model was saved at: clip_onnx

kechan avatar May 31 '24 16:05 kechan