transformers
Add ONNX support for Longformer
What does this PR do?
This PR contributes to #16308 and addresses #16463 by adding support for exporting Longformer to ONNX.
The following necessary changes were already made:
- [x] `LongformerOnnxConfig` implemented
- [x] ONNX opset version >= 12
- [x] fix in model definition with `nn.functional.pad` (see https://github.com/huggingface/transformers/issues/13126#issuecomment-993645323)
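For context, the `nn.functional.pad` fix keeps the sequence-length padding traceable for export. A minimal sketch of the idea — function name, window size, and tensors here are illustrative, not the actual `modeling_longformer.py` code:

```python
import torch
import torch.nn.functional as F

def pad_to_window_multiple(input_ids: torch.Tensor, attention_window: int, pad_token_id: int):
    """Pad the sequence length up to a multiple of the attention window.

    Longformer requires seq_len % attention_window == 0; using F.pad keeps
    the operation traceable for ONNX export, instead of building a new
    tensor with Python-side control flow.
    """
    seq_len = input_ids.shape[1]
    padding_len = (attention_window - seq_len % attention_window) % attention_window
    # F.pad pads the last dimension with (left, right) amounts
    return F.pad(input_ids, (0, padding_len), value=pad_token_id), padding_len

ids = torch.ones(1, 10, dtype=torch.long)
padded, n = pad_to_window_multiple(ids, attention_window=8, pad_token_id=0)
print(padded.shape, n)  # torch.Size([1, 16]) 6
```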
However, there are still some open issues I'd need help with:
- [x] ~~The conversion to ONNX fails when a `global_attention_mask` is provided that contains at least one `1`. It raises the following error: `Only consecutive 1-d tensor indices are supported in exporting aten::index_put to ONNX.` So far, I have been unable to track down which line triggers this error. If we find it, we can probably rewrite the model implementation using this workaround: https://pytorch.org/docs/stable/onnx.html#writes-sets~~ → issue resolved by rewriting the accesses
- [x] ~~The validation check currently fails with a high value difference (3.77). The JIT conversion raises the following warnings. Maybe some of them are the reasons for it:~~ → tracked down and fixed
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:1569: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if padding_len > 0:
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:1256: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
is_global_attn = is_index_global_attn.flatten().any().item()
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:569: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert (
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:805: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert (
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:808: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert query.size() == key.size()
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:598: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert list(attn_scores.size()) == [
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:873: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert seq_len % (window_overlap * 2) == 0
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:874: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert attn_probs.size()[:3] == value.size()[:3]
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:875: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert attn_probs.size(3) == 2 * window_overlap + 1
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:669: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert attn_output.size() == (batch_size, seq_len, self.num_heads, self.head_dim), "Unexpected size"
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:1312: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if padding_len > 0:
Before submitting
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case: #16308, #16463
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests? → default Longformer and ONNX tests
Who can review?
Maybe @ChainYo and/or @lewtun can help with this? 😊
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Hey ✋ excellent PR, the code looks just fine!
I wonder if you tried to specify the right `--feature` while converting your LongFormer model? Which model did you try and what `--feature` did you choose?
> Hey ✋ excellent PR, the code looks just fine!
> I wonder if you tried to specify the right `--feature` while converting your LongFormer model? Which model did you try and what `--feature` did you choose?

Thanks!
I'm currently experimenting with `longformer-base-4096`. The reported difference of 3.77 is with `--feature=default`, but there are large differences with all other features as well (`masked-lm`: 14.1, `sequence-classification`: 0.04, `question-answering`: 0.25, `token-classification`: 0.19, `multiple-choice`: 0.1).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey @deutschmn, did you finally get good results with `Longformer`?
@ChainYo Unfortunately, I didn't get a chance to dive in further yet. I'll try to find some time, but if someone else has any ideas, please let me know.
Hey @ChainYo! I found some time and fixed the issues. Can we reopen? 😊
Adding support for the `global_attention_mask` was pretty easy after I tracked down the unsupported indexing lines, but it took quite a deep dive to find out where the value difference came from. There were two main issues:
- `masked_fill_` produces different results when converting to ONNX. I replaced it with a simple `where`.
- `as_strided` for chunking doesn't work either, presumably because it relies on the underlying memory layout that's different in ONNX. The perfect solution would be to use `unfold`, but unfortunately, that op is not supported. So I added a slow fallback that works in every case. Once there's support for `unfold`, we can get rid of that.
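A toy sketch of both fixes — tensor names and sizes are made up for illustration, not the actual `modeling_longformer.py` changes:

```python
import torch

# (1) masked_fill_ -> where: the in-place fill exported incorrectly, while
# the out-of-place where traces reliably. In eager mode both agree.
scores = torch.randn(2, 4)
mask = torch.tensor([[True, False, True, False],
                     [False, False, True, True]])

filled = scores.clone()
filled.masked_fill_(mask, 0.0)                                   # before
replaced = torch.where(mask, torch.zeros_like(scores), scores)   # after
assert torch.equal(filled, replaced)

# (2) as_strided -> slicing fallback: as_strided depends on a contiguous
# memory layout that ONNX does not guarantee, so overlapping chunks can
# instead be gathered with plain slices (slower, but layout-independent).
hidden = torch.arange(8.0)   # stand-in for (seq_len,) hidden states
window, step = 4, 2          # 50% overlap, as in Longformer's chunking
chunks = torch.stack([hidden[i:i + window]
                      for i in range(0, hidden.numel() - window + 1, step)])
print(chunks.shape)  # torch.Size([3, 4])
```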
> Hey @ChainYo! I found some time and fixed the issues. Can we reopen?
Hey, thanks for iterating on this. I will ping @lewtun to open this again.
Thanks a lot for re-working on this @deutschmn ❤️ ! Ping me when you'd like a review :)
Thanks for reopening, @lewtun. Would be brilliant if you could review now 😊
Thanks for your reviews, @lewtun and @patrickvonplaten 😊 I incorporated all your feedback and added Longformer to the ONNX tests. Slow ONNX + Longformer tests seem to work fine:
RUN_SLOW=1 pytest tests/models/longformer/test_modeling_longformer.py
→ 55 passed, 10 skipped, 14 warnings
=================================================================== test session starts ===================================================================
platform darwin -- Python 3.9.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/patrick/Projects/open-source/transformers, configfile: setup.cfg
plugins: xdist-2.5.0, hypothesis-6.46.3, forked-1.4.0, timeout-2.1.0, dash-2.4.1
collected 65 items
tests/models/longformer/test_modeling_longformer.py ...s.sss..................... [100%]
============================= warnings summary =============================
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training
/Users/patrick/Projects/open-source/transformers/src/transformers/image_utils.py:222: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
def resize(self, image, size, resample=PIL.Image.BILINEAR, default_to_square=True, max_size=None):
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/torchvision/transforms/functional_pil.py:228: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
interpolation: int = Image.BILINEAR,
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/torchvision/transforms/functional_pil.py:295: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
interpolation: int = Image.NEAREST,
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/torchvision/transforms/functional_pil.py:311: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
interpolation: int = Image.NEAREST,
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/torchvision/transforms/functional_pil.py:328: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
interpolation: int = Image.BICUBIC,
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/timm/data/auto_augment.py:39: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
_RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/timm/data/auto_augment.py:39: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
_RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/timm/data/transforms.py:39: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
Image.NEAREST: 'nearest',
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/timm/data/transforms.py:40: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
Image.BILINEAR: 'bilinear',
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/timm/data/transforms.py:41: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
Image.BICUBIC: 'bicubic',
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/timm/data/transforms.py:42: DeprecationWarning: BOX is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BOX instead.
Image.BOX: 'box',
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/timm/data/transforms.py:43: DeprecationWarning: HAMMING is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.HAMMING instead.
Image.HAMMING: 'hamming',
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/timm/data/transforms.py:44: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
Image.LANCZOS: 'lanczos',
tests/models/longformer/test_modeling_longformer.py::LongformerModelTest::test_training_gradient_checkpointing
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/torch/autocast_mode.py:162: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========== 55 passed, 10 skipped, 14 warnings in 86.62s (0:01:26) ==========
RUN_SLOW=1 pytest tests/onnx/test_onnx_v2.py -k "longformer"
→ 12 passed, 377 deselected, 228 warnings
=========================================================================================== test session starts ===========================================================================================
platform darwin -- Python 3.9.10, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/patrick/Projects/open-source/transformers, configfile: setup.cfg
plugins: xdist-2.5.0, hypothesis-6.46.3, forked-1.4.0, timeout-2.1.0, dash-2.4.1
collected 389 items / 377 deselected / 12 selected
tests/onnx/test_onnx_v2.py ............ [100%]
============================================================================================ warnings summary =============================================================================================
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:1610: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if padding_len > 0:
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/torch/_tensor.py:627: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
return self.item().__format__(format_spec)
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/torch/nn/functional.py:2165: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert padding_idx < weight.size(0), "Padding_idx must be within num_embeddings"
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:1297: TracerWarning: Converting a tensor to a Python number might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
is_global_attn = is_index_global_attn.flatten().any().item()
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:565: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert (
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:832: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert (
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:835: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert query.size() == key.size()
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:785: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if hidden_states.size(1) == window_overlap * 2:
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:594: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert list(attn_scores.size()) == [
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:900: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert seq_len % (window_overlap * 2) == 0
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:901: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert attn_probs.size()[:3] == value.size()[:3]
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:902: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert attn_probs.size(3) == 2 * window_overlap + 1
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:668: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert attn_output.size() == (batch_size, seq_len, self.num_heads, self.head_dim), "Unexpected size"
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:1072: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert list(global_attn_scores.size()) == [
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:1122: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert list(global_attn_output.size()) == [
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:691: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
len(is_local_index_global_attn_nonzero[0]), -1
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/Projects/open-source/transformers/src/transformers/models/longformer/modeling_longformer.py:1353: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if padding_len > 0:
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/torch/onnx/symbolic_helper.py:719: UserWarning: allowzero=0 by default. In order to honor zero value in shape use allowzero=1
warnings.warn("allowzero=0 by default. In order to honor zero value in shape use allowzero=1")
tests/onnx/test_onnx_v2.py: 12 warnings
/Users/patrick/.pyenv-x86/versions/3.9.10/envs/transformers-x86_64/lib/python3.9/site-packages/torch/onnx/symbolic_opset9.py:2905: UserWarning: Exporting aten::index operator of advanced indexing in opset 14 is achieved by combination of multiple ONNX operators, including Reshape, Transpose, Concat, and Gather. If indices include negative values, the exported graph will produce incorrect results.
warnings.warn("Exporting aten::index operator of advanced indexing in opset " +
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================== 12 passed, 377 deselected, 228 warnings in 3599.78s (0:59:59) ======================================================================
I merged `main` into this branch to resolve conflicts. Gently pinging @lewtun and @patrickvonplaten for a re-review 😊
> Hey @ChainYo! I found some time and fixed the issues. Can we reopen? 😊
> Adding support for the `global_attention_mask` was pretty easy after I tracked down the unsupported indexing lines, but it took quite a deep dive to find out where the value difference came from. There were two main issues:
> - `masked_fill_` produces different results when converting to ONNX. I replaced it with a simple `where`.
> - `as_strided` for chunking doesn't work either, presumably because it relies on the underlying memory layout that's different in ONNX. The perfect solution would be to use `unfold`, but unfortunately, that op is not supported. So I added a slow fallback that works in every case. Once there's support for `unfold`, we can get rid of that.
Hi @deutschmn, thanks for contributing! As for the tracing problem of `masked_fill_` and `as_strided`: they are both supported in `torch.onnx.symbolic_opset9`. Have you tried intercepting the forward pass of `LongformerSelfAttention` with a `symbolic` method to apply the symbolic tracing?
REF:
- Symbolic doc in PyTorch
- An example: how it was done for DeBERTa
https://github.com/huggingface/transformers/blob/df28de0581aaf6d8742c4988137caac2b6602ca8/src/transformers/models/deberta/modeling_deberta.py#L122-L137
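The DeBERTa trick wraps the problematic op in a `torch.autograd.Function` whose `symbolic` staticmethod tells the exporter which ONNX nodes to emit. A rough sketch of the pattern — the class, op choice, and graph calls are illustrative, not Longformer's actual code:

```python
import torch

class MaskedSoftmax(torch.autograd.Function):
    """Eager forward pass plus a hand-written ONNX lowering, loosely
    following the pattern of DeBERTa's XSoftmax."""

    @staticmethod
    def forward(ctx, scores, mask):
        # mask is True where attention is allowed
        scores = scores.masked_fill(~mask, torch.finfo(scores.dtype).min)
        return torch.softmax(scores, dim=-1)

    @staticmethod
    def symbolic(g, scores, mask):
        # Called by torch.onnx.export instead of tracing forward();
        # g.op emits ONNX graph nodes directly, sidestepping ops the
        # tracer cannot export.
        from torch.onnx.symbolic_opset9 import masked_fill, softmax
        inv_mask = g.op("Not", mask)
        neg = g.op("Constant", value_t=torch.tensor(torch.finfo(torch.float32).min))
        return softmax(g, masked_fill(g, scores, inv_mask, neg), -1)

x = torch.randn(1, 3)
m = torch.tensor([[True, True, False]])
out = MaskedSoftmax.apply(x, m)
print(out)  # row sums to 1, masked position ~0
```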
Hey @JingyaHuang, thanks for your feedback 😊 I haven't looked into symbolic tracing yet. I'm travelling right now, but I'll have another look when I'm back in a couple of weeks.