
Transformer XL training fails with an IndexError due to a change in ModuleList behavior for torch > 1.11

Open krishnanNuance opened this issue 2 years ago • 7 comments

System Info

Transformers version: 4.24; Torch version: > 1.11

Stacktrace:

venv/lib/python3.8/site-packages/transformers/models/transfo_xl/modeling_transfo_xl.py:1115: in forward
    softmax_output = self.crit(pred_hid, labels)
venv/lib/python3.8/site-packages/torch/nn/modules/module.py:1190: in _call_impl
    return forward_call(*input, **kwargs)
venv/lib/python3.8/site-packages/torch/nn/modules/module.py:1178: in _slow_forward
    result = self.forward(*input, **kwargs)
venv/lib/python3.8/site-packages/transformers/models/transfo_xl/modeling_transfo_xl_utilities.py:134: in forward
    head_weight, head_bias, head_proj = weights[0], biases[0], self.out_projs[0]
venv/lib/python3.8/site-packages/torch/nn/modules/container.py:282: in __getitem__
    return self._modules[self._get_abs_string_index(idx)]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = ModuleList(), idx = 0

    def _get_abs_string_index(self, idx):
        """Get the absolute index for the list of modules"""
        idx = operator.index(idx)
        if not (-len(self) <= idx < len(self)):
>           raise IndexError('index {} is out of range'.format(idx))
E           IndexError: index 0 is out of range

venv/lib/python3.8/site-packages/torch/nn/modules/container.py:272: IndexError

Please do let me know if further info is required.

Who can help?

@patrickvonplaten

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Use a generic torch src_token tensor as input, with d_model = d_embed, on torch > 1.11.
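A minimal sketch of what I believe should trigger it (untested; the tiny config values and div_val=1 are illustrative assumptions on my side, not taken from the failing fairseq run):

import torch
from transformers import TransfoXLConfig, TransfoXLLMHeadModel

# Illustrative small config; the key assumption is d_model == d_embed
# (and, I believe, div_val=1 so that no output projections are created).
config = TransfoXLConfig(
    vocab_size=100,
    cutoffs=[10, 50],
    d_model=32,
    d_embed=32,
    div_val=1,
    n_layer=2,
    n_head=2,
    d_head=8,
    d_inner=64,
)
model = TransfoXLLMHeadModel(config)

src_tokens = torch.randint(0, config.vocab_size, (1, 16))
# On torch > 1.11 this forward pass with labels should hit the IndexError in
# the stacktrace above (self.out_projs[0] on an empty list inside the loss head).
outputs = model(src_tokens, labels=src_tokens)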

Expected behavior

Training should work consistently across torch versions.

krishnanNuance avatar Dec 01 '22 14:12 krishnanNuance

Thanks for reporting but could you give us a short reproducer as our CI didn't catch any regression here?

sgugger avatar Dec 01 '22 17:12 sgugger

Thanks for reporting but could you give us a short reproducer as our CI didn't catch any regression here?

I run it as part of fairseq. This test case, https://github.com/facebookresearch/fairseq/blob/main/tests/test_binaries.py#L1319, also fails for the same reason. IIUC, in the fairseq case d_embed = d_model; maybe this condition is required to reproduce the issue?

krishnanNuance avatar Dec 02 '22 07:12 krishnanNuance

That's not exactly a small reproducer we can run on our side ;-)

sgugger avatar Dec 05 '22 16:12 sgugger

Can you point me to the test case that covers training of the Transformer XL model in huggingface? Maybe I can tune its parameters accordingly to reproduce the issue.

krishnanNuance avatar Dec 06 '22 06:12 krishnanNuance

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Dec 31 '22 15:12 github-actions[bot]

Actually, this is still a problem. Can you please try setting the params d_embed and d_model to the same value?
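For a quick check (the values below are illustrative only, and div_val=1 is an assumption on my side about when the projections are skipped), the head's projection list already appears to come out empty at construction time, matching the empty ModuleList() in the stacktrace above:

from transformers import TransfoXLConfig, TransfoXLLMHeadModel

# d_embed == d_model is the suspected trigger; other values are arbitrary.
config = TransfoXLConfig(vocab_size=100, cutoffs=[10, 50],
                         d_model=64, d_embed=64, div_val=1,
                         n_layer=2, n_head=2, d_head=32, d_inner=128)
model = TransfoXLLMHeadModel(config)

# On the torch versions where training fails for me, this should print 0,
# so the later self.out_projs[0] lookup in the loss head raises IndexError.
print(len(model.crit.out_projs))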

krishnanNuance avatar Jan 02 '23 02:01 krishnanNuance

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jan 26 '23 15:01 github-actions[bot]