[RAG] - TypeError: init_retrieval() missing 1 required positional argument: 'distributed_port' for PL==1.5.10
System Info
Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 1.2.0 pypi_0 pypi
aiohttp 3.8.1 pypi_0 pypi
aiosignal 1.2.0 pypi_0 pypi
asttokens 2.0.5 pyhd3eb1b0_0
async-timeout 4.0.2 pypi_0 pypi
attrs 22.1.0 pypi_0 pypi
backcall 0.2.0 pyhd3eb1b0_0
blas 1.0 mkl
bzip2 1.0.8 h7b6447c_0
ca-certificates 2022.07.19 h06a4308_0
cachetools 5.2.0 pypi_0 pypi
certifi 2022.6.15 py38h06a4308_0
charset-normalizer 2.1.0 pypi_0 pypi
click 8.0.4 pypi_0 pypi
cudatoolkit 11.3.1 h9edb442_10 conda-forge
datasets 2.4.0 pypi_0 pypi
debugpy 1.5.1 py38h295c915_0
decorator 5.1.1 pyhd3eb1b0_0
dill 0.3.5.1 pypi_0 pypi
distlib 0.3.5 pypi_0 pypi
entrypoints 0.4 py38h06a4308_0
executing 0.8.3 pyhd3eb1b0_0
faiss-cpu 1.7.2 pypi_0 pypi
ffmpeg 4.2.2 h20bf706_0
filelock 3.8.0 pypi_0 pypi
freetype 2.11.0 h70c0345_0
frozenlist 1.3.1 pypi_0 pypi
fsspec 2022.7.1 pypi_0 pypi
future 0.18.2 pypi_0 pypi
giflib 5.2.1 h7b6447c_0
gitdb 4.0.9 pypi_0 pypi
gitpython 3.1.27 pypi_0 pypi
gmp 6.2.1 h295c915_3
gnutls 3.6.15 he1e5248_0
google-auth 2.10.0 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
grpcio 1.43.0 pypi_0 pypi
huggingface-hub 0.8.1 pypi_0 pypi
idna 3.3 pypi_0 pypi
importlib-metadata 4.12.0 pypi_0 pypi
importlib-resources 5.9.0 pypi_0 pypi
intel-openmp 2021.4.0 h06a4308_3561
ipykernel 6.9.1 py38h06a4308_0
ipython 8.4.0 py38h06a4308_0
jedi 0.18.1 py38h06a4308_1
jpeg 9e h7f8727e_0
jsonschema 4.10.0 pypi_0 pypi
jupyter_client 7.2.2 py38h06a4308_0
jupyter_core 4.10.0 py38h06a4308_0
lame 3.100 h7b6447c_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libdeflate 1.8 h7f8727e_5
libedit 3.1.20210910 h7f8727e_0
libffi 3.3 he6710b0_2
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libidn2 2.3.2 h7f8727e_0
libopus 1.3.1 h7b6447c_0
libpng 1.6.37 hbc83047_0
libsodium 1.0.18 h7b6447c_0
libstdcxx-ng 11.2.0 h1234567_1
libtasn1 4.16.0 h27cfd23_0
libtiff 4.4.0 hecacb30_0
libunistring 0.9.10 h27cfd23_0
libuv 1.40.0 h7b6447c_0
libvpx 1.7.0 h439df22_0
libwebp 1.2.2 h55f646e_0
libwebp-base 1.2.2 h7f8727e_0
lz4-c 1.9.3 h295c915_1
markdown 3.4.1 pypi_0 pypi
markupsafe 2.1.1 pypi_0 pypi
matplotlib-inline 0.1.2 pyhd3eb1b0_2
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py38h7f8727e_0
mkl_fft 1.3.1 py38hd3c417c_0
mkl_random 1.2.2 py38h51133e4_0
msgpack 1.0.4 pypi_0 pypi
multidict 6.0.2 pypi_0 pypi
multiprocess 0.70.13 pypi_0 pypi
ncurses 6.3 h5eee18b_3
nest-asyncio 1.5.5 py38h06a4308_0
nettle 3.7.3 hbbd107a_1
numpy 1.23.1 py38h6c91a56_0
numpy-base 1.23.1 py38ha15fc14_0
oauthlib 3.2.0 pypi_0 pypi
openh264 2.1.1 h4ff587b_0
openssl 1.1.1q h7f8727e_0
packaging 21.3 pypi_0 pypi
pandas 1.4.3 pypi_0 pypi
parso 0.8.3 pyhd3eb1b0_0
pexpect 4.8.0 pyhd3eb1b0_3
pickleshare 0.7.5 pyhd3eb1b0_1003
pillow 9.2.0 py38hace64e9_1
pip 22.1.2 py38h06a4308_0
pkgutil-resolve-name 1.3.10 pypi_0 pypi
platformdirs 2.5.2 pypi_0 pypi
prompt-toolkit 3.0.20 pyhd3eb1b0_0
protobuf 3.19.4 pypi_0 pypi
psutil 5.9.1 pypi_0 pypi
ptyprocess 0.7.0 pyhd3eb1b0_2
pure_eval 0.2.2 pyhd3eb1b0_0
pyarrow 9.0.0 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pydeprecate 0.3.1 pypi_0 pypi
pygments 2.11.2 pyhd3eb1b0_0
pyparsing 3.0.9 pypi_0 pypi
pyrsistent 0.18.1 pypi_0 pypi
python 3.8.13 h12debd9_0
python-dateutil 2.8.2 pyhd3eb1b0_0
pytorch 1.10.1 py3.8_cuda11.3_cudnn8.2.0_0 pytorch
pytorch-lightning 1.5.10 pypi_0 pypi
pytorch-mutex 1.0 cuda pytorch
pytz 2022.2.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
pyzmq 23.2.0 py38h6a678d5_0
ray 1.13.0 pypi_0 pypi
readline 8.1.2 h7f8727e_1
regex 2022.7.25 pypi_0 pypi
requests 2.28.1 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
responses 0.18.0 pypi_0 pypi
rsa 4.9 pypi_0 pypi
setuptools 59.5.0 pypi_0 pypi
six 1.16.0 pyhd3eb1b0_1
smmap 5.0.0 pypi_0 pypi
sqlite 3.39.2 h5082296_0
stack_data 0.2.0 pyhd3eb1b0_0
tensorboard 2.10.0 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.1 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
tokenizers 0.12.1 pypi_0 pypi
torchaudio 0.10.1 py38_cu113 pytorch
torchmetrics 0.6.0 pypi_0 pypi
torchvision 0.11.2 py38_cu113 pytorch
tornado 6.1 py38h27cfd23_0
tqdm 4.64.0 pypi_0 pypi
traitlets 5.1.1 pyhd3eb1b0_0
transformers 4.21.1 pypi_0 pypi
typing_extensions 4.3.0 py38h06a4308_0
urllib3 1.26.11 pypi_0 pypi
virtualenv 20.16.3 pypi_0 pypi
wcwidth 0.2.5 pyhd3eb1b0_0
werkzeug 2.2.2 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
x264 1!157.20191217 h7b6447c_0
xxhash 3.0.0 pypi_0 pypi
xz 5.2.5 h7f8727e_1
yarl 1.8.1 pypi_0 pypi
zeromq 4.3.4 h2531618_0
zipp 3.8.1 pypi_0 pypi
zlib 1.2.12 h7f8727e_2
zstd 1.5.2 ha4553b6_0
Who can help?
@patrickvonplaten @lhoestq
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Hi, I'm trying to run the RAG finetuning script with pytorch-lightning==1.5.10.
Below is my training command:
python examples/research_projects/rag/finetune_rag.py \
--data_dir '/home/keonwookim/transformers/examples/research_projects/rag/finetune_data' \
--output_dir './finetune_rag_token' \
--model_name_or_path facebook/rag-token-base \
--model_type rag_token \
--do_train \
--do_predict \
--train_batch_size 1 \
--accumulate_grad_batches 1 \
--num_retrieval_workers 4 \
--max_epochs 3 \
--fp16 \
--gpus 2
However, I've been struggling with the following error for days.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [2,3]
/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:631: UserWarning: Checkpoint directory /home/keonwookim/transformers/finetune_rag_token exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
Traceback (most recent call last):
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1311, in _run_train
self._run_sanity_check(self.lightning_module)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1368, in _run_sanity_check
self.call_hook("on_sanity_check_start")
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1495, in call_hook
callback_fx(*args, **kwargs)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 78, in on_sanity_check_start
callback.on_sanity_check_start(self, self.lightning_module)
File "/home/keonwookim/transformers/examples/research_projects/rag/lightning_base.py", line 275, in on_sanity_check_start
pl_module.model.rag.retriever.init_retrieval() # better to use hook functions.
TypeError: init_retrieval() missing 1 required positional argument: 'distributed_port'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "examples/research_projects/rag/finetune_rag.py", line 649, in <module>
main(args)
File "examples/research_projects/rag/finetune_rag.py", line 613, in main
trainer: pl.Trainer = generic_train(
File "/home/keonwookim/transformers/examples/research_projects/rag/lightning_base.py", line 405, in generic_train
trainer.fit(model)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
self._call_and_handle_interrupt(
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 698, in _call_and_handle_interrupt
self.training_type_plugin.reconciliate_processes(traceback.format_exc())
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 533, in reconciliate_processes
raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 0
Traceback (most recent call last):
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1311, in _run_train
self._run_sanity_check(self.lightning_module)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1368, in _run_sanity_check
self.call_hook("on_sanity_check_start")
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1495, in call_hook
callback_fx(*args, **kwargs)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 78, in on_sanity_check_start
callback.on_sanity_check_start(self, self.lightning_module)
File "/home/keonwookim/transformers/examples/research_projects/rag/lightning_base.py", line 275, in on_sanity_check_start
pl_module.model.rag.retriever.init_retrieval() # better to use hook functions.
TypeError: init_retrieval() missing 1 required positional argument: 'distributed_port'
In my opinion, it seems that InitCallback() in lightning_base.py is used, which is intended for the Ray distributed retriever.
Expected behavior
Maybe CustomDDP() should be used for training instead, since it passes self.distributed_port as an argument to init_retrieval().
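Something along these lines is what I mean — a rough sketch of a CustomDDP-style plugin, not the exact code from the old example (it assumes PL 1.5's DDPPlugin still exposes a setup_distributed hook and that the module keeps distributed_port in its hparams):

from pytorch_lightning.plugins import DDPPlugin


class CustomDDP(DDPPlugin):
    def setup_distributed(self):
        # Let PL set up the default process group first.
        super().setup_distributed()
        module = self.lightning_module
        # `is_rag_model` is a hypothetical flag on the finetuning module;
        # `distributed_port` mirrors the hparam used by finetune_rag.py.
        if getattr(module, "is_rag_model", False):
            module.model.rag.retriever.init_retrieval(module.hparams.distributed_port)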
Hi @kaiiwoo, I think you can try changing the code below
https://github.com/huggingface/transformers/blob/30992ef0d911bdeca425969d210771118a5cd1ac/examples/research_projects/rag/lightning_base.py#L275
to
pl_module.model.rag.retriever.init_retrieval(pl_module.hparams.distributed_port)
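In context, the callback hook would then look roughly like this (paraphrased, not copied verbatim from lightning_base.py):

from pytorch_lightning import Callback


class InitCallback(Callback):
    def on_sanity_check_start(self, trainer, pl_module):
        # Pass the configured port so the PyTorch distributed retriever
        # can build its gloo process group.
        pl_module.model.rag.retriever.init_retrieval(pl_module.hparams.distributed_port)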
Hi @aRyBernAlTEglOTRO,
Thanks for your suggestion.
I've tried that before, but it takes a very long time to initialize the distributed retriever and eventually fails with a socket error.
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [2,3]
/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:631: UserWarning: Checkpoint directory /home/keonwookim/transformers/finetune_rag_token exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
INFO:distributed_pytorch_retriever:initializing retrieval
INFO:distributed_pytorch_retriever:dist initialized
Traceback (most recent call last):
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1311, in _run_train
self._run_sanity_check(self.lightning_module)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1368, in _run_sanity_check
self.call_hook("on_sanity_check_start")
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1495, in call_hook
callback_fx(*args, **kwargs)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 78, in on_sanity_check_start
callback.on_sanity_check_start(self, self.lightning_module)
File "/home/keonwookim/transformers/examples/research_projects/rag/lightning_base.py", line 275, in on_sanity_check_start
pl_module.model.rag.retriever.init_retrieval(pl_module.hparams.distributed_port) # better to use hook functions.
File "/home/keonwookim/transformers/examples/research_projects/rag/distributed_pytorch_retriever.py", line 66, in init_retrieval
self.process_group = dist.new_group(ranks=None, backend="gloo")
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2898, in new_group
pg = _new_process_group_helper(
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 684, in _new_process_group_helper
pg = ProcessGroupGloo(prefix_store, rank, world_size, timeout=timeout)
RuntimeError: Socket Timeout
Hi @kaiiwoo, I think there is no problem with the code, so please disregard the change I suggested earlier. I think you forgot to pass --distributed_retriever ray when running examples/research_projects/rag/finetune_rag.py; maybe you can try that. Compared to the original CustomDDP, the current InitCallback only works when distributed_retriever is set to ray.
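For reference, the original command with that flag added would look something like this:
python examples/research_projects/rag/finetune_rag.py \
--data_dir '/home/keonwookim/transformers/examples/research_projects/rag/finetune_data' \
--output_dir './finetune_rag_token' \
--model_name_or_path facebook/rag-token-base \
--model_type rag_token \
--distributed_retriever ray \
--do_train \
--do_predict \
--train_batch_size 1 \
--accumulate_grad_batches 1 \
--num_retrieval_workers 4 \
--max_epochs 3 \
--fp16 \
--gpus 2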
Hi, I am the person who revamped the RAG code base for the latest PL. Please follow @aRyBernAlTEglOTRO's instructions; the current version only works with Ray.
https://github.com/shamanez/transformers/blob/main/examples/research_projects/rag/lightning_base.py#L396
The main reason is that the latest PyTorch Lightning discourages the use of custom plugins, so please use Ray as the default distributed retriever.
@patrickvonplaten Shall we update the README?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
We don't maintain the research folders actively. Feel free to open a PR though if you want @shamanez - happy to review and merge