[RAG] - TypeError: init_retrieval() missing 1 required positional argument: 'distributed_port' for PL==1.5.10
System Info
Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
absl-py 1.2.0 pypi_0 pypi
aiohttp 3.8.1 pypi_0 pypi
aiosignal 1.2.0 pypi_0 pypi
asttokens 2.0.5 pyhd3eb1b0_0
async-timeout 4.0.2 pypi_0 pypi
attrs 22.1.0 pypi_0 pypi
backcall 0.2.0 pyhd3eb1b0_0
blas 1.0 mkl
bzip2 1.0.8 h7b6447c_0
ca-certificates 2022.07.19 h06a4308_0
cachetools 5.2.0 pypi_0 pypi
certifi 2022.6.15 py38h06a4308_0
charset-normalizer 2.1.0 pypi_0 pypi
click 8.0.4 pypi_0 pypi
cudatoolkit 11.3.1 h9edb442_10 conda-forge
datasets 2.4.0 pypi_0 pypi
debugpy 1.5.1 py38h295c915_0
decorator 5.1.1 pyhd3eb1b0_0
dill 0.3.5.1 pypi_0 pypi
distlib 0.3.5 pypi_0 pypi
entrypoints 0.4 py38h06a4308_0
executing 0.8.3 pyhd3eb1b0_0
faiss-cpu 1.7.2 pypi_0 pypi
ffmpeg 4.2.2 h20bf706_0
filelock 3.8.0 pypi_0 pypi
freetype 2.11.0 h70c0345_0
frozenlist 1.3.1 pypi_0 pypi
fsspec 2022.7.1 pypi_0 pypi
future 0.18.2 pypi_0 pypi
giflib 5.2.1 h7b6447c_0
gitdb 4.0.9 pypi_0 pypi
gitpython 3.1.27 pypi_0 pypi
gmp 6.2.1 h295c915_3
gnutls 3.6.15 he1e5248_0
google-auth 2.10.0 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
grpcio 1.43.0 pypi_0 pypi
huggingface-hub 0.8.1 pypi_0 pypi
idna 3.3 pypi_0 pypi
importlib-metadata 4.12.0 pypi_0 pypi
importlib-resources 5.9.0 pypi_0 pypi
intel-openmp 2021.4.0 h06a4308_3561
ipykernel 6.9.1 py38h06a4308_0
ipython 8.4.0 py38h06a4308_0
jedi 0.18.1 py38h06a4308_1
jpeg 9e h7f8727e_0
jsonschema 4.10.0 pypi_0 pypi
jupyter_client 7.2.2 py38h06a4308_0
jupyter_core 4.10.0 py38h06a4308_0
lame 3.100 h7b6447c_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libdeflate 1.8 h7f8727e_5
libedit 3.1.20210910 h7f8727e_0
libffi 3.3 he6710b0_2
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libidn2 2.3.2 h7f8727e_0
libopus 1.3.1 h7b6447c_0
libpng 1.6.37 hbc83047_0
libsodium 1.0.18 h7b6447c_0
libstdcxx-ng 11.2.0 h1234567_1
libtasn1 4.16.0 h27cfd23_0
libtiff 4.4.0 hecacb30_0
libunistring 0.9.10 h27cfd23_0
libuv 1.40.0 h7b6447c_0
libvpx 1.7.0 h439df22_0
libwebp 1.2.2 h55f646e_0
libwebp-base 1.2.2 h7f8727e_0
lz4-c 1.9.3 h295c915_1
markdown 3.4.1 pypi_0 pypi
markupsafe 2.1.1 pypi_0 pypi
matplotlib-inline 0.1.2 pyhd3eb1b0_2
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py38h7f8727e_0
mkl_fft 1.3.1 py38hd3c417c_0
mkl_random 1.2.2 py38h51133e4_0
msgpack 1.0.4 pypi_0 pypi
multidict 6.0.2 pypi_0 pypi
multiprocess 0.70.13 pypi_0 pypi
ncurses 6.3 h5eee18b_3
nest-asyncio 1.5.5 py38h06a4308_0
nettle 3.7.3 hbbd107a_1
numpy 1.23.1 py38h6c91a56_0
numpy-base 1.23.1 py38ha15fc14_0
oauthlib 3.2.0 pypi_0 pypi
openh264 2.1.1 h4ff587b_0
openssl 1.1.1q h7f8727e_0
packaging 21.3 pypi_0 pypi
pandas 1.4.3 pypi_0 pypi
parso 0.8.3 pyhd3eb1b0_0
pexpect 4.8.0 pyhd3eb1b0_3
pickleshare 0.7.5 pyhd3eb1b0_1003
pillow 9.2.0 py38hace64e9_1
pip 22.1.2 py38h06a4308_0
pkgutil-resolve-name 1.3.10 pypi_0 pypi
platformdirs 2.5.2 pypi_0 pypi
prompt-toolkit 3.0.20 pyhd3eb1b0_0
protobuf 3.19.4 pypi_0 pypi
psutil 5.9.1 pypi_0 pypi
ptyprocess 0.7.0 pyhd3eb1b0_2
pure_eval 0.2.2 pyhd3eb1b0_0
pyarrow 9.0.0 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pydeprecate 0.3.1 pypi_0 pypi
pygments 2.11.2 pyhd3eb1b0_0
pyparsing 3.0.9 pypi_0 pypi
pyrsistent 0.18.1 pypi_0 pypi
python 3.8.13 h12debd9_0
python-dateutil 2.8.2 pyhd3eb1b0_0
pytorch 1.10.1 py3.8_cuda11.3_cudnn8.2.0_0 pytorch
pytorch-lightning 1.5.10 pypi_0 pypi
pytorch-mutex 1.0 cuda pytorch
pytz 2022.2.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
pyzmq 23.2.0 py38h6a678d5_0
ray 1.13.0 pypi_0 pypi
readline 8.1.2 h7f8727e_1
regex 2022.7.25 pypi_0 pypi
requests 2.28.1 pypi_0 pypi
requests-oauthlib 1.3.1 pypi_0 pypi
responses 0.18.0 pypi_0 pypi
rsa 4.9 pypi_0 pypi
setuptools 59.5.0 pypi_0 pypi
six 1.16.0 pyhd3eb1b0_1
smmap 5.0.0 pypi_0 pypi
sqlite 3.39.2 h5082296_0
stack_data 0.2.0 pyhd3eb1b0_0
tensorboard 2.10.0 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.1 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
tokenizers 0.12.1 pypi_0 pypi
torchaudio 0.10.1 py38_cu113 pytorch
torchmetrics 0.6.0 pypi_0 pypi
torchvision 0.11.2 py38_cu113 pytorch
tornado 6.1 py38h27cfd23_0
tqdm 4.64.0 pypi_0 pypi
traitlets 5.1.1 pyhd3eb1b0_0
transformers 4.21.1 pypi_0 pypi
typing_extensions 4.3.0 py38h06a4308_0
urllib3 1.26.11 pypi_0 pypi
virtualenv 20.16.3 pypi_0 pypi
wcwidth 0.2.5 pyhd3eb1b0_0
werkzeug 2.2.2 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
x264 1!157.20191217 h7b6447c_0
xxhash 3.0.0 pypi_0 pypi
xz 5.2.5 h7f8727e_1
yarl 1.8.1 pypi_0 pypi
zeromq 4.3.4 h2531618_0
zipp 3.8.1 pypi_0 pypi
zlib 1.2.12 h7f8727e_2
zstd 1.5.2 ha4553b6_0
Who can help?
@patrickvonplaten @lhoestq
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Hi, I'm trying to run the RAG finetuning script with pytorch-lightning==1.5.10.
Below is my training command:
python examples/research_projects/rag/finetune_rag.py \
--data_dir '/home/keonwookim/transformers/examples/research_projects/rag/finetune_data' \
--output_dir './finetune_rag_token' \
--model_name_or_path facebook/rag-token-base \
--model_type rag_token \
--do_train \
--do_predict \
--train_batch_size 1 \
--accumulate_grad_batches 1 \
--num_retrieval_workers 4 \
--max_epochs 3 \
--fp16 \
--gpus 2
However, I've been struggling with the following error for days.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [2,3]
/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:631: UserWarning: Checkpoint directory /home/keonwookim/transformers/finetune_rag_token exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
Traceback (most recent call last):
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1311, in _run_train
self._run_sanity_check(self.lightning_module)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1368, in _run_sanity_check
self.call_hook("on_sanity_check_start")
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1495, in call_hook
callback_fx(*args, **kwargs)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 78, in on_sanity_check_start
callback.on_sanity_check_start(self, self.lightning_module)
File "/home/keonwookim/transformers/examples/research_projects/rag/lightning_base.py", line 275, in on_sanity_check_start
pl_module.model.rag.retriever.init_retrieval() # better to use hook functions.
TypeError: init_retrieval() missing 1 required positional argument: 'distributed_port'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "examples/research_projects/rag/finetune_rag.py", line 649, in <module>
main(args)
File "examples/research_projects/rag/finetune_rag.py", line 613, in main
trainer: pl.Trainer = generic_train(
File "/home/keonwookim/transformers/examples/research_projects/rag/lightning_base.py", line 405, in generic_train
trainer.fit(model)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 740, in fit
self._call_and_handle_interrupt(
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 698, in _call_and_handle_interrupt
self.training_type_plugin.reconciliate_processes(traceback.format_exc())
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 533, in reconciliate_processes
raise DeadlockDetectedException(f"DeadLock detected from rank: {self.global_rank} \n {trace}")
pytorch_lightning.utilities.exceptions.DeadlockDetectedException: DeadLock detected from rank: 0
Traceback (most recent call last):
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1311, in _run_train
self._run_sanity_check(self.lightning_module)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1368, in _run_sanity_check
self.call_hook("on_sanity_check_start")
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1495, in call_hook
callback_fx(*args, **kwargs)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 78, in on_sanity_check_start
callback.on_sanity_check_start(self, self.lightning_module)
File "/home/keonwookim/transformers/examples/research_projects/rag/lightning_base.py", line 275, in on_sanity_check_start
pl_module.model.rag.retriever.init_retrieval() # better to use hook functions.
TypeError: init_retrieval() missing 1 required positional argument: 'distributed_port'
In my opinion, it seems that InitCallback() in lightning_base.py is used, which is intended for the Ray distributed retriever.
Expected behavior
Maybe CustomDDP() should be used for training instead, since it passes self.distributed_port as an argument to init_retrieval().
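Something along these lines is what I mean — a rough sketch of a CustomDDP-style plugin, not the exact code from the old example (it assumes PL 1.5's DDPPlugin still exposes a setup_distributed hook and that the module keeps distributed_port in its hparams):

from pytorch_lightning.plugins import DDPPlugin


class CustomDDP(DDPPlugin):
    def setup_distributed(self):
        # Let PL set up the default process group first.
        super().setup_distributed()
        module = self.lightning_module
        # `is_rag_model` is a hypothetical flag on the finetuning module;
        # `distributed_port` mirrors the hparam used by finetune_rag.py.
        if getattr(module, "is_rag_model", False):
            module.model.rag.retriever.init_retrieval(module.hparams.distributed_port)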
Hi @kaiiwoo, I think you can try changing the code below
https://github.com/huggingface/transformers/blob/30992ef0d911bdeca425969d210771118a5cd1ac/examples/research_projects/rag/lightning_base.py#L275
to
pl_module.model.rag.retriever.init_retrieval(pl_module.hparams.distributed_port)
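In context, the callback hook would then look roughly like this (paraphrased, not copied verbatim from lightning_base.py):

from pytorch_lightning import Callback


class InitCallback(Callback):
    def on_sanity_check_start(self, trainer, pl_module):
        # Pass the configured port so the PyTorch distributed retriever
        # can build its gloo process group.
        pl_module.model.rag.retriever.init_retrieval(pl_module.hparams.distributed_port)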
Hi @aRyBernAlTEglOTRO,
Thanks for your suggestion.
I've tried that before, but it takes a very long time to initialize the distributed retriever and eventually fails with a socket error.
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [2,3]
/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:631: UserWarning: Checkpoint directory /home/keonwookim/transformers/finetune_rag_token exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
INFO:distributed_pytorch_retriever:initializing retrieval
INFO:distributed_pytorch_retriever:dist initialized
Traceback (most recent call last):
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
self._dispatch()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1311, in _run_train
self._run_sanity_check(self.lightning_module)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1368, in _run_sanity_check
self.call_hook("on_sanity_check_start")
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1495, in call_hook
callback_fx(*args, **kwargs)
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 78, in on_sanity_check_start
callback.on_sanity_check_start(self, self.lightning_module)
File "/home/keonwookim/transformers/examples/research_projects/rag/lightning_base.py", line 275, in on_sanity_check_start
pl_module.model.rag.retriever.init_retrieval(pl_module.hparams.distributed_port) # better to use hook functions.
File "/home/keonwookim/transformers/examples/research_projects/rag/distributed_pytorch_retriever.py", line 66, in init_retrieval
self.process_group = dist.new_group(ranks=None, backend="gloo")
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2898, in new_group
pg = _new_process_group_helper(
File "/home/keonwookim/anaconda3/envs/odqa/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 684, in _new_process_group_helper
pg = ProcessGroupGloo(prefix_store, rank, world_size, timeout=timeout)
RuntimeError: Socket Timeout
Hi @kaiiwoo, I think there is no problem with the code, so please disregard the change I suggested earlier. I think you forgot to pass --distributed_retriever ray when running examples/research_projects/rag/finetune_rag.py; maybe you can try that. Compared to the original CustomDDP, the current InitCallback only works when distributed_retriever is set to ray.
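For reference, the original command with that flag added would look something like this:
python examples/research_projects/rag/finetune_rag.py \
--data_dir '/home/keonwookim/transformers/examples/research_projects/rag/finetune_data' \
--output_dir './finetune_rag_token' \
--model_name_or_path facebook/rag-token-base \
--model_type rag_token \
--distributed_retriever ray \
--do_train \
--do_predict \
--train_batch_size 1 \
--accumulate_grad_batches 1 \
--num_retrieval_workers 4 \
--max_epochs 3 \
--fp16 \
--gpus 2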
Hi, I am the person who revamped the RAG code base for the latest PL. Please follow @aRyBernAlTEglOTRO's instructions; the current version only works with Ray.
https://github.com/shamanez/transformers/blob/main/examples/research_projects/rag/lightning_base.py#L396
The main reason is that the latest PyTorch Lightning discourages the use of custom plugins, so please use Ray as the default distributed retriever.
@patrickvonplaten Shall we update the README?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
We don't maintain the research folders actively. Feel free to open a PR though if you want @shamanez - happy to review and merge