
'from transformers import Trainer' can hinder multi-GPU training in Jupyter.

Open SHEN2BAIYI opened this issue 1 year ago • 8 comments

System Info

- `Accelerate` version: 0.31.0
- Platform: Linux-5.11.0-41-generic-x86_64-with-glibc2.31
- `accelerate` bash location: ***
- Python version: 3.9.19
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.2+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 62.50 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: MULTI_GPU
	- mixed_precision: no
	- use_cpu: False
	- debug: False
	- num_processes: 2
	- machine_rank: 0
	- num_machines: 1
	- rdzv_backend: static
	- same_network: False
	- main_training_function: main
	- enable_cpu_affinity: False
	- downcast_bf16: False
	- tpu_use_cluster: False
	- tpu_use_sudo: False

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [ ] My own task or dataset (give details below)

Reproduction

1. Copy the official code from https://github.com/huggingface/notebooks/blob/main/examples/accelerate_examples/simple_cv_example.ipynb

2. Add `from transformers import Trainer` at the beginning of the notebook (see the sketch below).

Expected behavior

Multi-GPU training runs in the notebook without errors.

SHEN2BAIYI avatar Jul 03 '24 07:07 SHEN2BAIYI

How can I solve this while keeping the import outside of the function passed to `notebook_launcher`?

SHEN2BAIYI avatar Jul 03 '24 08:07 SHEN2BAIYI

Please explain what you mean by "hinder"?

You need to use the notebook_launcher if you want to do distributed training in a notebook

muellerzr avatar Jul 03 '24 10:07 muellerzr

Following my reproduction steps, an exception is raised at `notebook_launcher`:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

SHEN2BAIYI avatar Jul 03 '24 11:07 SHEN2BAIYI
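
The error means CUDA was already initialized in the parent (notebook) process before `notebook_launcher` forked its workers. A quick diagnostic (a sketch, not an official tool) is to check `torch.cuda.is_initialized()` around the suspect import in a fresh kernel:

```python
import torch

# In a fresh kernel no CUDA context should exist yet.
print(torch.cuda.is_initialized())  # expected: False

from transformers import Trainer  # the import under suspicion

# If this now prints True, the import created a CUDA context in this process,
# and a fork-based launch (what notebook_launcher uses in a notebook) will fail.
print(torch.cuda.is_initialized())
```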

You need to move the `from transformers import Trainer` into the training function. Anything that touches CUDA init has to be done in there.

muellerzr avatar Jul 03 '24 11:07 muellerzr
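
A minimal sketch of the suggested fix, using the same illustrative `training_loop` name as above:

```python
from accelerate import notebook_launcher


def training_loop():
    # Deferring the import keeps CUDA untouched in the parent process; only
    # the forked workers ever initialize it, which is allowed.
    from transformers import Trainer

    # ... build the model, dataloaders, and Trainer here ...


notebook_launcher(training_loop, num_processes=2)
```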

Thanks a lot. But is there any solution that doesn't require migrating code into `training_loop`? I find that `import peft` can also trigger this exception, and I want to do some operations outside the loop.

SHEN2BAIYI avatar Jul 03 '24 11:07 SHEN2BAIYI

I find that ‘import peft’ can also trigger this exception

This has been fixed in PEFT (https://github.com/huggingface/peft/pull/1879) but the fix is not released yet. Installing from source should work though.

BenjaminBossan avatar Jul 03 '24 12:07 BenjaminBossan

We'd need to investigate what exactly is causing a CUDA init in transformers.

muellerzr avatar Jul 03 '24 15:07 muellerzr

Thank you. Now I can use peft outside of the training function; for transformers, I have moved the code inside the function.

SHEN2BAIYI avatar Jul 04 '24 03:07 SHEN2BAIYI

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Aug 02 '24 15:08 github-actions[bot]

I forgot to mention: since PEFT release v0.12.0, this is solved on the PEFT side.

BenjaminBossan avatar Aug 05 '24 11:08 BenjaminBossan
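
One way to verify that the installed PEFT carries the fix (a minimal check based on the v0.12.0 release mentioned above; uses the common `packaging` library):

```python
import peft
from packaging import version

# PEFT >= 0.12.0 no longer initializes CUDA at import time, so importing it
# before notebook_launcher forks its workers is safe.
assert version.parse(peft.__version__) >= version.parse("0.12.0"), (
    f"peft {peft.__version__} predates the import-time CUDA fix"
)
```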
