'from transformers import Trainer' can hinder multi-gpu training process on Jupyter.
System Info
- `Accelerate` version: 0.31.0
- Platform: Linux-5.11.0-41-generic-x86_64-with-glibc2.31
- `accelerate` bash location: ***
- Python version: 3.9.19
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.2+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 62.50 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: False
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: False
- tpu_use_cluster: False
- tpu_use_sudo: False
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)
Reproduction
1. Copy the official code from https://github.com/huggingface/notebooks/blob/main/examples/accelerate_examples/simple_cv_example.ipynb
2. Add `from transformers import Trainer` at the beginning of the notebook (see the sketch below).
Expected behavior
Multi-GPU training should run as in the original notebook.
How can I solve this problem? I want to import this package outside of the `notebook_launcher` function.
Can you please explain what you mean by "hinder"?
You need to use the notebook_launcher if you want to do distributed training in a notebook
Following my reproduction steps, an exception is raised at `notebook_launcher`:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
You need to move the `from transformers import Trainer` into the training function. Anything that touches CUDA init has to be done in there.
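A minimal sketch of this workaround, assuming a placeholder `training_loop`; any import that touches CUDA is moved inside the launched function:

```python
from accelerate import notebook_launcher

def training_loop():
    # Import inside the launched function so CUDA is only initialized
    # in the worker processes, not in the parent notebook process.
    from transformers import Trainer
    # ... rest of the training code goes here ...

notebook_launcher(training_loop, args=(), num_processes=2)
```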
Thanks a lot. But is there any solution that doesn't need to migrate code into `training_loop`? I find that `import peft` can also trigger this exception, and I want to do some operations outside the loop.
I find that ‘import peft’ can also trigger this exception
This has been fixed in PEFT (https://github.com/huggingface/peft/pull/1879) but the fix is not released yet. Installing from source should work though.
We'd need to investigate what exactly is causing a CUDA init in transformers.
Thank you. Now I can use `peft` outside of the training function, but for `transformers`, I have moved the code inside the function.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I forgot to mention, since PEFT release v0.12.0, this should be solved on the PEFT side.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.