accelerator.end_training() is generating exception when wandb is being used as tracker
System Info
- `Accelerate` version: 0.15.0
- Platform: macOS-13.1-arm64-i386-64bit
- Python version: 3.9.15
- Numpy version: 1.24.0
- PyTorch version (GPU?): 1.13.1 (False)
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MPS
- mixed_precision: bf16
- use_cpu: False
- dynamo_backend: NO
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- gpu_ids: None
- main_process_ip: None
- main_process_port: None
- rdzv_backend: static
- same_network: True
- main_training_function: main
- deepspeed_config: {}
- fsdp_config: {}
- megatron_lm_config: {}
- downcast_bf16: no
- tpu_name: None
- tpu_zone: None
- command_file: None
- commands: None
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)
Reproduction
I am initializing my accelerator tracker this way:
if args.with_tracking:
    experiment_config = vars(args)
    experiment_config["lr_scheduler_type"] = experiment_config["lr_scheduler_type"]
    wandb.login(key=os.environ.get("WANDB_API_KEY"))
    accelerator.init_trackers(
        project_name=os.environ.get("WANDB_PROJECT_NAME"),
        config=experiment_config,
        init_kwargs={
            "wandb": {
                "job_type": "train",
                "entity": os.environ.get("WANDB_ENTITY_NAME"),
                "name": get_training_job_name(),
            }
        },
    )
and finishing my experiment this way:

if args.with_tracking:
    accelerator.end_training()
The training runs to completion and the wandb run finishes, but at the very end the following exception is thrown:
Exception in thread SockSrvRdThr:
Traceback (most recent call last):
File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/site-packages/wandb/sdk/service/server_sock.py", line 112, in run
shandler(sreq)
File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/site-packages/wandb/sdk/service/server_sock.py", line 173, in server_record_publish
iface = self._mux.get_stream(stream_id).interface
File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/site-packages/wandb/sdk/service/streams.py", line 199, in get_stream
stream = self._streams[stream_id]
KeyError: '3lxi4eq2'
where the key `3lxi4eq2` is actually the wandb run id.
Expected behavior
Exception should not be thrown at `accelerator.end_training()`
cc @muellerzr
+1. I am facing the same issue.
@somepago any chance you could give some more information on your setup or script? I haven't been able to recreate this quite yet.
Are we launching it from Jupyter or the terminal?
Hello,
I am facing the same problem. I trained a HF Transformers model using accelerate multi-GPU (2 GPUs, no additional optimizations) and ran my script from a terminal.
By the way, the wandb interface shows all my training information; only the `end_training()` call failed.
If you need additional information, I can help.
Hi,
I am hitting the same problem when running the `run_glue_no_trainer.py` script.
Here is my command:
export WANDB_API_KEY="xxxx"
accelerate launch run_glue_no_trainer.py \
    --model_name_or_path bert-base-cased \
    --task_name sst2 \
    --max_length 128 \
    --per_device_train_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 1 \
    --output_dir ../checkpoint/sst2 \
    --with_tracking \
    --report_to wandb
The version of accelerate is 0.15.0. The version of wandb is 0.13.2.
Same problem
Python 3.10.8
accelerate==0.15.0
wandb==0.13.9
We've reached out to the W&B folks, we should have a solution soon!
I'm having the same issue
Trying to reproduce this, but discovered that I am unable to tap into the mps GPU anymore using accelerate...
Same issue here. I'm using accelerate 0.15.0
Solved it by passing a `settings` keyword inside `init_kwargs` and then passing it to `accelerator.init_trackers`. If running in Colab, use "thread" instead of "fork" as the start method. Reference link

init_kwargs = {
    "wandb": {
        "group": wandb_dict["group_name"],
        "name": wandb_dict["display_name"],
        "settings": wandb.Settings(start_method="fork"),
    }
}
accelerator.init_trackers(wandb_dict["project_name"], config=parameter, init_kwargs=init_kwargs)
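For what it's worth, the start-method choice can be made conditional on the environment. A minimal sketch (the helper name is hypothetical; only the "thread" vs "fork" values come from the workaround above):

```python
# Pick wandb's multiprocessing start method: "thread" for Colab/notebooks
# (where the default "fork" can conflict with wandb's service process),
# "fork" for plain scripts launched from a terminal.
def wandb_start_method(in_notebook: bool) -> str:
    return "thread" if in_notebook else "fork"

# Usage sketch: settings=wandb.Settings(start_method=wandb_start_method(False))
```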
@nabarunbaruaAIML thanks for the pointer! Will pass along to the W&B team. They should have a fix by the next release as they've identified the problem as well.
The workaround they suggested was disabling the console for now:
init_kwargs = {"wandb":{"settings":{"console": "off"}}}
accelerator.init_trackers("glue_no_trainer", experiment_config, init_kwargs=init_kwargs)
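To make that workaround easier to reuse alongside other wandb options, it can be wrapped in a small helper. This is just a sketch; the helper name and its `extra` parameter are my own, not part of accelerate's or wandb's API:

```python
# Build init_kwargs for accelerator.init_trackers with wandb's console
# capture disabled (the suggested workaround for the SockSrvRdThr KeyError
# raised at end_training()). Helper name is hypothetical.
def wandb_init_kwargs(extra=None):
    wandb_opts = {"settings": {"console": "off"}}
    if extra:
        wandb_opts.update(extra)  # e.g. {"name": ..., "job_type": "train"}
    return {"wandb": wandb_opts}

# Usage sketch (assuming an Accelerator created with log_with="wandb"):
# accelerator.init_trackers("glue_no_trainer", experiment_config,
#                           init_kwargs=wandb_init_kwargs({"job_type": "train"}))
```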
@nabarunbaruaAIML - Can you share the version of accelerate and wandb you are using? Thanks!
@somepago : I am using these versions accelerate=0.16.0 & wandb=0.13.9
The fix for this should be out now, let us know if you all are still seeing this issue