accelerator.end_training() is generating exception when wandb is being used as tracker

Open DuttaSamarpan opened this issue 2 years ago • 15 comments

System Info

- `Accelerate` version: 0.15.0
- Platform: macOS-13.1-arm64-i386-64bit
- Python version: 3.9.15
- Numpy version: 1.24.0
- PyTorch version (GPU?): 1.13.1 (False)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MPS
        - mixed_precision: bf16
        - use_cpu: False
        - dynamo_backend: NO
        - num_processes: 1
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: None
        - main_process_ip: None
        - main_process_port: None
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: no
        - tpu_name: None
        - tpu_zone: None
        - command_file: None
        - commands: None

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [ ] My own task or dataset (give details below)

Reproduction

I am initiating my accelerator tracker in this way

    if args.with_tracking:
        experiment_config = vars(args)
        experiment_config["lr_scheduler_type"] = experiment_config["lr_scheduler_type"]
        wandb.login(key=os.environ.get("WANDB_API_KEY"))
        accelerator.init_trackers(
            project_name=os.environ.get('WANDB_PROJECT_NAME'),
            config=experiment_config,
            init_kwargs={
                "wandb": {
                    "job_type": "train",
                    "entity": os.environ.get('WANDB_ENTITY_NAME'),
                    "name": get_training_job_name()
                }
            }
        )

and finishing my experiment in this way

    if args.with_tracking:
        accelerator.end_training()

Training runs to completion and the wandb run also finishes, but at the very end the following exception is thrown:

    Exception in thread SockSrvRdThr:
    Traceback (most recent call last):
      File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/threading.py", line 980, in _bootstrap_inner
        self.run()
      File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/site-packages/wandb/sdk/service/server_sock.py", line 112, in run
        shandler(sreq)
      File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/site-packages/wandb/sdk/service/server_sock.py", line 173, in server_record_publish
        iface = self._mux.get_stream(stream_id).interface
      File "/Users/samarpandutta/miniforge3/envs/banjo-accelerate-demo/lib/python3.9/site-packages/wandb/sdk/service/streams.py", line 199, in get_stream
        stream = self._streams[stream_id]
    KeyError: '3lxi4eq2'

Here the key `3lxi4eq2` is the wandb run ID.
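The last frame of the traceback is a plain dictionary lookup keyed by the run ID, so one plausible reading is that a record arrived for a run whose stream had already been removed. The following is a toy model of that failure mode, not wandb's actual code: `StreamMux`, `add_stream`, `drop_stream`, and the record strings are hypothetical names used only for illustration.

```python
class StreamMux:
    """Toy stand-in for a registry that maps run IDs to open streams."""

    def __init__(self):
        self._streams = {}

    def add_stream(self, stream_id):
        self._streams[stream_id] = []  # records received for this run

    def drop_stream(self, stream_id):
        # Called when the run finishes; the run ID is unknown afterwards.
        self._streams.pop(stream_id, None)

    def get_stream(self, stream_id):
        # Same shape as the failing line in the traceback: a dict lookup.
        return self._streams[stream_id]


mux = StreamMux()
mux.add_stream("3lxi4eq2")
mux.get_stream("3lxi4eq2").append("log line during training")

mux.drop_stream("3lxi4eq2")  # the run has finished

try:
    # A late record (for example, captured console output) arrives after teardown.
    mux.get_stream("3lxi4eq2").append("late log line")
    late_record_error = None
except KeyError as exc:
    late_record_error = exc

print(repr(late_record_error))
```

If this reading is right, the error is raised after the run data has already been uploaded, which would match the reports that the wandb run itself looks complete.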

Expected behavior

Exception should not be thrown at `accelerator.end_training()`

DuttaSamarpan avatar Dec 25 '22 04:12 DuttaSamarpan

cc @muellerzr

sgugger avatar Dec 26 '22 06:12 sgugger

+1. I am facing the same issue.

somepago avatar Dec 28 '22 20:12 somepago

@somepago any chance you could give some more information on your setup or script? I haven't been able to recreate this quite yet.

Are we launching it from Jupyter or the terminal?

muellerzr avatar Jan 03 '23 15:01 muellerzr

Hello,

I am facing the same problem. I trained a HF Transformers model using accelerate on multiple GPUs (2 GPUs, no additional optimizations), and I run my script from a terminal.

By the way, the wandb interface shows all my training information; only the end_training() call failed.

If you want additional information, I can help.

cloud441 avatar Jan 06 '23 09:01 cloud441

Hi,

I hit the same problem when running the "run_glue_no_trainer.py" script.

Here is my script.

    export WANDB_API_KEY="xxxx"

    accelerate launch run_glue_no_trainer.py \
        --model_name_or_path bert-base-cased \
        --task_name sst2 \
        --max_length 128 \
        --per_device_train_batch_size 32 \
        --learning_rate 2e-5 \
        --num_train_epochs 1 \
        --output_dir ../checkpoint/sst2 \
        --with_tracking \
        --report_to wandb

The version of accelerate is 0.15.0. The version of wandb is 0.13.2.

yjw1029 avatar Jan 07 '23 14:01 yjw1029

Same problem

Python 3.10.8

accelerate==0.15.0
wandb==0.13.9

detkov avatar Jan 18 '23 16:01 detkov

We've reached out to the W&B folks, we should have a solution soon!

muellerzr avatar Jan 18 '23 17:01 muellerzr

I'm having the same issue

hmartiro avatar Jan 26 '23 23:01 hmartiro

Trying to reproduce this, but discovered that I am unable to tap into mps GPU anymore using accelerate...

tcapelle avatar Jan 30 '23 02:01 tcapelle

Same issue here. I'm using accelerate 0.15.0

duxiaodan avatar Jan 31 '23 23:01 duxiaodan

Solved it by adding a settings keyword to init_kwargs and then passing init_kwargs to accelerator.init_trackers. If running in Colab, use "thread" instead of "fork" as the start method. Reference link

    init_kwargs = {"wandb": {
        "group": wandb_dict['group_name'],
        "name": wandb_dict['display_name'],
        "settings": wandb.Settings(start_method="fork"),
    }}
    accelerator.init_trackers(wandb_dict['project_name'], config=parameter, init_kwargs=init_kwargs)
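The choice between "fork" and "thread" could be made at runtime. The sketch below is only an illustration under stated assumptions: `pick_start_method` and `build_wandb_init_kwargs` are hypothetical helpers, Colab is detected by checking whether the `google.colab` package is already loaded, and the settings are given as a plain dict rather than a `wandb.Settings` object so the snippet runs without wandb installed. How wandb treats `start_method` may differ between versions.

```python
import sys


def pick_start_method():
    """Return "thread" when running inside Colab, otherwise "fork"."""
    # Assumption: Colab preloads the google.colab package, so its presence
    # in sys.modules is used here as the detection heuristic.
    return "thread" if "google.colab" in sys.modules else "fork"


def build_wandb_init_kwargs(group, name):
    """Build the init_kwargs dict to hand to accelerator.init_trackers."""
    return {
        "wandb": {
            "group": group,
            "name": name,
            # Plain dict form of the settings (an assumption of this sketch).
            "settings": {"start_method": pick_start_method()},
        }
    }


init_kwargs = build_wandb_init_kwargs("my-group", "my-run")
print(init_kwargs["wandb"]["settings"])
```

The resulting dict would then be passed as `init_kwargs=init_kwargs` in the `accelerator.init_trackers` call.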

nabarunbaruaAIML avatar Feb 01 '23 15:02 nabarunbaruaAIML

@nabarunbaruaAIML thanks for the pointer! Will pass along to the W&B team. They should have a fix by the next release as they've identified the problem as well.

The workaround they suggested was disabling the console for now:

        init_kwargs = {"wandb":{"settings":{"console": "off"}}}
        accelerator.init_trackers("glue_no_trainer", experiment_config, init_kwargs=init_kwargs)
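For scripts that already pass their own init_kwargs (job type, entity, run name), the console setting can be merged in rather than replacing the whole dict. This is a minimal sketch, assuming the settings are supplied as a plain dict as in the workaround above; `with_console_off` is a hypothetical helper name.

```python
import copy


def with_console_off(init_kwargs=None):
    """Return a copy of init_kwargs with wandb's console setting set to "off"."""
    merged = copy.deepcopy(init_kwargs) if init_kwargs else {}
    wandb_kwargs = merged.setdefault("wandb", {})
    # Assumes any existing settings value is a plain dict, not a Settings object.
    settings = wandb_kwargs.setdefault("settings", {})
    settings["console"] = "off"
    return merged


existing = {"wandb": {"job_type": "train", "name": "my-run"}}
patched = with_console_off(existing)
print(patched)
# {'wandb': {'job_type': 'train', 'name': 'my-run', 'settings': {'console': 'off'}}}
```

The original dict is left untouched, so the same base configuration can be reused once the workaround is no longer needed.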

muellerzr avatar Feb 01 '23 15:02 muellerzr

@nabarunbaruaAIML - Can you share the version of accelerate and wandb you are using? Thanks!
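One quick way to collect both versions for a report like this is the standard library's `importlib.metadata`. This is a small sketch; `installed_versions` is a hypothetical helper, and it returns `None` for any package that is not installed.

```python
from importlib import metadata


def installed_versions(packages=("accelerate", "wandb")):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions())
```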

somepago avatar Feb 01 '23 19:02 somepago

@somepago: I am using accelerate==0.16.0 and wandb==0.13.9.

nabarunbaruaAIML avatar Feb 01 '23 19:02 nabarunbaruaAIML

The fix for this should be out now. Let us know if you are still seeing this issue.

muellerzr avatar Feb 28 '23 15:02 muellerzr