
`AcceleratorState` object has no attribute `distributed_type`.

evelinamorim opened this issue 1 year ago · 4 comments

System Info

accelerate-0.30.1
Google Colab
numpy-1.25.2
torch-2.2.1+cu121

Python 3.10.12

Regarding the accelerate configuration: I am using the Trainer, which uses accelerate internally, and I do not touch the configuration.

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

The args.json file employed below is available to download at: https://drive.google.com/file/d/1H2MstSq_oz7Xv7spMZCppf39fHGs5rW0/view?usp=drive_link.

The dataset specified in the args.json is the file: https://drive.google.com/file/d/18OVilNSqQQogSMiepe87vtNmzYpCalCs/view?usp=drive_link

In Google Colab, I coded:

!git clone https://github.com/evelinamorim/Seq2seqCoref.git
!pip install -U transformers accelerate

import sys
sys.path.insert(1, "Seq2seqCoref")

from transformers import HfArgumentParser, set_seed
from transformers import AutoModelForSeq2SeqLM, \
    DataCollatorForSeq2Seq, AutoConfig, AutoTokenizer
from transformers.integrations import TensorBoardCallback

from arguments import DataArguments, ModelArguments, CorefTrainingArguments \
    as TrainingArguments
from constants import SPEAKER_START, SPEAKER_END, MENTION_START, MENTION_END, \
    COPY, CLUSTER_NEW, CLUSTERS, SENTENCE_START, SENTENCE_END, SPECIAL_IDS, \
    NON_INT_SPECIAL_IDS, MARK_SPECIAL_IDS, MENTION_END_NON_INT_SPECIAL_IDS, \
    MENTION_ENDS
from data import CorefDataset
from trainer import CorefTrainer
import os

parser = HfArgumentParser(
        (ModelArguments, DataArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_json_file(
        json_file=os.path.abspath("args.json"))

set_seed(training_args.seed)

# tokenizer setup
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)

num_new_tokens = tokenizer.add_tokens([SPEAKER_START, SPEAKER_END,
                                           MENTION_START, MENTION_END,
                                           COPY])
num_new_tokens += tokenizer.add_tokens([SENTENCE_START, SENTENCE_END])

# loading config and model
config = AutoConfig.from_pretrained(model_args.model_name_or_path)
model = AutoModelForSeq2SeqLM.from_pretrained(
        model_args.model_name_or_path, config=config)

# data objects
collator = DataCollatorForSeq2Seq(tokenizer, model=model)
train_set = CorefDataset(tokenizer, data_args, training_args, 'train')

tb_callback = TensorBoardCallback()
trainer = CorefTrainer(
        tokenizer=tokenizer,
        model=model,
        args=training_args,
        train_dataset=train_set,
        #        eval_dataset=dev_set,
        data_collator=collator,
        callbacks=[tb_callback]
    )

trainer.train()

The traceback error is:

AttributeError                            Traceback (most recent call last)
<ipython-input-16-3435b262f1ae> in <cell line: 1>()
----> 1 trainer.train()

5 frames
/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1857                 hf_hub_utils.enable_progress_bars()
   1858         else:
-> 1859             return inner_training_loop(
   1860                 args=args,
   1861                 resume_from_checkpoint=resume_from_checkpoint,

/content/Seq2seqCoref/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
    169         self._train_batch_size = batch_size
    170         # Data loader and number of training steps
--> 171         train_dataloader = self.get_train_dataloader()
    172 
    173         # Setting up training control variables:

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in get_train_dataloader(self)
    877             dataloader_params["prefetch_factor"] = self.args.dataloader_prefetch_factor
    878 
--> 879         return self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))
    880 
    881     def _get_eval_sampler(self, eval_dataset: Dataset) -> Optional[torch.utils.data.Sampler]:

/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py in prepare(self, device_placement, *args)
   1246                 )
   1247 
-> 1248         if self.distributed_type == DistributedType.DEEPSPEED:
   1249             model_count = 0
   1250             for obj in args:

/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py in distributed_type(self)
    527     @property
    528     def distributed_type(self):
--> 529         return self.state.distributed_type
    530 
    531     @property

/usr/local/lib/python3.10/dist-packages/accelerate/state.py in __getattr__(self, name)
   1074         # so we just modify the error message
   1075         if name in self._known_attrs:
-> 1076             raise AttributeError(
   1077                 f"`AcceleratorState` object has no attribute `{name}`. "
   1078                 "This happens if `AcceleratorState._reset_state()` was called and "

AttributeError: `AcceleratorState` object has no attribute `distributed_type`. This happens if `AcceleratorState._reset_state()` was called and an `Accelerator` or `PartialState` was not reinitialized.
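For reference, here is a minimal sketch of the condition the error message describes. It is my own illustration, not part of the Colab script, and it assumes accelerate 0.30.x behavior: once the shared state is cleared, a previously created Accelerator loses attributes like distributed_type until an Accelerator or PartialState is constructed again.

from accelerate import Accelerator
from accelerate.state import AcceleratorState

accelerator = Accelerator()
print(accelerator.distributed_type)   # works: the shared state is initialized

# internal call that clears the shared state (what the error message refers to);
# I never call this myself in the script above
AcceleratorState._reset_state()
# accelerator.distributed_type        # would now raise the AttributeError shown above

# constructing a new Accelerator repopulates the shared state,
# so the old object works again
Accelerator()
print(accelerator.distributed_type)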

Expected behavior

The model should start training when trainer.train() is called at the end of the script.

evelinamorim avatar May 16 '24 12:05 evelinamorim

What is CorefTrainer? Does it make an AcceleratorState or PartialState or something? As the error hints at, somewhere along the line the state was reset without being reinitialized afterwards.

muellerzr avatar May 16 '24 14:05 muellerzr

I am sorry I did not specify CorefTrainer. I am using a custom trainer (you can check it in this link).

This custom trainer is a subclass of Seq2SeqTrainer. None of the functions implemented in the custom trainer reset the AcceleratorState. I went through all of Seq2SeqTrainer and Trainer, and the only place I could identify that creates the accelerator for a trainer object is the create_accelerator_and_postprocess method. I do not know if I must provide some configuration to avoid this error.
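To narrow this down on my side, here is a debugging sketch I can add to the Colab script (just a check, not a fix): inspect accelerate's shared state right after constructing the trainer and again immediately before train(), to see at which point it gets cleared.

from accelerate.state import AcceleratorState

# right after CorefTrainer(...) is constructed, the Trainer's accelerator
# should have populated the shared state:
print(AcceleratorState._shared_state)   # expected: non-empty dict

# ... any other setup that runs before training ...

# immediately before training; an empty dict here means something in between
# reset the state without reinitializing it:
print(AcceleratorState._shared_state)
trainer.train()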

evelinamorim avatar May 16 '24 15:05 evelinamorim

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 15 '24 15:06 github-actions[bot]

Are there any updates on this issue? I am having the same problem, with the same package versions.

shuafriedman avatar Jun 21 '24 08:06 shuafriedman

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 15 '24 15:07 github-actions[bot]

Hi there! Are there any updates? I also met this problem.

FluorumSoc avatar Sep 02 '24 16:09 FluorumSoc

@FluorumSoc can you give me a full reproducer? Will be needed to help debug where something is being reset w/o an initialization

muellerzr avatar Sep 02 '24 18:09 muellerzr

@muellerzr Thanks for the reply! I ran into the problem after moving super().__init__() in trl's sft_trainer.py from around line 400 up to line 148, right after SFTTrainer's own __init__() starts. I did this so that all the args are passed to transformers.Trainer directly instead of being modified by SFTConfig.

If I move super().__init__() back to its original place but leave out the args argument, the problem disappears. But if I pass args=args in the original place (around line 400), all of my args are lost; and if I do not pass it, I hit another problem: RuntimeError: unscale_() has already been called on this optimizer since the last update(). I had met that error earlier, read many issues about it, and changed some code as they suggested, but in vain. A few issues said it happens when the fine-tuning dataset is too small. I have not figured out what to do yet.

LLM: https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese/tree/main
Using Llama-7b: https://huggingface.co/huggyllama/llama-7b/tree/main

super().__init__(  # from trl/sft_trainer.py; originally around line 412, moved to line 148
    model=model,
    # args=args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    model_init=model_init,
    compute_metrics=compute_metrics,
    callbacks=callbacks,
    optimizers=optimizers,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)

env:

accelerate 0.34.0.dev0
transformers 4.45.0.dev0
trl 0.10.1
torch 2.4.0
Python 3.10

FluorumSoc avatar Sep 02 '24 18:09 FluorumSoc

Sorry, I have just started using GitHub and am not very skilled with it; apologies for the messy formatting.

FluorumSoc avatar Sep 02 '24 18:09 FluorumSoc