
Ray[Tune] ValueError: checkpoint not in list (still persists with latest version of transformers)

Open mvillarreal14 opened this issue 1 year ago • 3 comments

System Info

  • transformers version: 4.27.3
  • Platform: Linux-5.10.112-108.499.amzn2.x86_64-x86_64-with-glibc2.17
  • Python version: 3.8.11
  • Huggingface_hub version: 0.12.0
  • PyTorch version (GPU?): 1.10.1+cu102 (True)
  • Tensorflow version (GPU?): 2.9.1 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Snippets of Configuration File:

pbt_scheduler:
  time_attr: "training_iteration"
  metric: "eval_f1"
  mode: "max"
  synch: true
  hyperparam_mutations: {"weight_decay": [0.0, 0.1], "learning_rate": [0.0000001, 0.0001]}
  perturbation_interval: 4

param_search:
  hp_spance: {"checkpoint_interval": 4, "per_device_train_batch_size": 8, "per_device_eval_batch_size": 8, "num_train_epochs": [512], "max_steps": -1}
  backend: "ray"
  n_trials: 8
  max_concurrent_trials: 8
  resources_per_trial: {"cpu": 12, "gpu": 1}
  scheduler: "pbt"
  keep_checkpoints_num: 1
  checkpoint_score_attr: "eval_f1"
  checkpoint_at_end: true
  max_failures: 5
  resume: false
  stop: {"eval_f1": 0.85, "training_iteration": 512}
  local_dir: "./_logs/tune_RAY"
  name: "tune_transformer_pbt"
  log_to_file: false
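
The attribute-style access used in the code below (e.g. config.pbt_scheduler.time_attr) suggests an OmegaConf/Hydra-style config object. Purely as an illustration (the post does not say which config library is used, so both the library and the file name here are assumptions), loading such a YAML could look like:

from omegaconf import OmegaConf  # assumption: OmegaConf, or any config object with attribute access

config = OmegaConf.load("tune_config.yaml")        # hypothetical file name
print(config.pbt_scheduler.metric)                 # "eval_f1"
print(config.param_search.hp_spance.max_steps)     # -1 ("hp_spance" is the key exactly as written above)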

Snippets of the code that uses the configuration:

def get_ray_pbt_scheduler(config):
    scheduler = PopulationBasedTraining(
        time_attr=config.pbt_scheduler.time_attr,
        metric=config.pbt_scheduler.metric,
        mode=config.pbt_scheduler.mode,
        synch=config.pbt_scheduler.synch,
        perturbation_interval=config.pbt_scheduler.perturbation_interval,
        hyperparam_mutations={
            "weight_decay": tune.uniform(*config.pbt_scheduler.hyperparam_mutations["weight_decay"]),
            "learning_rate": tune.uniform(*config.pbt_scheduler.hyperparam_mutations["learning_rate"]),
        },
    )
    return scheduler

def get_ray_hp_space(config):
    hp_space = dict()
    hp_space["per_device_train_batch_size"] = int(
        config.param_search.hp_spance.per_device_train_batch_size)
    hp_space["per_device_eval_batch_size"] = int(
        config.param_search.hp_spance.per_device_eval_batch_size)
    hp_space["num_train_epochs"] = tune.choice(
        list(config.param_search.hp_spance.num_train_epochs))
    hp_space["max_steps"] = int(
        config.param_search.hp_spance.max_steps)
    return hp_space

def param_search(self):
    if self.config.param_search.scheduler == "pbt":
        scheduler = get_ray_pbt_scheduler(self.config)
    else:
        raise NotImplementedError
    reporter = get_ray_cli_reporter()
    ray_hp_space = get_ray_hp_space(self.config)
    best_trial = self.trainer.hyperparameter_search(
        scheduler=scheduler,
        hp_space=lambda _: ray_hp_space,
        progress_reporter=reporter,
        backend=self.config.param_search.backend,
        n_trials=self.config.param_search.n_trials,
        resources_per_trial=self.config.param_search.resources_per_trial,
        keep_checkpoints_num=self.config.param_search.keep_checkpoints_num,
        checkpoint_score_attr=self.config.param_search.checkpoint_score_attr,
        stop=dict(self.config.param_search.stop),
        local_dir=self.config.param_search.local_dir,
        name=self.config.param_search.name,
        log_to_file=self.config.param_search.log_to_file,
    )
    save_best_trial(best_trial, self.config)
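
get_ray_cli_reporter and save_best_trial are referenced but not shown. For completeness, a hypothetical reconstruction of the reporter helper (mirroring the CLIReporter configuration that appears in the full example later in this thread) might look like:

from ray.tune import CLIReporter

def get_ray_cli_reporter():
    # Hypothetical reconstruction; the original helper is not included in the issue.
    return CLIReporter(
        parameter_columns={
            "weight_decay": "w_decay",
            "learning_rate": "lr",
            "per_device_train_batch_size": "train_bs/gpu",
            "num_train_epochs": "num_epochs",
        },
        metric_columns=["eval_f1", "eval_loss", "epoch", "training_iteration"],
    )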

Expected behavior

I get the same error described here: https://github.com/huggingface/transformers/issues/10247

This one:

best_model_index = checkpoints_sorted.index(str(Path(self.state.best_model_checkpoint)))
ValueError: 'results/run-34e77498/checkpoint-10' is not in list
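
For context, here is a minimal standalone sketch (hypothetical paths, plain Python, no Ray or transformers needed) of how this ValueError arises: Trainer._sorted_checkpoints looks up state.best_model_checkpoint with list.index(), and the errors reported below suggest that after a PBT perturbation the restored state can point at a checkpoint under a different trial's run directory, which is never among the checkpoints found in the current output_dir:

from pathlib import Path

# Checkpoints actually present in the failing trial's output_dir (hypothetical paths).
checkpoints_sorted = [
    "run-895a0_00001/checkpoint-4200",
    "run-895a0_00001/checkpoint-4550",
]

# best_model_checkpoint restored into the trainer state, pointing into a
# different trial's run directory (hypothetical path).
best_model_checkpoint = "run-895a0_00002/checkpoint-4550"

try:
    # Same pattern as the failing line in trainer.py's _sorted_checkpoints.
    checkpoints_sorted.index(str(Path(best_model_checkpoint)))
except ValueError as err:
    print(err)  # 'run-895a0_00002/checkpoint-4550' is not in list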

  • I am using the latest version of transformers. I thought this had been fixed already; hasn't it?
  • Is there any dependency (and specific version) I need to install to avoid this error?

mvillarreal14 avatar Mar 29 '23 00:03 mvillarreal14

This is the exact error I get:

Failure # 1 (occurred at 2023-03-29_00-47-59)
ray::ImplicitFunc.train() (pid=30599, ip=100.89.2.207, repr=_objective)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 368, in train
    raise skipped from exception_cause(skipped)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 337, in entrypoint
    return self._trainable_func(
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 654, in _trainable_func
    output = fn()
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/integrations.py", line 336, in dynamic_modules_import_trainable
    return trainable(*args, **kwargs)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/ray/tune/trainable/util.py", line 398, in inner
    return trainable(config, **fn_kwargs)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/integrations.py", line 237, in _objective
    local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/trainer.py", line 1994, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/trainer.py", line 2240, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/trainer.py", line 2388, in _save_checkpoint
    self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/trainer.py", line 2875, in _rotate_checkpoints
    checkpoints_sorted = self._sorted_checkpoints(use_mtime=use_mtime, output_dir=output_dir)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/trainer.py", line 2865, in _sorted_checkpoints
    best_model_index = checkpoints_sorted.index(str(Path(self.state.best_model_checkpoint)))
ValueError: '/opt/omniai/work/instance1/jupyter/repos/qci_rates_archive/_logs/tune_RAY/tune_transformer_pbt/run-895a0_00002/checkpoint-4550' is not in list

mvillarreal14 avatar Mar 29 '23 00:03 mvillarreal14

And it gets stuck here, even when I use 4 trials with 4 GPUs and 48 CPUs.

== Status ==
Current time: 2023-03-29 00:55:48 (running for 00:27:58.76)
Memory usage on this node: 40.6/186.6 GiB
PopulationBasedTraining: 2 checkpoints, 2 perturbs
Resources requested: 0/48 CPUs, 0/4 GPUs, 0.0/114.68 GiB heap, 0.0/53.14 GiB objects
Result logdir: /opt/omniai/work/instance1/jupyter/repos/qci_rates_archive/_logs/tune_RAY/tune_transformer_pbt
Number of trials: 4/4 (1 ERROR, 3 PAUSED)
+------------------------+----------+--------------------+------------+-------------+----------------+--------------+-----------+-------------+---------+----------------------+
| Trial name             | status   | loc                | w_decay    | lr          | train_bs/gpu   | num_epochs   | eval_f1   | eval_loss   | epoch   | training_iteration   |
|------------------------+----------+--------------------+------------+-------------+----------------+--------------+-----------+-------------+---------+----------------------|
| _objective_895a0_00000 | PAUSED   | 100.89.2.207:11371 | 0.0124815  | 1.88206e-05 | 8              | 512          | 0.762118  | 1.76025     | 12      | 12                   |
| _objective_895a0_00002 | PAUSED   | 100.89.2.207:48461 | 0.0156019  | 1.56839e-05 | 8              | 512          | 0.762118  | 1.76025     | 12      | 12                   |
| _objective_895a0_00003 | PAUSED   | 100.89.2.207:48150 | 0.00580836 | 8.6631e-05  | 8              | 512          | 0.222621  | 2.26957     | 12      | 12                   |
| _objective_895a0_00001 | ERROR    | 100.89.2.207:30599 | 0.0124815  | 1.88206e-05 | 8              | 512          | 0.754937  | 1.56503     | 9       | 9                    |
+------------------------+----------+--------------------+------------+-------------+----------------+--------------+-----------+-------------+---------+----------------------+
Number of errored trials: 1
+------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name             | # failures   | error file                                                                                                                                                                   |
|------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

mvillarreal14 avatar Mar 29 '23 00:03 mvillarreal14

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 28 '23 15:04 github-actions[bot]

I'm having the same issue with 4.26.0, any solution?

SergioG-M avatar May 24 '23 14:05 SergioG-M

Same issue with transformers==4.30.2 and ray[tune]==2.5.1

TimbusCalin avatar Jul 04 '23 19:07 TimbusCalin

This has been closed by github-actions, but the problem has not been solved ofc ...

TimbusCalin avatar Jul 04 '23 20:07 TimbusCalin

@TimbusCalin @SergioG-M So that we can help you, could you share a minimal reproducible code snippet and information about the running environment (run transformers-cli env in the terminal and copy-paste the output)?

amyeroberts avatar Jul 04 '23 21:07 amyeroberts

@amyeroberts Sure, thank you for the prompt response. Whenever I run a PopulationBasedTraining() with perturbation_interval > 1, I get an error exactly like the one mentioned above: ValueError: '/opt/omniai/work/instance1/jupyter/repos/qci_rates_archive/_logs/tune_RAY/tune_transformer_pbt/run-895a0_00002/checkpoint-4550' is not in list.

Of course, the exact checkpoint-abcd that is not in the list depends on the experiment I am running, but what I found is that it's always happening when using PopulationBasedTraining + perturbation_interval > 1.

This is the code I have (almost a copy-paste from https://docs.ray.io/en/latest/tune/examples/pbt_transformers.html#tune-huggingface-example):

"""
This example uses the official
huggingface transformers `hyperparameter_search` API.
"""
import os

import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.examples.pbt_transformers.utils import (
    download_data,
)
from utils import compute_metrics
from ray.tune.schedulers import PopulationBasedTraining
from transformers import (
    glue_tasks_num_labels,
    AutoConfig,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    GlueDataset,
    GlueDataTrainingArguments,
    TrainingArguments,
)


def tune_transformer(num_samples=8, gpus_per_trial=0, smoke_test=False):
    data_dir_name = "./data" if not smoke_test else "./test_data"
    data_dir = os.path.abspath(os.path.join(os.getcwd(), data_dir_name))
    if not os.path.exists(data_dir):
        os.mkdir(data_dir, 0o755)

    # Change these as needed.
    model_name = (
        "distilbert-base-uncased"
        if not smoke_test
        else "sshleifer/tiny-distilroberta-base"
    )
    task_name = "rte"

    task_data_dir = os.path.join(data_dir, task_name.upper())

    num_labels = glue_tasks_num_labels[task_name]

    config = AutoConfig.from_pretrained(
        model_name, num_labels=num_labels, finetuning_task=task_name
    )

    # Download and cache tokenizer, model, and features
    print("Downloading and caching Tokenizer")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Triggers tokenizer download to cache
    print("Downloading and caching pre-trained model")
    AutoModelForSequenceClassification.from_pretrained(
        model_name,
        config=config,
    )

    def get_model():
        return AutoModelForSequenceClassification.from_pretrained(
            model_name,
            config=config,
        )

    # Download data.
    download_data(task_name, data_dir)

    data_args = GlueDataTrainingArguments(task_name=task_name, data_dir=task_data_dir)

    train_dataset = GlueDataset(
        data_args, tokenizer=tokenizer, mode="train", cache_dir=task_data_dir
    )
    eval_dataset = GlueDataset(
        data_args, tokenizer=tokenizer, mode="dev", cache_dir=task_data_dir
    )

    training_args = TrainingArguments(
        output_dir=".",
        learning_rate=1e-5,  # config
        do_train=True,
        do_eval=True,
        no_cuda=gpus_per_trial <= 0,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        num_train_epochs=10,  # config
        max_steps=-1,
        per_device_train_batch_size=16,  # config
        per_device_eval_batch_size=16,  # config
        warmup_steps=0,
        weight_decay=0.1,  # config
        logging_dir="./logs",
        skip_memory_metrics=True,
        report_to="none",
    )

    trainer = Trainer(
        model_init=get_model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

    tune_config = {
        "per_device_train_batch_size": 32,
        "per_device_eval_batch_size": 32,
        "num_train_epochs": tune.choice([4,5,6,7]),
        "max_steps": 1 if smoke_test else -1,  # Used for smoke test.
    }

    scheduler = PopulationBasedTraining(
        time_attr="training_iteration",
        metric="eval_f1",
        mode="max",
        # if perturbation_interval > 1, an error like the one below occurs
        perturbation_interval=2, 
        hyperparam_mutations={
            "weight_decay": tune.uniform(0.0, 0.3),
            "learning_rate": tune.uniform(1e-5, 5e-5),
            "per_device_train_batch_size": tune.choice([16, 24, 32, 48, 64]),
        },
        quantile_fraction=0.125,
        resample_probability=0.25,
    )

    reporter = CLIReporter(
        parameter_columns={
            "weight_decay": "w_decay",
            "learning_rate": "lr",
            "per_device_train_batch_size": "train_bs/gpu",
            "num_train_epochs": "num_epochs",
        },
        metric_columns=["eval_acc", "eval_loss", "eval_f1", "epoch", "training_iteration"],
        max_progress_rows=40,
    )

    best_results = trainer.hyperparameter_search(
        hp_space=lambda _: tune_config,
        backend="ray",
        n_trials=num_samples,
        resources_per_trial={"cpu": 8, "gpu": gpus_per_trial},
        scheduler=scheduler,
        keep_checkpoints_num=1,
        direction="maximize",
        checkpoint_score_attr="training_iteration",
        stop={"training_iteration": 1} if smoke_test else None,
        progress_reporter=reporter,
        local_dir="~/ray_results/",
        name="tune_transformer_only4ptbint2",
        log_to_file=True,
    )
    print("Best hparams", best_results.hyperparameters)


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test",
        default=False,
        action="store_true",
        help="Finish quickly for testing",
    )
    args, _ = parser.parse_known_args()

    ray.init()

    if args.smoke_test:
        tune_transformer(num_samples=1, gpus_per_trial=0, smoke_test=True)
    else:
        # You can change the number of GPUs here:
        tune_transformer(num_samples=4, gpus_per_trial=1)

For example, this is the error I get now:

Failure # 1 (occurred at 2023-07-05_12-40-10)
ray::ImplicitFunc.train() (pid=74890, ip=192.168.1.139, actor_id=df9f5e052e6dd84c774b695501000000, repr=_objective)
  File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 389, in train
    raise skipped from exception_cause(skipped)
  File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 336, in entrypoint
    return self._trainable_func(
  File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 653, in _trainable_func
    output = fn()
  File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/transformers/integrations.py", line 357, in dynamic_modules_import_trainable
    return trainable(*args, **kwargs)
  File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/ray/tune/trainable/util.py", line 324, in inner
    return trainable(config, **fn_kwargs)
  File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/transformers/integrations.py", line 258, in _objective
    local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
  File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2081, in _inner_training_loop
    checkpoints_sorted = self._sorted_checkpoints(use_mtime=False, output_dir=run_dir)
  File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2986, in _sorted_checkpoints
    best_model_index = checkpoints_sorted.index(str(Path(self.state.best_model_checkpoint)))
ValueError: 'run-e6e7a_00003/checkpoint-78' is not in list
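
Until the underlying interaction is fixed upstream, one possible stop-gap is sketched below. It is not an official API or fix: it overrides the private Trainer._sorted_checkpoints method named in the traceback and simply drops a best_model_checkpoint that cannot be found, at the cost of no longer protecting the best checkpoint from rotation.

from transformers import Trainer

class TolerantTrainer(Trainer):
    """Unofficial workaround sketch: if state.best_model_checkpoint points at a
    checkpoint that is not in the current run directory (as in the traceback
    above), forget it and retry instead of crashing."""

    def _sorted_checkpoints(self, *args, **kwargs):
        try:
            return super()._sorted_checkpoints(*args, **kwargs)
        except ValueError:
            # Stale best_model_checkpoint carried over from a restored trial state.
            self.state.best_model_checkpoint = None
            return super()._sorted_checkpoints(*args, **kwargs)

Constructing TolerantTrainer(...) instead of Trainer(...) in the script above would let the PBT run continue past checkpoint rotation; an upstream fix in the Ray/Trainer integration would still be needed.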

TimbusCalin avatar Jul 05 '23 07:07 TimbusCalin

Hi @TimbusCalin, thanks for providing more details.

it's always happening when using PopulationBasedTraining + perturbation_interval > 1.

In this case, it seems that the issue is coming from the ray library and its interaction with Trainer, and is not something we can help with. I suggest raising an issue on Ray's GitHub, as they'll be better able to resolve it.

amyeroberts avatar Jul 11 '23 15:07 amyeroberts

I have had the same problem. When I changed trainer.train(resume_from_checkpoint='outputs/checkpoint-13600') to trainer.train(resume_from_checkpoint=True), the model could be loaded normally from the outputs directory.
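
In code, that change is simply (a sketch, using the example path from the comment above):

# Hard-coded path: breaks if that checkpoint has been rotated away or moved.
# trainer.train(resume_from_checkpoint="outputs/checkpoint-13600")

# Let Trainer locate the most recent checkpoint in args.output_dir itself.
trainer.train(resume_from_checkpoint=True)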

YanZheng-16 avatar Aug 17 '23 01:08 YanZheng-16

Hi @YanZheng-16, thanks for sharing this.

Could you confirm if this is also happening with ray as the backend for hyperparameter tuning and with PopulationBasedTraining + perturbation_interval > 1?

amyeroberts avatar Aug 17 '23 09:08 amyeroberts