Ray[Tune] ValueError: checkpoint not in list (still persists with latest version of transformers)
System Info
- transformers version: 4.27.3
- Platform: Linux-5.10.112-108.499.amzn2.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.11
- Huggingface_hub version: 0.12.0
- PyTorch version (GPU?): 1.10.1+cu102 (True)
- Tensorflow version (GPU?): 2.9.1 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Snippets of the configuration file:
pbt_scheduler:
  time_attr: "training_iteration"
  metric: "eval_f1"
  mode: "max"
  synch: true
  hyperparam_mutations: {"weight_decay": [0.0, 0.1], "learning_rate": [0.0000001, 0.0001]}
  perturbation_interval: 4
param_search:
  hp_spance: {"checkpoint_interval": 4, "per_device_train_batch_size": 8, "per_device_eval_batch_size": 8, "num_train_epochs": [512], "max_steps": -1}
  backend: "ray"
  n_trials: 8
  max_concurrent_trials: 8
  resources_per_trial: {"cpu": 12, "gpu": 1}
  scheduler: "pbt"
  keep_checkpoints_num: 1
  checkpoint_score_attr: "eval_f1"
  checkpoint_at_end: true
  max_failures: 5
  resume: false
  stop: {"eval_f1": 0.85, "training_iteration": 512}
  local_dir: "./_logs/tune_RAY"
  name: "tune_transformer_pbt"
  log_to_file: false
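For reference, a dot-addressable config object like the one accessed in the snippets below (e.g. `config.pbt_scheduler.time_attr`) could be produced, for example, by loading the YAML above with OmegaConf; this is only an illustration of the access pattern, with a hypothetical file name, and not necessarily how the config is actually loaded here.

# Illustration only: one way to obtain the dot-addressable config object used below.
# The loader (OmegaConf) and the file name are assumptions, not taken from this issue.
from omegaconf import OmegaConf

config = OmegaConf.load("tune_config.yaml")   # hypothetical file name
print(config.pbt_scheduler.metric)            # -> "eval_f1"
print(config.param_search.n_trials)           # -> 8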
Snippets of Code that Uses the Configuration File's Information:
def get_ray_pbt_scheduler(config):
    scheduler = PopulationBasedTraining(
        time_attr=config.pbt_scheduler.time_attr,
        metric=config.pbt_scheduler.metric,
        mode=config.pbt_scheduler.mode,
        synch=config.pbt_scheduler.synch,
        perturbation_interval=config.pbt_scheduler.perturbation_interval,
        hyperparam_mutations={
            "weight_decay": tune.uniform(*config.pbt_scheduler.hyperparam_mutations["weight_decay"]),
            "learning_rate": tune.uniform(*config.pbt_scheduler.hyperparam_mutations["learning_rate"]),
        },
    )
    return scheduler
def get_ray_hp_space(config):
    hp_space = dict()
    hp_space["per_device_train_batch_size"] = int(
        config.param_search.hp_spance.per_device_train_batch_size)
    hp_space["per_device_eval_batch_size"] = int(
        config.param_search.hp_spance.per_device_eval_batch_size)
    hp_space["num_train_epochs"] = tune.choice(
        list(config.param_search.hp_spance.num_train_epochs))
    hp_space["max_steps"] = int(
        config.param_search.hp_spance.max_steps)
    return hp_space

def param_search(self):
    if self.config.param_search.scheduler == "pbt":
        scheduler = get_ray_pbt_scheduler(self.config)
    else:
        raise NotImplementedError
    reporter = get_ray_cli_reporter()
    ray_hp_space = get_ray_hp_space(self.config)
    best_trial = self.trainer.hyperparameter_search(
        scheduler=scheduler,
        hp_space=lambda _: ray_hp_space,
        progress_reporter=reporter,
        backend=self.config.param_search.backend,
        n_trials=self.config.param_search.n_trials,
        resources_per_trial=self.config.param_search.resources_per_trial,
        keep_checkpoints_num=self.config.param_search.keep_checkpoints_num,
        checkpoint_score_attr=self.config.param_search.checkpoint_score_attr,
        stop=dict(self.config.param_search.stop),
        local_dir=self.config.param_search.local_dir,
        name=self.config.param_search.name,
        log_to_file=self.config.param_search.log_to_file,
    )
    save_best_trial(best_trial, self.config)
Expected behavior
I get the same error described here: https://github.com/huggingface/transformers/issues/10247
This one:
best_model_index = checkpoints_sorted.index(str(Path(self.state.best_model_checkpoint)))
ValueError: 'results/run-34e77498/checkpoint-10' is not in list
- I am using the latest version of transformers. I thought this had been fixed, hasn't it?
- Is there any dependency (and specific version) I need to install to avoid this error?
This is the exact error I get:
Failure # 1 (occurred at 2023-03-29_00-47-59)
ray::ImplicitFunc.train() (pid=30599, ip=100.89.2.207, repr=_objective)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 368, in train
    raise skipped from exception_cause(skipped)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 337, in entrypoint
    return self._trainable_func(
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 654, in _trainable_func
    output = fn()
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/integrations.py", line 336, in dynamic_modules_import_trainable
    return trainable(*args, **kwargs)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/ray/tune/trainable/util.py", line 398, in inner
    return trainable(config, **fn_kwargs)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/integrations.py", line 237, in _objective
    local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/trainer.py", line 1994, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/trainer.py", line 2240, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/trainer.py", line 2388, in _save_checkpoint
    self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/trainer.py", line 2875, in _rotate_checkpoints
    checkpoints_sorted = self._sorted_checkpoints(use_mtime=use_mtime, output_dir=output_dir)
  File "/opt/omniai/work/instance1/jupyter/envs/qci/lib/python3.8/site-packages/transformers/trainer.py", line 2865, in _sorted_checkpoints
    best_model_index = checkpoints_sorted.index(str(Path(self.state.best_model_checkpoint)))
ValueError: '/opt/omniai/work/instance1/jupyter/repos/qci_rates_archive/_logs/tune_RAY/tune_transformer_pbt/run-895a0_00002/checkpoint-4550' is not in list
And it gets stuck here even when I use 4 trials with 4 GPUs and 48 CPUs.
== Status ==
Current time: 2023-03-29 00:55:48 (running for 00:27:58.76)
Memory usage on this node: 40.6/186.6 GiB
PopulationBasedTraining: 2 checkpoints, 2 perturbs
Resources requested: 0/48 CPUs, 0/4 GPUs, 0.0/114.68 GiB heap, 0.0/53.14 GiB objects
Result logdir: /opt/omniai/work/instance1/jupyter/repos/qci_rates_archive/_logs/tune_RAY/tune_transformer_pbt
Number of trials: 4/4 (1 ERROR, 3 PAUSED)
+------------------------+----------+--------------------+------------+-------------+----------------+--------------+-----------+-------------+---------+----------------------+
| Trial name             | status   | loc                | w_decay    | lr          | train_bs/gpu   | num_epochs   | eval_f1   | eval_loss   | epoch   | training_iteration   |
|------------------------+----------+--------------------+------------+-------------+----------------+--------------+-----------+-------------+---------+----------------------|
| _objective_895a0_00000 | PAUSED   | 100.89.2.207:11371 | 0.0124815  | 1.88206e-05 | 8              | 512          | 0.762118  | 1.76025     | 12      | 12                   |
| _objective_895a0_00002 | PAUSED   | 100.89.2.207:48461 | 0.0156019  | 1.56839e-05 | 8              | 512          | 0.762118  | 1.76025     | 12      | 12                   |
| _objective_895a0_00003 | PAUSED   | 100.89.2.207:48150 | 0.00580836 | 8.6631e-05  | 8              | 512          | 0.222621  | 2.26957     | 12      | 12                   |
| _objective_895a0_00001 | ERROR    | 100.89.2.207:30599 | 0.0124815  | 1.88206e-05 | 8              | 512          | 0.754937  | 1.56503     | 9       | 9                    |
+------------------------+----------+--------------------+------------+-------------+----------------+--------------+-----------+-------------+---------+----------------------+
Number of errored trials: 1
+------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name             | # failures   | error file                                                                                                                                                                 |
|------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
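To make the failure easier to see in isolation, here is a minimal standalone sketch of the lookup that raises (illustration only, not the transformers source; the paths are made up to mirror the traceback above, and the reading that best_model_checkpoint still points into another trial's run directory after a PBT perturbation is my interpretation):

# Minimal sketch of the failing lookup in _sorted_checkpoints (paths are made up).
from pathlib import Path

checkpoints_sorted = [
    "run-895a0_00001/checkpoint-4500",   # checkpoints actually present in the current run dir
    "run-895a0_00001/checkpoint-4550",
]
best_model_checkpoint = "run-895a0_00002/checkpoint-4550"  # path recorded for another trial

best_model_index = checkpoints_sorted.index(str(Path(best_model_checkpoint)))
# ValueError: 'run-895a0_00002/checkpoint-4550' is not in list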
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I'm having the same issue with 4.26.0, any solution?
Same issue with transformers==4.30.2 and ray[tune]==2.51.0
This has been closed by github-actions, but the problem has not been solved ofc ...
@TimbusCalin @SergioG-M So that we can help you, could you share a minimal reproducible code snippet and information about the running environment (run `transformers-cli env` in the terminal and copy-paste the output)?
@amyeroberts Sure, thank you for the prompt response. So whenever I run a PopulationBasedTraining() with perturbation_interval > 1, I get such an error, exactly like the one mentioned above. ValueError: '/opt/omniai/work/instance1/jupyter/repos/qci_rates_archive/_logs/tune_RAY/tune_transformer_pbt/run-895a0_00002/checkpoint-4550' is not in list
Of course, the exact `checkpoint-abcd` that is not in the list depends on the experiment I am running, but what I found is that it always happens when using PopulationBasedTraining with perturbation_interval > 1.
This is the code I have (almost a copy-paste from https://docs.ray.io/en/latest/tune/examples/pbt_transformers.html#tune-huggingface-example):
"""
This example uses the official
huggingface transformers `hyperparameter_search` API.
"""
import os
import ray
from ray import tune
from ray.tune import CLIReporter
from ray.tune.examples.pbt_transformers.utils import (
download_data,
)
from utils import compute_metrics
from ray.tune.schedulers import PopulationBasedTraining
from transformers import (
glue_tasks_num_labels,
AutoConfig,
AutoModelForSequenceClassification,
AutoTokenizer,
Trainer,
GlueDataset,
GlueDataTrainingArguments,
TrainingArguments,
)
def tune_transformer(num_samples=8, gpus_per_trial=0, smoke_test=False):
    data_dir_name = "./data" if not smoke_test else "./test_data"
    data_dir = os.path.abspath(os.path.join(os.getcwd(), data_dir_name))
    if not os.path.exists(data_dir):
        os.mkdir(data_dir, 0o755)

    # Change these as needed.
    model_name = (
        "distilbert-base-uncased"
        if not smoke_test
        else "sshleifer/tiny-distilroberta-base"
    )
    task_name = "rte"
    task_data_dir = os.path.join(data_dir, task_name.upper())

    num_labels = glue_tasks_num_labels[task_name]

    config = AutoConfig.from_pretrained(
        model_name, num_labels=num_labels, finetuning_task=task_name
    )

    # Download and cache tokenizer, model, and features
    print("Downloading and caching Tokenizer")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Triggers tokenizer download to cache
    print("Downloading and caching pre-trained model")
    AutoModelForSequenceClassification.from_pretrained(
        model_name,
        config=config,
    )

    def get_model():
        return AutoModelForSequenceClassification.from_pretrained(
            model_name,
            config=config,
        )

    # Download data.
    download_data(task_name, data_dir)

    data_args = GlueDataTrainingArguments(task_name=task_name, data_dir=task_data_dir)

    train_dataset = GlueDataset(
        data_args, tokenizer=tokenizer, mode="train", cache_dir=task_data_dir
    )
    eval_dataset = GlueDataset(
        data_args, tokenizer=tokenizer, mode="dev", cache_dir=task_data_dir
    )

    training_args = TrainingArguments(
        output_dir=".",
        learning_rate=1e-5,  # config
        do_train=True,
        do_eval=True,
        no_cuda=gpus_per_trial <= 0,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        num_train_epochs=10,  # config
        max_steps=-1,
        per_device_train_batch_size=16,  # config
        per_device_eval_batch_size=16,  # config
        warmup_steps=0,
        weight_decay=0.1,  # config
        logging_dir="./logs",
        skip_memory_metrics=True,
        report_to="none",
    )

    trainer = Trainer(
        model_init=get_model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

    tune_config = {
        "per_device_train_batch_size": 32,
        "per_device_eval_batch_size": 32,
        "num_train_epochs": tune.choice([4, 5, 6, 7]),
        "max_steps": 1 if smoke_test else -1,  # Used for smoke test.
    }

    scheduler = PopulationBasedTraining(
        time_attr="training_iteration",
        metric="eval_f1",
        mode="max",
        # if perturbation_interval > 1, such an error as the one below occurs
        perturbation_interval=2,
        hyperparam_mutations={
            "weight_decay": tune.uniform(0.0, 0.3),
            "learning_rate": tune.uniform(1e-5, 5e-5),
            "per_device_train_batch_size": tune.choice([16, 24, 32, 48, 64]),
        },
        quantile_fraction=0.125,
        resample_probability=0.25,
    )

    reporter = CLIReporter(
        parameter_columns={
            "weight_decay": "w_decay",
            "learning_rate": "lr",
            "per_device_train_batch_size": "train_bs/gpu",
            "num_train_epochs": "num_epochs",
        },
        metric_columns=["eval_acc", "eval_loss", "eval_f1", "epoch", "training_iteration"],
        max_progress_rows=40,
    )

    best_results = trainer.hyperparameter_search(
        hp_space=lambda _: tune_config,
        backend="ray",
        n_trials=num_samples,
        resources_per_trial={"cpu": 8, "gpu": gpus_per_trial},
        scheduler=scheduler,
        keep_checkpoints_num=1,
        direction="maximize",
        checkpoint_score_attr="training_iteration",
        stop={"training_iteration": 1} if smoke_test else None,
        progress_reporter=reporter,
        local_dir="~/ray_results/",
        name="tune_transformer_only4ptbint2",
        log_to_file=True,
    )

    print("Best hparams", best_results.hyperparameters)


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--smoke-test",
        default=False,
        action="store_true",
        help="Finish quickly for testing",
    )
    args, _ = parser.parse_known_args()

    ray.init()

    if args.smoke_test:
        tune_transformer(num_samples=1, gpus_per_trial=0, smoke_test=True)
    else:
        # You can change the number of GPUs here:
        tune_transformer(num_samples=4, gpus_per_trial=1)
For example, this is the error I get now:
Failure # 1 (occurred at 2023-07-05_12-40-10)
ray::ImplicitFunc.train() (pid=74890, ip=192.168.1.139, actor_id=df9f5e052e6dd84c774b695501000000, repr=_objective)
File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 389, in train
raise skipped from exception_cause(skipped)
File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 336, in entrypoint
return self._trainable_func(
File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 653, in _trainable_func
output = fn()
File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/transformers/integrations.py", line 357, in dynamic_modules_import_trainable
return trainable(*args, **kwargs)
File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/ray/tune/trainable/util.py", line 324, in inner
return trainable(config, **fn_kwargs)
File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/transformers/integrations.py", line 258, in _objective
local_trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2081, in _inner_training_loop
checkpoints_sorted = self._sorted_checkpoints(use_mtime=False, output_dir=run_dir)
File "/home/calin/PycharmProjects/hparams_search/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2986, in _sorted_checkpoints
best_model_index = checkpoints_sorted.index(str(Path(self.state.best_model_checkpoint)))
ValueError: 'run-e6e7a_00003/checkpoint-78' is not in list
Hi @TimbusCalin, thanks for providing more details.
> it's always happening when using PopulationBasedTraining + perturbation_interval > 1.
In this case, it seems that the issue is coming from the ray library and its interactions with Trainer, and is not something we can help with. I suggest raising an issue on ray's GitHub, as they'll be better able to resolve it.
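Not an official fix, but in case it helps anyone hitting this in the meantime: below is a sketch of a defensive workaround that subclasses `Trainer` so checkpoint sorting does not hard-fail when `best_model_checkpoint` is not among the checkpoints of the current run directory. The retry-without-best logic is my own assumption and is untested across versions, so treat it as a starting point only.

# Sketch only: tolerate a best_model_checkpoint that lives outside the current run dir
# (as seems to happen after a PBT perturbation). The class name is hypothetical.
from transformers import Trainer

class PBTSafeTrainer(Trainer):
    def _sorted_checkpoints(self, *args, **kwargs):
        try:
            return super()._sorted_checkpoints(*args, **kwargs)
        except ValueError:
            # The parent raised because best_model_checkpoint was not found in this
            # run dir's checkpoint list; retry without it, then restore the state.
            best = self.state.best_model_checkpoint
            self.state.best_model_checkpoint = None
            try:
                return super()._sorted_checkpoints(*args, **kwargs)
            finally:
                self.state.best_model_checkpoint = best

Using it only requires instantiating `PBTSafeTrainer` in place of `Trainer`; since it changes behaviour only on the failing code path, it is easy to drop again once the interaction between ray and Trainer is sorted out upstream.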
I have had the same problem. When I changed `trainer.train(resume_from_checkpoint='outputs/checkpoint-13600')` to `trainer.train(resume_from_checkpoint=True)`, the model could be loaded normally from the outputs directory.
Hi @YanZheng-16, thanks for sharing this.
Could you confirm if this is also happening with ray as the backend for hyperparameter tuning and with PopulationBasedTraining + perturbation_interval > 1?
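For reference, my understanding (worth double-checking against the installed transformers version) is that `resume_from_checkpoint=True` makes the Trainer resolve the most recent `checkpoint-*` folder under `output_dir` by itself, roughly equivalent to:

# Roughly what resume_from_checkpoint=True does internally, as far as I understand it;
# "outputs" here stands for the output_dir used in the comment above.
from transformers.trainer_utils import get_last_checkpoint

last_checkpoint = get_last_checkpoint("outputs")  # newest checkpoint-* dir, or None if absent
trainer.train(resume_from_checkpoint=last_checkpoint)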