ray hyperparameter_search - ModuleNotFoundError: No module named 'evaluate_modules'
System Info
- transformers version: 4.26.1
- Platform: Linux-5.4.0-1097-aws-x86_64-with-glibc2.35
- Python version: 3.10.6
- Huggingface_hub version: 0.13.3
- PyTorch version (GPU?): 1.13.1+cu117 (True)
- Tensorflow version (GPU?): 2.11.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: not explicitly; selected "ray" as the trainer.hyperparameter_search backend on a Databricks cluster with 2 workers
Who can help?
@richardliaw, @amogkam
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Note
I see a similar issue, https://github.com/huggingface/transformers/issues/11565; would a similar fix also apply in this case?
Code snippet
"""
tokenizer = ...
small_train_dataset = ...
small_test_dataset = ...
data_collator = ...
"""
###
import numpy as np
import evaluate
f1_metric = evaluate.load("f1")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return f1_metric.compute(predictions=predictions, references=labels)
###
from transformers import AutoModelForSequenceClassification
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        base_model, num_labels=2, return_dict=True)
###
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir=training_output_dir,
    evaluation_strategy="steps",
    eval_steps=500,
    save_total_limit=20,
    disable_tqdm=True,
)
###
trainer = Trainer(
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=small_train_dataset,
    eval_dataset=small_test_dataset,
    model_init=model_init,
    compute_metrics=compute_metrics,  # uses compute_metrics defined above
    data_collator=data_collator,
)
###
# the code that triggered error
trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    n_trials=10,  # number of trials
)
Error Message
The same error appeared for every trial (all 10 trials failed):
2023-03-24 13:08:07,642 ERROR trial_runner.py:1062 -- Trial _objective_d2895_00000: Error processing event.
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 1276, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/_private/worker.py", line 2380, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::ImplicitFunc.train() (pid=1068, ip=10.68.133.32, repr=_objective)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 368, in train
    raise skipped from exception_cause(skipped)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 337, in entrypoint
    return self._trainable_func(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 654, in _trainable_func
    output = fn()
  File "/databricks/python/lib/python3.10/site-packages/transformers/integrations.py", line 332, in dynamic_modules_import_trainable
    return trainable(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/tune/trainable/util.py", line 397, in inner
    fn_kwargs[k] = parameter_registry.get(prefix + k)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/tune/registry.py", line 244, in get
    return ray.get(self.references[k])
ray.exceptions.RaySystemError: System error: No module named 'evaluate_modules'
traceback: Traceback (most recent call last):
ModuleNotFoundError: No module named 'evaluate_modules'
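One possible reading of this traceback, not confirmed in the thread: `evaluate.load("f1")` defines the metric class inside a package that `evaluate` creates at runtime (`evaluate_modules`). Pickling `f1_metric`, which `compute_metrics` references and the `Trainer` holds, records only that module name plus the class name, so a Ray worker that never created the package fails when Tune's parameter registry unpickles the trainer (the `parameter_registry.get` frame above). The standard-library sketch below reproduces that failure mode with a stand-in module named `demo_dynamic_pkg`; the names are hypothetical and not part of `evaluate`.

```python
import pickle
import sys
import types

# Stand-in for a package created at runtime (hypothetical name).
dynamic_pkg = types.ModuleType("demo_dynamic_pkg")
sys.modules["demo_dynamic_pkg"] = dynamic_pkg


class DemoMetric:
    """Plays the role of a metric class defined inside the dynamic package."""


# Make the class look as if it was defined in the dynamic package.
DemoMetric.__module__ = "demo_dynamic_pkg"
dynamic_pkg.DemoMetric = DemoMetric

# The driver can pickle an instance: the class is stored by reference
# (module name + qualified name), not by value.
payload = pickle.dumps(DemoMetric())

# Simulate a fresh worker process that never created the package.
del sys.modules["demo_dynamic_pkg"]

error_message = None
try:
    pickle.loads(payload)
except ModuleNotFoundError as exc:
    error_message = str(exc)

print(error_message)  # No module named 'demo_dynamic_pkg'
```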
Expected behavior
According to the blog post (https://huggingface.co/blog/ray-tune), I would expect each trial to complete without errors.
Can you try moving import evaluate, f1_metric, and compute_metrics into model_init for now?
This is a workaround that should unblock you.
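A different route from the one suggested above, offered only as a sketch: if the F1 score is all you need, computing it with NumPy keeps any `evaluate` object out of what Ray pickles. This assumes binary labels with 1 as the positive class, which we believe matches the default `average="binary"`, `pos_label=1` behaviour of the `f1` metric; check it against your `evaluate` version before relying on it.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Binary F1 computed with NumPy only, so no `evaluate` object is captured.

    Assumption: label 1 is the positive class.
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)
    tp = int(np.sum((predictions == 1) & (labels == 1)))
    fp = int(np.sum((predictions == 1) & (labels == 0)))
    fn = int(np.sum((predictions == 0) & (labels == 1)))
    denom = 2 * tp + fp + fn
    # Report 0.0 when there are no positive predictions and no positive labels.
    f1 = 2 * tp / denom if denom else 0.0
    return {"f1": f1}


if __name__ == "__main__":
    logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
    labels = np.array([1, 0, 0, 1])
    print(compute_metrics((logits, labels)))  # {'f1': 0.5}
```

It can be passed to the Trainer as compute_metrics=compute_metrics, exactly like the original function.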
We need to fix this import the same way as in this previous PR: https://github.com/huggingface/transformers/pull/12749
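The `dynamic_modules_import_trainable` wrapper visible in the traceback appears to do something similar already for `datasets` modules: before the trainable runs, it loads the package's `__init__.py` from disk and registers it in `sys.modules`. A library-free sketch of that pattern follows. `with_dynamic_modules` is a hypothetical helper, and where the real `evaluate_modules` directory lives on each worker (and whether it exists there at all) is an assumption this sketch does not address.

```python
import functools
import importlib.util
import os
import sys


def with_dynamic_modules(package_name, package_dir):
    """Hypothetical helper: make an on-disk package importable before `fn` runs.

    `package_dir` must contain an `__init__.py`. The package is loaded from
    that file and registered in `sys.modules`, so later imports of it (and of
    its submodules, via `__path__`) succeed even though `package_dir`'s parent
    is not on `sys.path`.
    """

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if package_name not in sys.modules:
                init_path = os.path.join(package_dir, "__init__.py")
                spec = importlib.util.spec_from_file_location(
                    package_name,
                    init_path,
                    submodule_search_locations=[package_dir],
                )
                module = importlib.util.module_from_spec(spec)
                # Register before executing, as the import system itself does.
                sys.modules[package_name] = module
                spec.loader.exec_module(module)
            return fn(*args, **kwargs)

        return wrapper

    return decorator
```

Wrapping the trainable (rather than compute_metrics) matters here because, per the traceback, the failure happens while Tune unpickles the trainer, before any metric code runs.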
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.