
ray hyperparameter_search - ModuleNotFoundError: No module named 'evaluate_modules'

Open es94129 opened this issue 2 years ago • 2 comments

System Info

  • transformers version: 4.26.1
  • Platform: Linux-5.4.0-1097-aws-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.13.3
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): 2.11.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: not explicitly, selected "ray" as trainer.hyperparameter_search backend on a Databricks cluster with 2 workers

Who can help?

@richardliaw, @amogkam

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Note

I did find a similar issue, https://github.com/huggingface/transformers/issues/11565. Would a similar fix also apply in this case?

Code snippet

"""
tokenizer = ...
small_train_dataset = ...
small_test_dataset = ...
data_collator = ...
"""

###

import numpy as np
import evaluate
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return f1_metric.compute(predictions=predictions, references=labels)

###

from transformers import AutoModelForSequenceClassification

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        base_model, num_labels=2, return_dict=True)

###

from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir=training_output_dir,
    evaluation_strategy="steps",
    eval_steps=500,
    save_total_limit=20,
    disable_tqdm=True,
)

### 

trainer = Trainer(
  args=training_args,
  tokenizer=tokenizer,
  train_dataset=small_train_dataset,
  eval_dataset=small_test_dataset,
  model_init=model_init,
  compute_metrics=compute_metrics,   # uses compute_metrics defined above
  data_collator=data_collator,
)

###

# the code that triggered the error

trainer.hyperparameter_search(
  direction="maximize", 
  backend="ray", 
  n_trials=10 # number of trials
)

Error Message

The same error showed up for every trial (all 10 trials failed):

2023-03-24 13:08:07,642	ERROR trial_runner.py:1062 -- Trial _objective_d2895_00000: Error processing event.
Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/tune/execution/ray_trial_executor.py", line 1276, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/_private/worker.py", line 2380, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::ImplicitFunc.train() (pid=1068, ip=10.68.133.32, repr=_objective)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 368, in train
    raise skipped from exception_cause(skipped)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 337, in entrypoint
    return self._trainable_func(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/tune/trainable/function_trainable.py", line 654, in _trainable_func
    output = fn()
  File "/databricks/python/lib/python3.10/site-packages/transformers/integrations.py", line 332, in dynamic_modules_import_trainable
    return trainable(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/tune/trainable/util.py", line 397, in inner
    fn_kwargs[k] = parameter_registry.get(prefix + k)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-585a9e45-1e91-40e0-a214-8e2132580d15/lib/python3.10/site-packages/ray/tune/registry.py", line 244, in get
    return ray.get(self.references[k])
ray.exceptions.RaySystemError: System error: No module named 'evaluate_modules'
traceback: Traceback (most recent call last):
ModuleNotFoundError: No module named 'evaluate_modules'

Expected behavior

According to the blog post (https://huggingface.co/blog/ray-tune), I would expect each trial to complete without errors.

es94129, Mar 27 '23 18:03

Can you try moving import evaluate, f1_metric, and compute_metrics into model_init for now? This is a workaround that should unblock you. We need to fix this import the same way as in this previous PR: https://github.com/huggingface/transformers/pull/12749

gjoliver, Mar 30 '23 06:03

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot], Apr 27 '23 15:04