
[Bug]: Unable to initialize AnomalibMlflowLogger inside engine.from_config

Open jrusbo opened this issue 1 year ago • 7 comments

Describe the bug

I'm trying to initialize MLflow with the from_config function of the Engine class. For this purpose, I'm calling the function like this: engine, model, datamodule = Engine.from_config(config_path, **mlflow_logger)

The mlflow_logger variable looks like this:

mlflow_logger = {
    'trainer': {
        'logger': {
            'class_path': 'anomalib.loggers.mlflow.AnomalibMLFlowLogger',
            'init_args': {
                'experiment_name': custom_config["experiment_tracking"]["experiment_name"],
                'run_name': run_name,
                'tracking_uri': mlflow_tracking_uri,
                'log_model': True,
                'run_id': run.info.run_id,
            },
        },
    },
}

The problem is that I get an assertion error from the Lightning side:

File "venv2\Lib\site-packages\lightning\pytorch\cli.py", line 252, in setup assert log_dir is not None ^^^^^^^^^^^^^^^^^^^

I already tried with Lightning 2.1, 2.2 and 2.3, and they all give me the same assertion error.

I get the error when I call engine.fit, not before. This happens with both the MLFlowLogger from Lightning and the AnomalibMLFlowLogger.

Dataset

Folder

Model

Other (please specify in the field below)

Steps to reproduce the behavior

  1. Initialize the mlflow_configuration dictionary
  2. Initialize the engine, datamodule and model with Engine.from_config(path, **mlflow_configuration)
  3. Call engine.fit (see the combined sketch below)
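
Putting those steps together, a minimal reproduction might look like the following sketch; the config path, experiment name, run name and tracking URI are placeholders:

from anomalib.engine import Engine

# Hypothetical MLflow settings -- replace with your own experiment name,
# run name and tracking server URI.
mlflow_configuration = {
    "trainer": {
        "logger": {
            "class_path": "anomalib.loggers.mlflow.AnomalibMLFlowLogger",
            "init_args": {
                "experiment_name": "my_experiment",
                "run_name": "my_run",
                "tracking_uri": "http://localhost:5000",
                "log_model": True,
            },
        },
    },
}

# Steps 1-2: build engine, model and datamodule from the YAML config plus the logger override.
engine, model, datamodule = Engine.from_config("config.yaml", **mlflow_configuration)

# Step 3: the AssertionError (assert log_dir is not None) is raised from here.
engine.fit(model=model, datamodule=datamodule)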

OS information

OS information:

  • OS: Windows 11
  • Python version: 3.11
  • Anomalib version: 1.1.0
  • PyTorch version: 2.2.2

Expected behavior

The MLFlowLogger gets passed to the Engine correctly and the fit method uses that logger to log.

Screenshots

No response

Pip/GitHub

pip

What version/branch did you use?

No response

Configuration YAML

data:
  class_path: anomalib.data.Folder
  init_args:
    name: name
    root: some_folder
    normal_dir: some folder
    abnormal_dir:
      - some category
    normal_test_dir: null
    mask_dir:  masks
    normal_split_ratio: 0
    extensions: null
    train_batch_size: 1
    eval_batch_size: 1
    num_workers: 2
    image_size:
      - 256
      - 256
    transform: null
    train_transform: null
    eval_transform: null
    test_split_mode: from_dir
    test_split_ratio: 0.3
    val_split_mode: same_as_test
    val_split_ratio: 0.3
    seed: 42

model:
  class_path: anomalib.models.EfficientAd
  init_args:
    imagenet_dir: datasets/imagenette
    teacher_out_channels: 384
    model_size: S
    lr: 0.0001
    weight_decay: 0.00001
    padding: false
    pad_maps: true
metrics:
  image:
    - AUROC
    - AUPR
  pixel: null
  threshold:
    class_path: anomalib.metrics.F1AdaptiveThreshold
    init_args:
      default_value: 0.5
      thresholds: null
      ignore_index: null
      validate_args: true
      compute_on_cpu: false
      dist_sync_on_step: false
      sync_on_compute: true
      compute_with_cache: true

logging:
  log_graph: true

trainer:
  accelerator: auto
  strategy: auto
  devices: 1
  num_nodes: 1
  precision: 32
  fast_dev_run: false
  max_epochs: 2
  min_epochs: null
  max_steps: 70000
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  overfit_batches: 0.0
  val_check_interval: 1.0
  check_val_every_n_epoch: 1
  num_sanity_val_steps: 0
  log_every_n_steps: 50
  enable_checkpointing: true
  enable_progress_bar: true
  enable_model_summary: true
  accumulate_grad_batches: 1
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  deterministic: false
  benchmark: null
  inference_mode: true
  use_distributed_sampler: true
  profiler: null
  detect_anomaly: false
  barebones: false
  plugins: null
  sync_batchnorm: false
  reload_dataloaders_every_n_epochs: 0

normalization:
  normalization_method: min_max

task: segmentation
default_root_dir: results

Logs

File "train_anomalib.py", line 91, in train
    engine.fit(model=model, datamodule=datamodule)
  File "venv2\Lib\site-packages\anomalib\engine\engine.py", line 540, in fit
    self.trainer.fit(model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
  File "venv2\Lib\site-packages\lightning\pytorch\trainer\trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "venv2\Lib\site-packages\lightning\pytorch\trainer\call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv2\Lib\site-packages\lightning\pytorch\trainer\trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "venv2\Lib\site-packages\lightning\pytorch\trainer\trainer.py", line 948, in _run
    call._call_setup_hook(self)  # allow user to set up LightningModule in accelerator environment
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv2\Lib\site-packages\lightning\pytorch\trainer\call.py", line 95, in _call_setup_hook
    _call_callback_hooks(trainer, "setup", stage=fn)
  File "venv2\Lib\site-packages\lightning\pytorch\trainer\call.py", line 210, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "venv2\Lib\site-packages\lightning\pytorch\cli.py", line 252, in setup
    assert log_dir is not None
           ^^^^^^^^^^^^^^^^^^^
AssertionError

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

jrusbo avatar Jul 23 '24 14:07 jrusbo

@harimkang, any thoughts?

samet-akcay avatar Jul 24 '24 08:07 samet-akcay

Let me check!

harimkang avatar Jul 24 '24 08:07 harimkang

https://github.com/Lightning-AI/pytorch-lightning/blob/f91349c961103af48091654775248789b6e03bd1/src/lightning/pytorch/trainer/trainer.py#L1213-L1234 I think the log_dir in lightning.Trainer expects the Logger to have a log_dir, but the AnomalibMLFlowLogger has save_dir instead of log_dir.

So I think it's the wrong logger; could you please confirm? @ashwinvaidya17
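
If I read the linked code correctly, Trainer.log_dir falls back to loggers[0].save_dir for non-TensorBoard loggers, and MLFlowLogger.save_dir is only a local path for file-based tracking URIs. A minimal sketch to illustrate, using Lightning's stock MLFlowLogger (requires the mlflow package; URIs are placeholders):

from lightning.pytorch.loggers import MLFlowLogger

# With a local file-based tracking URI, save_dir resolves to a real path ...
local_logger = MLFlowLogger(experiment_name="demo", tracking_uri="file:./mlruns")
print(local_logger.save_dir)  # ./mlruns

# ... but with a remote tracking server it is None. Trainer.log_dir then also
# becomes None, and the `assert log_dir is not None` in the CLI's
# SaveConfigCallback.setup fails.
remote_logger = MLFlowLogger(experiment_name="demo", tracking_uri="http://localhost:5000")
print(remote_logger.save_dir)  # None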

harimkang avatar Jul 24 '24 08:07 harimkang

The base class uses save_dir, which is why AnomalibMLFlowLogger has that parameter. I am not sure if that's the cause of the error. @jrusbo, I am not able to reproduce your error. Here are my details. The only difference is that I am using the MVTec dataset. From your YAML file I can also see that compute_with_cache: true is a new parameter.

anomalib: main branch
pytorch-lightning==2.2.0.post0
torch==2.1.2+cu118
torchaudio==2.2.1+cu118
torchmetrics==0.10.3
torchvision==0.16.2+cu118
from anomalib.engine import Engine

if __name__ == "__main__":
    mlflow_configuration = {
        "trainer": {
            "logger": {
                "class_path": "anomalib.loggers.mlflow.AnomalibMLFlowLogger",
                "init_args": {
                    "experiment_name": "test_experiment",
                    "run_name": "test_run",
                    "log_model": True,
                },
            },
        },
    }
    engine, model, datamodule = Engine.from_config("./nbs/bug_2209.yaml", **mlflow_configuration)
    engine.fit(model, datamodule)

YAML

# anomalib==1.2.0dev
model:
  class_path: anomalib.models.EfficientAd
  init_args:
    imagenet_dir: datasets/imagenette
    teacher_out_channels: 384
    model_size: S
    lr: 0.0001
    weight_decay: 0.00001
    padding: false
    pad_maps: true
metrics:
  image:
    - AUROC
    - AUPR
  pixel: null
  threshold:
    class_path: anomalib.metrics.F1AdaptiveThreshold
    init_args:
      default_value: 0.5
      thresholds: null
      ignore_index: null
      validate_args: true
      compute_on_cpu: false
      dist_sync_on_step: false
      sync_on_compute: true

logging:
  log_graph: true

trainer:
  accelerator: auto
  strategy: auto
  devices: 1
  num_nodes: 1
  precision: 32
  fast_dev_run: false
  max_epochs: 2
  min_epochs: null
  max_steps: 70000
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  overfit_batches: 0.0
  val_check_interval: 1.0
  check_val_every_n_epoch: 1
  num_sanity_val_steps: 0
  log_every_n_steps: 50
  enable_checkpointing: true
  enable_progress_bar: true
  enable_model_summary: true
  accumulate_grad_batches: 1
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  deterministic: false
  benchmark: null
  inference_mode: true
  use_distributed_sampler: true
  profiler: null
  detect_anomaly: false
  barebones: false
  plugins: null
  sync_batchnorm: false
  reload_dataloaders_every_n_epochs: 0

normalization:
  normalization_method: min_max

task: segmentation
default_root_dir: results

data:
  class_path: anomalib.data.MVTec
  init_args:
    root: datasets/MVTec
    category: bottle
    train_batch_size: 1
    eval_batch_size: 1
    num_workers: 8
    image_size: null
    transform: null
    train_transform: null
    eval_transform: null
    test_split_mode: FROM_DIR
    test_split_ratio: 0.2
    val_split_mode: SAME_AS_TEST
    val_split_ratio: 0.5
    seed: null

ashwinvaidya17 avatar Jul 24 '24 11:07 ashwinvaidya17

@ashwinvaidya17 the code you provided also works for me. I was able to confirm that the problem appears when I put a tracking_uri in the mlflow_configuration. Maybe it is an MLflow problem? Could you check whether you get the error if you put a tracking_uri inside the mlflow_configuration?
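
Concretely, the failing case only adds a tracking URI to the init_args of the snippet above (the URI here is a placeholder for a real MLflow server):

mlflow_configuration["trainer"]["logger"]["init_args"]["tracking_uri"] = "http://localhost:5000"  # placeholder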

jrusbo avatar Jul 25 '24 13:07 jrusbo

Hello! I have the same problem when trying to use an MLflow server. I used the code provided by @ashwinvaidya17, adding tracking_uri to the arguments.

lilruwu avatar Aug 29 '24 13:08 lilruwu

@lilruwu It's an issue with Lightning CLI; you can check this thread for more information -> #16310

Everything works fine as long as you don't specify the tracking_uri argument for MLFlowLogger.

For now you can just add a TensorBoard logger with a save_dir to your config like this:

trainer:
  logger:
    - class_path: anomalib.loggers.AnomalibTensorBoardLogger
      init_args:
        save_dir: "tb_logs"
    - class_path: anomalib.loggers.AnomalibMLFlowLogger
      init_args:
        experiment_name: "SuperSecretExperiment"
        tracking_uri: "http://tracking_uri:8080"

This fixed it for me.
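
For anyone building the override in Python instead of YAML (as in the original report), the same workaround could be expressed as a dict. A sketch with placeholder config path, experiment name and tracking URI; I haven't verified the list-of-loggers form through Engine.from_config kwargs, but it mirrors the YAML above:

from anomalib.engine import Engine

mlflow_configuration = {
    "trainer": {
        "logger": [
            {
                # TensorBoard logger listed first, so Trainer.log_dir gets a concrete save_dir.
                "class_path": "anomalib.loggers.AnomalibTensorBoardLogger",
                "init_args": {"save_dir": "tb_logs"},
            },
            {
                # MLflow logger with the remote tracking server comes second.
                "class_path": "anomalib.loggers.AnomalibMLFlowLogger",
                "init_args": {
                    "experiment_name": "SuperSecretExperiment",
                    "tracking_uri": "http://tracking_uri:8080",
                },
            },
        ],
    },
}

engine, model, datamodule = Engine.from_config("config.yaml", **mlflow_configuration)
engine.fit(model=model, datamodule=datamodule)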

graeb avatar Sep 17 '24 12:09 graeb

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 30 '25 05:06 github-actions[bot]

This issue was closed because it has been stalled for 14 days with no activity.

github-actions[bot] avatar Jul 14 '25 05:07 github-actions[bot]