[Bug]: Unable to initialize AnomalibMlflowLogger inside engine.from_config
Describe the bug
I'm trying to initialize MLflow with the from_config function of the Engine class. For this purpose, I'm calling the function like this:
engine, model, datamodule = Engine.from_config(config_path, **mlflow_logger)
The mlflow_logger variable looks like this:
mlflow_logger = {
    'trainer': {
        'logger': {
            'class_path': 'anomalib.loggers.mlflow.AnomalibMLFlowLogger',
            'init_args': {
                'experiment_name': custom_config["experiment_tracking"]["experiment_name"],
                'run_name': run_name,
                'tracking_uri': mlflow_tracking_uri,
                'log_model': True,
                'run_id': run.info.run_id,
            },
        },
    },
}
The problem is that I get an assertion error from the Lightning side:
File "venv2\Lib\site-packages\lightning\pytorch\cli.py", line 252, in setup
    assert log_dir is not None
           ^^^^^^^^^^^^^^^^^^^
I have already tried Lightning 2.1, 2.2, and 2.3, and they all give me the same assertion error.
I get the error when I call engine.fit, not before, and it happens with both the MLFlowLogger from Lightning and the AnomalibMLFlowLogger.
Dataset
Folder
Model
Other (please specify in the field below)
Steps to reproduce the behavior
- Initialize the mlflow_configuration dictionary
- Initialize the engine, datamodule, and model with Engine.from_config(path, **mlflow_configuration)
- Call the engine.fit function (a consolidated sketch follows this list)
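Roughly, the three steps correspond to this script (the config path, experiment name, and tracking URI are placeholders, not my real setup):

from anomalib.engine import Engine

# Step 1: the override dict that routes a logger into the trainer config.
mlflow_configuration = {
    "trainer": {
        "logger": {
            "class_path": "anomalib.loggers.mlflow.AnomalibMLFlowLogger",
            "init_args": {
                "experiment_name": "my_experiment",
                "tracking_uri": "http://localhost:5000",
            },
        },
    },
}

# Step 2: build engine, model, and datamodule from the YAML config below.
engine, model, datamodule = Engine.from_config("config.yaml", **mlflow_configuration)

# Step 3: the AssertionError is raised inside this call, during the setup hook.
engine.fit(model=model, datamodule=datamodule)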
OS information
- OS: Windows 11
- Python version: 3.11
- Anomalib version: 1.1.0
- PyTorch version: 2.2.2
Expected behavior
The MLFlowLogger gets passed to the Engine correctly, and the fit method uses that logger for logging.
Screenshots
No response
Pip/GitHub
pip
What version/branch did you use?
No response
Configuration YAML
data:
  class_path: anomalib.data.Folder
  init_args:
    name: name
    root: some_folder
    normal_dir: some folder
    abnormal_dir:
      - some category
    normal_test_dir: null
    mask_dir: masks
    normal_split_ratio: 0
    extensions: null
    train_batch_size: 1
    eval_batch_size: 1
    num_workers: 2
    image_size:
      - 256
      - 256
    transform: null
    train_transform: null
    eval_transform: null
    test_split_mode: from_dir
    test_split_ratio: 0.3
    val_split_mode: same_as_test
    val_split_ratio: 0.3
    seed: 42
model:
  class_path: anomalib.models.EfficientAd
  init_args:
    imagenet_dir: datasets/imagenette
    teacher_out_channels: 384
    model_size: S
    lr: 0.0001
    weight_decay: 0.00001
    padding: false
    pad_maps: true
metrics:
  image:
    - AUROC
    - AUPR
  pixel: null
  threshold:
    class_path: anomalib.metrics.F1AdaptiveThreshold
    init_args:
      default_value: 0.5
      thresholds: null
      ignore_index: null
      validate_args: true
      compute_on_cpu: false
      dist_sync_on_step: false
      sync_on_compute: true
      compute_with_cache: true
logging:
  log_graph: true
trainer:
  accelerator: auto
  strategy: auto
  devices: 1
  num_nodes: 1
  precision: 32
  fast_dev_run: false
  max_epochs: 2
  min_epochs: null
  max_steps: 70000
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  overfit_batches: 0.0
  val_check_interval: 1.0
  check_val_every_n_epoch: 1
  num_sanity_val_steps: 0
  log_every_n_steps: 50
  enable_checkpointing: true
  enable_progress_bar: true
  enable_model_summary: true
  accumulate_grad_batches: 1
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  deterministic: false
  benchmark: null
  inference_mode: true
  use_distributed_sampler: true
  profiler: null
  detect_anomaly: false
  barebones: false
  plugins: null
  sync_batchnorm: false
  reload_dataloaders_every_n_epochs: 0
normalization:
  normalization_method: min_max
task: segmentation
default_root_dir: results
Logs
File "train_anomalib.py", line 91, in train
engine.fit(model=model, datamodule=datamodule)
File "venv2\Lib\site-packages\anomalib\engine\engine.py", line 540, in fit
self.trainer.fit(model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
File "venv2\Lib\site-packages\lightning\pytorch\trainer\trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "venv2\Lib\site-packages\lightning\pytorch\trainer\call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "venv2\Lib\site-packages\lightning\pytorch\trainer\trainer.py", line 579, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "venv2\Lib\site-packages\lightning\pytorch\trainer\trainer.py", line 948, in _run
call._call_setup_hook(self) # allow user to set up LightningModule in accelerator environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "venv2\Lib\site-packages\lightning\pytorch\trainer\call.py", line 95, in _call_setup_hook
_call_callback_hooks(trainer, "setup", stage=fn)
File "venv2\Lib\site-packages\lightning\pytorch\trainer\call.py", line 210, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "venv2\Lib\site-packages\lightning\pytorch\cli.py", line 252, in setup
assert log_dir is not None
^^^^^^^^^^^^^^^^^^^
AssertionError
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
@harimkang, any thoughts?
Let me check!
https://github.com/Lightning-AI/pytorch-lightning/blob/f91349c961103af48091654775248789b6e03bd1/src/lightning/pytorch/trainer/trainer.py#L1213-L1234
I think the log_dir in lightning.Trainer expects the logger to have a log_dir, but the AnomalibMLFlowLogger has save_dir instead of log_dir.
So I think the logger itself is the problem; could you please confirm? @ashwinvaidya17
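For reference, a rough sketch of how that log_dir resolution works, paraphrased from the linked source (not the verbatim implementation):

from typing import Optional

from lightning.pytorch.loggers import CSVLogger, Logger, TensorBoardLogger


def resolve_log_dir(loggers: list[Logger], default_root_dir: str) -> Optional[str]:
    """Paraphrase of the linked Trainer.log_dir property, not the real code."""
    if loggers:
        first = loggers[0]
        # TensorBoard/CSV loggers expose a log_dir; every other logger,
        # including the MLflow ones, is asked for save_dir instead.
        if isinstance(first, (TensorBoardLogger, CSVLogger)):
            return first.log_dir
        return first.save_dir
    return default_root_dir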
The base class uses save_dir; that's why AnomalibMLFlowLogger has that parameter. I am not sure whether that is the cause of the error.
@jrusbo I am not able to reproduce your error. Here are my details; the only difference is that I am using the MVTec dataset.
From your YAML file I can also see that compute_with_cache: true is a new parameter.
anomalib: main branch
pytorch-lightning==2.2.0.post0
torch==2.1.2+cu118
torchaudio==2.2.1+cu118
torchmetrics==0.10.3
torchvision==0.16.2+cu118
from anomalib.engine import Engine

if __name__ == "__main__":
    mlflow_configuration = {
        "trainer": {
            "logger": {
                "class_path": "anomalib.loggers.mlflow.AnomalibMLFlowLogger",
                "init_args": {
                    "experiment_name": "test_experiment",
                    "run_name": "test_run",
                    "log_model": True,
                },
            },
        },
    }
    engine, model, datamodule = Engine.from_config("./nbs/bug_2209.yaml", **mlflow_configuration)
    engine.fit(model, datamodule)
YAML
# anomalib==1.2.0dev
model:
  class_path: anomalib.models.EfficientAd
  init_args:
    imagenet_dir: datasets/imagenette
    teacher_out_channels: 384
    model_size: S
    lr: 0.0001
    weight_decay: 0.00001
    padding: false
    pad_maps: true
metrics:
  image:
    - AUROC
    - AUPR
  pixel: null
  threshold:
    class_path: anomalib.metrics.F1AdaptiveThreshold
    init_args:
      default_value: 0.5
      thresholds: null
      ignore_index: null
      validate_args: true
      compute_on_cpu: false
      dist_sync_on_step: false
      sync_on_compute: true
logging:
  log_graph: true
trainer:
  accelerator: auto
  strategy: auto
  devices: 1
  num_nodes: 1
  precision: 32
  fast_dev_run: false
  max_epochs: 2
  min_epochs: null
  max_steps: 70000
  min_steps: null
  max_time: null
  limit_train_batches: 1.0
  limit_val_batches: 1.0
  limit_test_batches: 1.0
  limit_predict_batches: 1.0
  overfit_batches: 0.0
  val_check_interval: 1.0
  check_val_every_n_epoch: 1
  num_sanity_val_steps: 0
  log_every_n_steps: 50
  enable_checkpointing: true
  enable_progress_bar: true
  enable_model_summary: true
  accumulate_grad_batches: 1
  gradient_clip_val: 0
  gradient_clip_algorithm: norm
  deterministic: false
  benchmark: null
  inference_mode: true
  use_distributed_sampler: true
  profiler: null
  detect_anomaly: false
  barebones: false
  plugins: null
  sync_batchnorm: false
  reload_dataloaders_every_n_epochs: 0
normalization:
  normalization_method: min_max
task: segmentation
default_root_dir: results
data:
  class_path: anomalib.data.MVTec
  init_args:
    root: datasets/MVTec
    category: bottle
    train_batch_size: 1
    eval_batch_size: 1
    num_workers: 8
    image_size: null
    transform: null
    train_transform: null
    eval_transform: null
    test_split_mode: FROM_DIR
    test_split_ratio: 0.2
    val_split_mode: SAME_AS_TEST
    val_split_ratio: 0.5
    seed: null
@ashwinvaidya17 the code you provided also works for me. I was able to confirm that the problem only appears when I put a tracking_uri in the mlflow_configuration. Maybe it is an MLflow problem? Could you check whether you get the error when you put a tracking_uri inside the mlflow_configuration?
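Concretely, the only change that makes my configuration fail is adding tracking_uri to init_args (the URI below is a placeholder for my server):

mlflow_configuration = {
    "trainer": {
        "logger": {
            "class_path": "anomalib.loggers.mlflow.AnomalibMLFlowLogger",
            "init_args": {
                "experiment_name": "test_experiment",
                "run_name": "test_run",
                "log_model": True,
                # With this line present, engine.fit raises the AssertionError;
                # without it, everything works.
                "tracking_uri": "http://localhost:5000",
            },
        },
    },
}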
Hello! I have the same problem when trying to use an MLflow server. I used the code provided by @ashwinvaidya17, adding tracking_uri to the arguments.
@lilruwu It's an issue with the Lightning CLI; you can check this thread for more information -> #16310
Everything works fine as long as you don't specify the tracking_uri argument for MLFlowLogger.
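A minimal sketch of why tracking_uri matters, assuming lightning and mlflow are installed (the URI is a placeholder): MLFlowLogger.save_dir only reports a local path for file: tracking URIs, so with a remote server trainer.log_dir resolves to None and the assert in cli.py fires.

from lightning.pytorch.loggers import MLFlowLogger

# The default tracking URI is a local file store (when MLFLOW_TRACKING_URI
# is unset), so save_dir is a real path.
local_logger = MLFlowLogger(experiment_name="demo")
print(local_logger.save_dir)  # e.g. ./mlruns

# With a remote tracking server there is no local directory to report, so
# save_dir is None -> trainer.log_dir is None -> `assert log_dir is not None`.
remote_logger = MLFlowLogger(experiment_name="demo", tracking_uri="http://localhost:5000")
print(remote_logger.save_dir)  # None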
For now you can just add a TensorBoard logger with some save_dir to your config, like this:
trainer:
  logger:
    - class_path: anomalib.loggers.AnomalibTensorBoardLogger
      init_args:
        save_dir: "tb_logs"
    - class_path: anomalib.loggers.AnomalibMLFlowLogger
      init_args:
        experiment_name: "SuperSecretExperiment"
        tracking_uri: "http://tracking_uri:8080"
This fixed it for me.
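If you go through Engine.from_config like the original report, the same workaround can be sketched as an override dict; this simply mirrors the YAML above, and the experiment name and URI are placeholders:

mlflow_configuration = {
    "trainer": {
        # A list of loggers: the TensorBoard logger comes first, so
        # trainer.log_dir resolves to its log directory instead of the
        # MLflow logger's None save_dir; MLflow still does the tracking.
        "logger": [
            {
                "class_path": "anomalib.loggers.AnomalibTensorBoardLogger",
                "init_args": {"save_dir": "tb_logs"},
            },
            {
                "class_path": "anomalib.loggers.AnomalibMLFlowLogger",
                "init_args": {
                    "experiment_name": "SuperSecretExperiment",
                    "tracking_uri": "http://tracking_uri:8080",
                },
            },
        ],
    },
}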
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue was closed because it has been stalled for 14 days with no activity.