
[Task]: How to assign a custom validation set?


What is the motivation for this task?

Currently, the validation set is randomly split from the training set. However, I hope this library can provide an option to assign a folder of normal images (and a folder of abnormal images) as a custom validation set. This would make the results more comparable with other methods/pipelines, since the same fixed training and validation sets could be used.

Describe the solution you'd like

Add options to the config.yaml file that specify 1) whether to use a custom validation set and 2) the folder containing the custom validation set, as sketched below.
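For illustration only, the requested options might look something like this in config.yaml. These keys are hypothetical and do not exist in Anomalib; they merely sketch the feature request:

```yaml
dataset:
  # Hypothetical keys sketching the feature request; not part of Anomalib.
  use_custom_val_set: true                # whether to use a custom validation set
  val_normal_dir: ./data/val/normal       # folder with normal validation images
  val_abnormal_dir: ./data/val/abnormal   # folder with abnormal validation images
```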

Additional context

Please let me know if this functionality is already provided and I missed it. Thanks!

cuicathy avatar May 03 '23 07:05 cuicathy

Hi, Anomalib's Folder dataset format might suit your use case. The directory structure of the Folder dataset format is configured using the following keys:

- `normal_dir`: the file system location of your normal images. These go to the training set.
- `abnormal_dir`: the file system location of your abnormal images. These go to the test set.
- `normal_test_dir`: the file system location of normal images reserved for testing. These are added to the test set.

When no value is specified for `normal_test_dir`, Anomalib will automatically sample some images from the training set and move them to the test set. This is necessary because we always need at least a few images from both classes (normal and anomalous) to compute the evaluation metrics. The number of images sampled from the training set is determined by the `test_split_ratio` parameter.
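As a rough sketch (key names as described above; the exact config layout varies between Anomalib versions, and all paths are placeholders), the dataset section of config.yaml could look like:

```yaml
dataset:
  format: folder
  name: my_dataset                # placeholder
  root: ./datasets/my_dataset     # placeholder
  normal_dir: normal              # normal images: assigned to the training set
  abnormal_dir: abnormal          # abnormal images: assigned to the test set
  normal_test_dir: normal_test    # optional: normal images reserved for the test set
  test_split_ratio: 0.2           # only used when normal_test_dir is not provided
```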

Another relevant parameter is `val_split_mode`, which determines how the validation set is obtained relative to the test set. When the `from_test` option is chosen, the validation set is sampled from the test set; the number of samples is determined by the `val_split_ratio` parameter. When `val_split_mode` is set to `same_as_test`, the validation set consists of the same samples as the test set.
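Continuing the sketch above (with the same caveats about version-specific layout):

```yaml
dataset:
  val_split_mode: from_test   # or: same_as_test
  val_split_ratio: 0.5        # fraction of test images moved to validation (from_test only)
```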

More information about the Folder dataset format can be found on this page in our documentation. Please let us know if anything is still unclear or if you have any additional questions.

djdameln avatar May 03 '23 07:05 djdameln

Thanks for your reply, but it seems that we still cannot specify a particular folder for the validation set. Say I already have a train-val-test data split saved in three different paths. How can I assign them to the train, val, and test sets in the YAML file? Thank you!

cuicathy avatar May 03 '23 17:05 cuicathy

+1 I'd love to be able to provide a path for my validation set (or test set); it's been asked before: https://github.com/openvinotoolkit/anomalib/discussions/723#discussioncomment-4192237

ugotsoul avatar May 03 '23 20:05 ugotsoul

I am kind of confused about how the validation set and test set are used here. For example, in a classification task where I provide a normal image folder and an abnormal image folder, if I have `test_split_ratio: 0.2`, `val_split_mode: same_as_test`, and `val_split_ratio: 0.5`, does that mean 20% of the normal images plus all the abnormal images are used for the test/validation set, and 50% of the test/validation set is used for validation? And is the test set used in `engine.test` to give a test result, while the validation set is used in `engine.fit` to give a metric after training is finished? Is it true that the validation process does not guide any hyper-parameter optimization, and simply indicates how the training goes?

Yiiipu avatar Feb 12 '24 01:02 Yiiipu

@djdameln

samet-akcay avatar Feb 13 '24 18:02 samet-akcay

> I am kind of confused about how the validation set and test set are used here. For example, in a classification task where I provide a normal image folder and an abnormal image folder, if I have `test_split_ratio: 0.2`, `val_split_mode: same_as_test`, and `val_split_ratio: 0.5`, does that mean 20% of the normal images plus all the abnormal images are used for the test/validation set, and 50% of the test/validation set is used for validation? And is the test set used in `engine.test` to give a test result, while the validation set is used in `engine.fit` to give a metric after training is finished? Is it true that the validation process does not guide any hyper-parameter optimization, and simply indicates how the training goes?

I have been using Anomalib for a while now, but I am still confused about how validation and testing are used here.

enricobv avatar Mar 28 '24 08:03 enricobv

> I am kind of confused about how the validation set and test set are used here. For example, in a classification task where I provide a normal image folder and an abnormal image folder, if I have `test_split_ratio: 0.2`, `val_split_mode: same_as_test`, and `val_split_ratio: 0.5`, does that mean 20% of the normal images plus all the abnormal images are used for the test/validation set, and 50% of the test/validation set is used for validation?

In anomaly detection datasets, abnormal images are only used for validation/evaluation, never for training. So when both a folder of normal images and a folder of abnormal images are provided, the normal images are initially assigned to the training set and the abnormal images are assigned to the test set. However, this would leave us with a test set without any normal images, which would prevent us from computing any meaningful evaluation metrics. To solve this, we move some normal images from the training set to the test set. The number of normal images moved from train to test is determined by the `test_split_ratio` (0.2 means that 20% of the normal images in the training set are moved to the test set). Note that you could alternatively pass a folder of normal images for testing using the `normal_test_dir` parameter, in which case `test_split_ratio` would be ignored.
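To make this concrete with made-up numbers: given 100 normal images, 30 abnormal images, and `test_split_ratio: 0.2`, 20 normal images are moved from the training set to the test set, leaving a training set of 80 normal images and a test set of 20 normal + 30 abnormal images.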

There are several possibilities for obtaining a validation set. The default option (`val_split_mode: same_as_test`) simply re-uses the test set as the validation set. This is the default because most public anomaly detection benchmark datasets, such as MVTec and VisA, do not provide a separate validation set. In this case, the `val_split_ratio` parameter is ignored (no splitting is applied to obtain the validation set). When `val_split_mode` is set to `from_test`, we randomly sample images from the test set and move them to the validation set. The `val_split_ratio` parameter determines the number of sampled images.

> And is the test set used in `engine.test` to give a test result, while the validation set is used in `engine.fit` to give a metric after training is finished? Is it true that the validation process does not guide any hyper-parameter optimization, and simply indicates how the training goes?

The validation set may be used for multiple purposes. Most importantly, it is used to compute the normalization statistics and the adaptive threshold value after the model has been fitted. This is necessary because anomaly detection models generally predict an (image-level and/or pixel-level) anomaly score instead of a hard class label.

Normalization: The raw anomaly scores predicted by the model are unbounded, and the expected range of values may differ depending on the chosen model. To make the anomaly scores more interpretable, we normalize the values to the [0, 1] range. For this we use min-max normalization based on the lowest and highest anomaly scores observed in the validation set.
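In other words, each raw score `s` is mapped to `(s - s_min) / (s_max - s_min)`, where `s_min` and `s_max` are the lowest and highest anomaly scores observed on the validation set.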

Thresholding: To convert the raw anomaly scores to a normal vs. anomalous class label, we need to apply a threshold. In Anomalib, we adaptively compute this threshold as the value that maximizes the F1 score over the validation set.
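Conceptually (a simplification, not the exact implementation), the adaptive threshold is `t* = argmax_t F1(y_true, s >= t)`, where `s` are the validation anomaly scores and `y_true` are the ground-truth labels.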

In addition, some models use an early stopping mechanism conditioned on an evaluation metric computed over the validation set.

djdameln avatar Mar 28 '24 10:03 djdameln