📋 [TASK] Implement Multi-GPU Training Support
Implement Multi-GPU Support in Anomalib
Depends on:
- [x] https://github.com/openvinotoolkit/anomalib/issues/2257
- [ ] https://github.com/openvinotoolkit/anomalib/issues/2365
- [ ] https://github.com/openvinotoolkit/anomalib/issues/2366
- [ ] https://github.com/openvinotoolkit/anomalib/issues/2367
- [ ] https://github.com/openvinotoolkit/anomalib/issues/2368
- [ ] https://github.com/openvinotoolkit/anomalib/issues/2369
Background
Anomalib currently uses PyTorch Lightning under the hood, which provides built-in support for multi-GPU training. However, Anomalib itself does not yet expose this functionality to users. Implementing multi-GPU support would significantly enhance the library's capabilities, allowing for faster training on larger datasets and more complex models.
Proposed Feature
Enable multi-GPU support in Anomalib, allowing users to easily utilize multiple GPUs for training without changing their existing code structure significantly.
Example Usage
Users should be able to enable multi-GPU training by simply specifying the number of devices in the Engine configuration:
```python
from anomalib.data import MVTec
from anomalib.engine import Engine
from anomalib.models import EfficientAd

datamodule = MVTec(train_batch_size=1)
model = EfficientAd()
engine = Engine(max_epochs=1, accelerator="gpu", devices=2)

# Launch training; the Engine should distribute it across the configured devices.
engine.fit(model=model, datamodule=datamodule)
```
This configuration should automatically distribute the training across two GPUs.
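For context, the Engine wraps a PyTorch Lightning Trainer, so the arguments above would presumably be forwarded to Lightning, which already implements the actual device placement. A rough, non-authoritative sketch of the equivalent raw Lightning configuration (the exact forwarding inside Engine is an implementation detail):

```python
from lightning.pytorch import Trainer

# Illustrative sketch only: roughly what the Engine arguments above would map to
# in plain PyTorch Lightning. "ddp" runs one process per GPU and is the usual
# multi-device strategy in recent Lightning releases.
trainer = Trainer(max_epochs=1, accelerator="gpu", devices=2, strategy="ddp")
# trainer.fit(model, datamodule=datamodule)
```

Whether a strategy option should also be exposed on the Engine ties into the discussion points further down.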
Implementation Goals
- Seamless integration with existing Anomalib APIs
- Minimal code changes required from users to enable multi-GPU training
- Proper utilization of PyTorch Lightning's multi-GPU capabilities
- Consistent performance improvements when using multiple GPUs
Implementation Steps
- Review PyTorch Lightning's multi-GPU implementation and best practices
- Modify the `Engine` class to properly handle multi-GPU configurations
- Ensure all Anomalib models are compatible with distributed training
- Update data loading mechanisms to work efficiently with multiple GPUs
- Implement proper synchronization of metrics and logging across devices (see the sketch after this list)
- Add multi-GPU tests to the test suite
- Update documentation with multi-GPU usage instructions and best practices
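For the metric and logging synchronization step, here is a minimal, hypothetical sketch of what distribution-safe logging could look like in a LightningModule; the class, layer, and metric below are illustrative placeholders, not Anomalib internals:

```python
import torch
from lightning.pytorch import LightningModule
from torchmetrics.classification import BinaryAUROC


class ToyAnomalyModule(LightningModule):
    """Toy module (not an Anomalib class) illustrating distribution-safe logging."""

    def __init__(self) -> None:
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)
        # torchmetrics objects keep per-process state and sync it across ranks on compute().
        self.val_auroc = BinaryAUROC()

    def validation_step(self, batch, batch_idx):
        images, labels = batch
        scores = torch.sigmoid(self.layer(images)).squeeze(-1)
        loss = torch.nn.functional.binary_cross_entropy(scores, labels.float())
        # sync_dist=True reduces the scalar across all DDP processes before it is logged.
        self.log("val_loss", loss, sync_dist=True)
        self.val_auroc.update(scores, labels)

    def on_validation_epoch_end(self):
        # compute() gathers metric state from every process, so the AUROC is global.
        self.log("val_auroc", self.val_auroc.compute())
        self.val_auroc.reset()
```

Relying on torchmetrics for the cross-process reduction is likely the lowest-friction option, since the objects handle gathering themselves.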
Potential Challenges
- Ensuring all models in Anomalib are compatible with distributed training
- Handling model-specific operations that may not be distribution-friendly
- Managing different GPU memory capacities and load balancing
- Debugging training issues specific to multi-GPU setups
Discussion Points
- Should we support different distributed training strategies (DP, DDP, etc.)?
- How do we ensure reproducibility across single and multi-GPU training?
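Both discussion points map onto existing Lightning switches, so a first pass could simply expose them. A hedged sketch, assuming the Engine forwards these options to the underlying Trainer unchanged:

```python
from lightning.pytorch import Trainer, seed_everything
from lightning.pytorch.strategies import DDPStrategy

# Seed Python, NumPy, and torch identically on every rank;
# workers=True also seeds dataloader worker processes.
seed_everything(42, workers=True)

trainer = Trainer(
    accelerator="gpu",
    devices=2,
    # DDP (one process per GPU) is generally preferred over the legacy DP strategy;
    # passing a strategy object exposes knobs such as find_unused_parameters.
    strategy=DDPStrategy(find_unused_parameters=False),
    deterministic=True,  # request deterministic kernels where available
)
```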
Next Steps
- [ ] Conduct a thorough review of PyTorch Lightning's multi-GPU capabilities
- [ ] Create a detailed technical design document for the implementation
- [ ] Implement a prototype with a single model and test performance gains
- [ ] Discuss potential impacts on existing features and user workflows
- [ ] Plan for gradual rollout, starting with a subset of models
Additional Considerations
- Performance benchmarking: single GPU vs multi-GPU for various models and datasets
- Impact on memory usage and potential optimizations
- Handling of model checkpointing and resuming training in multi-GPU setups
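On checkpointing and resuming specifically, Lightning already restricts checkpoint writing to rank 0, so a multi-GPU run yields a single checkpoint file. A small illustrative sketch (the callback settings are arbitrary examples, not proposed Anomalib defaults):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint

# Lightning writes checkpoints from rank 0 only, so a multi-GPU run produces a
# single checkpoint file that can be resumed later, on any device count.
checkpoint_cb = ModelCheckpoint(monitor="val_auroc", mode="max", save_last=True)
trainer = Trainer(accelerator="gpu", devices=2, callbacks=[checkpoint_cb])
# trainer.fit(model, datamodule=datamodule)                     # initial run
# trainer.fit(model, datamodule=datamodule, ckpt_path="last")   # resume from the last checkpoint
```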
We welcome input from the community on this feature. Please share your thoughts, concerns, or suggestions regarding the implementation of multi-GPU support in Anomalib.
Hey guys, this is arguably one of the most important missing features in Anomalib. Do you have any idea when v1.2 with multi-GPU training will be released?
Hi @haimat, I agree with you, but to enable multi-GPU training we had to go through a number of refactors here and there. You can check the PRs made against the feature/design-simplifications branch.
What is left to enable multi-GPU training is the metric refactor and the visualization refactor, which we are currently working on.
That sounds great, thanks for the update. Do you have an estimate of when you might be ready with this whole change?
@samet-akcay Hello, do you have any idea when this might be released?
@haimat, we figured out that this requires quite a few changes within AnomalyModule. The required changes unfortunately break backwards compatibility, which is why we decided to release this as part of v2.0. We are currently working on it on the feature/design-simplifications branch, which will be released as v2.0.0.
@samet-akcay Thanks for the update. Do you have an estimate of when you plan to release version 2.0?
We aim to release it by the end of this quarter.
This has now been implemented here https://github.com/openvinotoolkit/anomalib/pull/2435