Add --validate-images CLI option to filter corrupt images using PIL
Adds a new --validate-images CLI option that enables PIL-based image validation to filter out corrupt or invalid image files during processing.
Problem
When processing image datasets, users may encounter corrupt or invalid image files that cause processing to fail. Currently, zamba only checks whether files exist and have non-zero size; it does not verify that they can actually be opened and decoded as images.
Solution
This PR adds a new CLI option --validate-images that:
- Attempts to open each image file with PIL (Python Imaging Library)
- Filters out images that cannot be opened or decoded
- Logs appropriate warning messages about filtered files
- Continues processing with only valid images
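The steps above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: the helper names `is_valid_image` and `filter_valid_images` and the warning text are hypothetical, and the sketch assumes Pillow is installed. Because `Image.open` reads lazily, the sketch calls `load()` to force a full decode.

```python
import logging

from PIL import Image

logger = logging.getLogger(__name__)


def is_valid_image(path) -> bool:
    """Return True if PIL can open and fully decode the file (hypothetical helper)."""
    try:
        # Image.open only reads the header; load() forces the pixel data to decode.
        with Image.open(path) as img:
            img.load()
        return True
    except Exception:
        # Deliberately broad: missing files, unknown formats, and truncated
        # data all count as "invalid" for filtering purposes.
        return False


def filter_valid_images(paths):
    """Keep only the paths that pass is_valid_image, warning about the rest."""
    paths = list(paths)
    valid = [p for p in paths if is_valid_image(p)]
    n_dropped = len(paths) - len(valid)
    if n_dropped:
        # Illustrative message only; not the wording zamba emits.
        logger.warning("Ignoring %d file(s) that PIL could not open", n_dropped)
    return valid
```

Catching every exception is a choice made for this sketch so that a single bad file never aborts the run; a stricter version could catch only `OSError` and the decoder errors it expects.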
Usage
Command Line Interface
For image prediction:
zamba image predict --data-dir /path/to/images --validate-images
For image training:
zamba image train --data-dir /path/to/images --labels /path/to/labels.csv --validate-images
Python API
from zamba.images.config import ImageClassificationPredictConfig

config = ImageClassificationPredictConfig(
    data_dir="/path/to/images",
    validate_images=True
)
Implementation Details
- Backward Compatible: Feature is disabled by default (validate_images=False)
- Comprehensive Logging: Distinguishes between file existence failures and PIL validation failures
- Efficient Processing: Uses parallel processing for training validation
- Robust Error Handling: Gracefully handles all PIL-related exceptions
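One way to picture how existence failures can be kept separate from PIL validation failures while the checks run in parallel is the sketch below. It is an assumption-laden illustration, not the PR's implementation: `classify_file` and `classify_files` are hypothetical names, a thread pool is only one possible parallel mechanism, and `can_open` stands in for whatever PIL-based check is used.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable


def classify_file(path, can_open: Callable[[Path], bool]) -> str:
    """Label one file as 'missing_or_empty', 'unreadable', or 'ok' (hypothetical helper)."""
    p = Path(path)
    # Cheap checks first: the file must exist and have non-zero size.
    if not p.is_file() or p.stat().st_size == 0:
        return "missing_or_empty"
    # Only files that pass the cheap checks reach the more expensive image check.
    if not can_open(p):
        return "unreadable"
    return "ok"


def classify_files(
    paths: Iterable, can_open: Callable[[Path], bool], max_workers: int = 4
) -> Dict[str, str]:
    """Classify many files concurrently; returns {path string: status}."""
    paths = [str(p) for p in paths]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so statuses line up with paths.
        statuses = list(pool.map(lambda p: classify_file(p, can_open), paths))
    return dict(zip(paths, statuses))
```

Keeping the two failure kinds as separate labels lets the caller log one message for files that are missing or empty and another for files that exist but cannot be opened.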
Changes Made
- CLI Enhancement: Added --validate-images option to both predict and train commands
- Configuration: Added validate_images: bool = False parameter to both config classes
- Validation Logic: Implemented _validate_filepath_with_pil() function using PIL
- Integration: Enhanced existing validation methods to use PIL when enabled
- Logging: Added specific messages for PIL validation failures
- Tests: Comprehensive test suite covering all functionality
- Documentation: Added detailed usage examples and documentation
Example Output
With validation enabled, users will see:
INFO | Validating image files exist and can be opened with PIL
WARNING | 2 files in provided labels file do not exist on disk or cannot be opened with PIL; ignoring those files. Example: ['corrupt_image.jpg', 'invalid_file.jpg']...
This feature is particularly useful when working with datasets from external sources or when data integrity is uncertain.