Add --validate-images CLI option to filter corrupt images using PIL

Open Copilot opened this issue 7 months ago • 0 comments

Adds a new --validate-images CLI option that enables PIL-based image validation to filter out corrupt or invalid image files during processing.

Problem

When processing image datasets, users may encounter corrupt or invalid image files that cause processing to fail. Currently, zamba only checks that files exist and have non-zero size; it does not verify that they can actually be opened and decoded as images.

Solution

This PR adds a new CLI option --validate-images that:

  • Attempts to open each image file with PIL (Python Imaging Library)
  • Filters out images that cannot be opened or decoded
  • Logs a warning that reports how many files were filtered and lists example filenames
  • Continues processing with only valid images
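
The check described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names can_open_with_pil and filter_valid_images are hypothetical, and the choice to call img.load() (a full decode) rather than only reading the header is an assumption.

```python
from PIL import Image


def can_open_with_pil(path):
    """Return True if PIL can open and fully decode the file at `path`.

    Hypothetical helper for illustration; not zamba's actual implementation.
    """
    try:
        with Image.open(path) as img:
            # Image.open() is lazy and mostly reads the header;
            # load() forces the pixel data to be decoded.
            img.load()
        return True
    except (OSError, SyntaxError, ValueError):
        # PIL.UnidentifiedImageError and FileNotFoundError are both
        # subclasses of OSError, so they are covered here.
        return False


def filter_valid_images(paths):
    """Split `paths` into (valid, rejected) lists, preserving order."""
    valid, rejected = [], []
    for path in paths:
        (valid if can_open_with_pil(path) else rejected).append(path)
    return valid, rejected
```

A full decode is slower than a header-only check but also catches files whose header is intact and whose pixel data is not.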

Usage

Command Line Interface

For image prediction:

zamba image predict --data-dir /path/to/images --validate-images

For image training:

zamba image train --data-dir /path/to/images --labels /path/to/labels.csv --validate-images

Python API

from zamba.images.config import ImageClassificationPredictConfig

config = ImageClassificationPredictConfig(
    data_dir="/path/to/images",
    validate_images=True
)

Implementation Details

  • Backward Compatible: Feature is disabled by default (validate_images=False)
  • Comprehensive Logging: Distinguishes between file existence failures and PIL validation failures
  • Efficient Processing: Uses parallel processing for training validation
  • Robust Error Handling: Treats exceptions raised by PIL while opening an image as validation failures instead of aborting the run
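
One way the parallel check and the two failure categories could fit together is sketched below. The names check_file and partition_files are hypothetical, the use of a thread pool is an assumption (the PR does not say which parallelism mechanism it uses), and the decoder is passed in as a callable so the sketch stays independent of PIL.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def check_file(path, can_decode):
    """Classify one file as 'ok', 'missing_or_empty', or 'undecodable'."""
    p = Path(path)
    # Cheap existence and size checks run first, before any decoding.
    if not p.is_file() or p.stat().st_size == 0:
        return "missing_or_empty"
    return "ok" if can_decode(p) else "undecodable"


def partition_files(paths, can_decode, max_workers=4):
    """Check files in parallel and group them by outcome.

    `can_decode` is any callable taking a Path and returning a bool,
    for example a PIL-based validator (assumed, not part of this sketch).
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so statuses line up with paths.
        statuses = list(pool.map(lambda p: check_file(p, can_decode), paths))
    groups = {"ok": [], "missing_or_empty": [], "undecodable": []}
    for path, status in zip(paths, statuses):
        groups[status].append(path)
    return groups
```

Keeping the groups separate lets the caller log existence failures and decode failures with different messages while processing continues with the files under "ok".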

Changes Made

  1. CLI Enhancement: Added --validate-images option to both predict and train commands
  2. Configuration: Added validate_images: bool = False parameter to both config classes
  3. Validation Logic: Implemented _validate_filepath_with_pil() function using PIL
  4. Integration: Enhanced existing validation methods to use PIL when enabled
  5. Logging: Added specific messages for PIL validation failures
  6. Tests: Added a test suite covering the new validation option
  7. Documentation: Added detailed usage examples and documentation

Example Output

With validation enabled, users will see:

INFO     | Validating image files exist and can be opened with PIL
WARNING  | 2 files in provided labels file do not exist on disk or cannot be opened with PIL; ignoring those files. Example: ['corrupt_image.jpg', 'invalid_file.jpg']...

This feature is particularly useful when working with datasets from external sources or when data integrity is uncertain.


Copilot • Jul 08 '25 20:07