OMRChecker icon indicating copy to clipboard operation
OMRChecker copied to clipboard

refactor: move hardcoded preprocessing values to config and add config loader

Open Hasan-8326 opened this issue 1 month ago • 2 comments

User description

This PR removes hardcoded preprocessing values and replaces them with a centralized configuration system.

✅ Summary of Changes

Introduced a YAML-based configuration file: config/default_config.yaml

Added a new config loader module: config/config_loader.py

Updated core processing modules to use config values instead of hardcoded constants:

src/constants/image_processing.py

src/defaults/config.py

src/processors/CropPage.py

src/processors/CropOnMarkers.py

Added safe fallbacks to ensure the system works even if config keys are missing

Improved flexibility by allowing configuration overrides without modifying code

🧠 Why This Change?

Previously, threshold values, kernel sizes, and other preprocessing parameters were hardcoded across the codebase, making it difficult to:

Tune OMR performance for different scanning qualities

Adjust thresholds for different sheet templates

Adapt the system for diverse lighting or contrast environments

This PR solves that by enabling external config-driven control, improving maintainability and user flexibility.


PR Type

Enhancement, Refactoring


Description

  • Introduced YAML-based configuration system with centralized config loader

  • Migrated hardcoded preprocessing values to external config file

  • Updated image processing modules to use config values with fallbacks

  • Simplified CropPage and CropOnMarkers classes to use config-driven parameters


Diagram Walkthrough

flowchart LR
  A["config/default_config.yaml"] -->|load_config| B["config/config_loader.py"]
  B -->|provides values| C["src/constants/image_processing.py"]
  C -->|used by| D["src/processors/CropPage.py"]
  C -->|used by| E["src/processors/CropOnMarkers.py"]
  B -->|provides values| F["src/defaults/config.py"]

File Walkthrough

Relevant files
Enhancement
config_loader.py
New config loader module for YAML parsing                               

config/config_loader.py

  • New module that loads YAML configuration files
  • Implements safe file existence check with error handling
  • Returns parsed YAML as dictionary for use across codebase
+13/-0   
image_processing.py
Migrate constants to config-driven values                               

src/constants/image_processing.py

  • Integrated config loader to read preprocessing values from YAML
  • Replaced hardcoded threshold values with config-driven values
  • Updated MIN_PAGE_AREA_THRESHOLD, MAX_COSINE_THRESHOLD,
    PAGE_THRESHOLD_PARAMS
  • Updated CANNY_PARAMS, DEFAULT_MAX_FEATURES, DEFAULT_GOOD_MATCH_PERCENT
    with config fallbacks
+18/-9   
Configuration changes
default_config.yaml
New centralized YAML configuration file                                   

config/default_config.yaml

  • New YAML configuration file with preprocessing parameters
  • Defines thresholds, kernel sizes, and image processing values
  • Includes paths, logging, OMR, and CLI configuration sections
  • Replaces hardcoded constants with externally configurable values
+28/-0   
Refactoring
config.py
Refactor defaults config with YAML integration                     

src/defaults/config.py

  • Replaced DotMap-based configuration with dictionary-based approach
  • Integrated config loader to read from YAML file
  • Restructured DEFAULT_CONFIG to include threshold_params, paths, omr,
    logging, and cli sections
  • Added fallback values for all config keys to ensure robustness
+36/-33 
CropOnMarkers.py
Refactor CropOnMarkers with config integration                     

src/processors/CropOnMarkers.py

  • Simplified class structure by removing inheritance from
    ImagePreprocessor
  • Integrated config loader for marker matching and preprocessing
    parameters
  • Refactored to use config-based threshold values and blur kernel
    settings
  • Added new methods for image preprocessing, marker matching, and
    threshold computation
  • Removed complex marker loading and template matching logic
+118/-240
CropPage.py
Refactor CropPage with config-driven parameters                   

src/processors/CropPage.py

  • Simplified class structure by removing inheritance from
    ImagePreprocessor
  • Integrated config loader for threshold and Canny edge detection
    parameters
  • Refactored to use config-based values with fallbacks to constants
  • Added dedicated methods for threshold application, edge detection, and
    contour validation
  • Removed complex image utility dependencies and logging calls
+63/-110

Hasan-8326 avatar Nov 06 '25 16:11 Hasan-8326

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🔴
Unhandled config loading failure

Description: The config is loaded at module import time without error handling, which will cause the
entire application to crash if the config file is missing or malformed, creating a denial
of service vulnerability.
image_processing.py [8-8]

Referred Code
config = load_config()

Module-level config loading crash

Description: The config is loaded at module import time in a processor class without error handling,
causing application crashes if the config file is unavailable during import, leading to
denial of service.
CropOnMarkers.py [5-5]

Referred Code
config = load_config()

Import-time config failure

Description: The config is loaded at module import time without error handling, causing the application
to fail during import if the config file is missing or corrupted, resulting in denial of
service.
CropPage.py [7-7]

Referred Code
config = load_config()

Unsafe YAML deserialization

Description: The YAML file is loaded without any validation or sanitization, allowing arbitrary Python
code execution through YAML deserialization attacks if an attacker can control the config
file contents.
config_loader.py [12-13]

Referred Code
with open(config_path, "r") as file:
    return yaml.safe_load(file)
Ticket Compliance
🎫 No ticket provided
  • [ ] Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Missing YAML error handling: The yaml.safe_load() call does not handle potential YAML parsing errors, which could cause
the application to crash with unclear error messages if the config file is malformed.

Referred Code
with open(config_path, "r") as file:
    return yaml.safe_load(file)
Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status:
Path exposure in error: The FileNotFoundError at line 10 exposes the full file system path in the error message,
which could reveal internal directory structure to end users.

Referred Code
if not os.path.exists(config_path):
    raise FileNotFoundError(f"Config file not found at {config_path}")
Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Missing path validation: The config_path parameter is not validated for path traversal attacks, allowing potential
access to arbitrary files outside the intended config directory.

Referred Code
def load_config(config_path="config/default_config.yaml"):
    """
    Load YAML config file and return it as a dictionary.
    Raises an error if the config file is missing.
    """
    if not os.path.exists(config_path):
        raise FileNotFoundError(f"Config file not found at {config_path}")
Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
High-level
PR scope exceeds configuration refactoring

The PR contains a major, undocumented rewrite of the CropPage and CropOnMarkers
processors. This includes removing their ImagePreprocessor base class, which is
a breaking architectural change that should be in a separate PR.

Examples:

src/processors/CropOnMarkers.py [14-131]
class CropOnMarkers:
    def __init__(self, marker_ops=None):
        # Load config or fallback to defaults
        self.min_matching_threshold = marker_ops.get(
            "min_matching_threshold",
            config["omr"].get("bubble_threshold", 0.3)
        ) if marker_ops else config["omr"].get("bubble_threshold", 0.3)

        self.gaussian_blur_kernel = config["preprocessing"].get(
            "blur_kernel",

 ... (clipped 108 lines)
src/processors/CropPage.py [17-76]
class CropPage:
    def __init__(self):
        # Load values from config with fallback to constants
        self.min_page_area_threshold = config["preprocessing"].get("min_page_area_threshold", MIN_PAGE_AREA_THRESHOLD)
        self.max_cosine_threshold = config["preprocessing"].get("max_cosine_threshold", MAX_COSINE_THRESHOLD)

        self.page_threshold_params = {
            "thresh": config["preprocessing"].get("threshold", PAGE_THRESHOLD_PARAMS["threshold_value"]),
            "maxval": PAGE_THRESHOLD_PARAMS["max_pixel_value"],
        }

 ... (clipped 50 lines)

Solution Walkthrough:

Before:

# src/processors/CropOnMarkers.py
class CropOnMarkers(ImagePreprocessor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        ...

    def apply_filter(self, image, file_path):
        ...
        # Main processing logic is here
        return image

# src/processors/CropPage.py
class CropPage(ImagePreprocessor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        ...

    def apply_filter(self, image, file_path):
        # Main processing logic is here
        return image

After:

# src/processors/CropOnMarkers.py
class CropOnMarkers:  # No longer inherits from ImagePreprocessor
    def __init__(self, marker_ops=None):
        # New initialization logic
        ...

    def process(self, image, reference_markers):
        # Completely new processing logic
        ...
        return { ... }

# src/processors/CropPage.py
class CropPage:  # No longer inherits from ImagePreprocessor
    def __init__(self):
        # New initialization logic
        ...
    # Note: The main processing method like `apply_filter` is removed
    # and not replaced with a clear entry point.

Suggestion importance[1-10]: 10

__

Why: This suggestion correctly identifies a critical issue where the PR's scope is much larger than described, including an undocumented rewrite and a breaking architectural change by removing the ImagePreprocessor base class from CropPage and CropOnMarkers.

High
Possible issue
Fix incorrect contour validation logic

Fix a bug in is_valid_contour by returning False if a contour does not
approximate to a quadrilateral (len(approx) != 4), preventing incorrect
validation.

src/processors/CropPage.py [55-76]

 def is_valid_contour(self, contour):
     """ Validate contour based on config thresholds """
     if cv2.contourArea(contour) < self.min_page_area_threshold:
         return False
 
     # Calculate max cosine for contour
     peri = cv2.arcLength(contour, True)
     approx = cv2.approxPolyDP(contour, 0.025 * peri, True)
 
+    if len(approx) != 4:
+        return False
+
     max_cosine = 0
-    if len(approx) == 4:
-        for i in range(4):
-            p1 = approx[i][0]
-            p2 = approx[(i + 1) % 4][0]
-            p3 = approx[(i + 2) % 4][0]
+    for i in range(4):
+        p1 = approx[i][0]
+        p2 = approx[(i + 1) % 4][0]
+        p3 = approx[(i + 2) % 4][0]
 
-            v1 = p1 - p2
-            v2 = p3 - p2
-            cosine = abs(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
-            max_cosine = max(max_cosine, cosine)
+        v1 = p1 - p2
+        v2 = p3 - p2
+        # Add a small epsilon to avoid division by zero for zero-length vectors
+        norm_v1 = np.linalg.norm(v1)
+        norm_v2 = np.linalg.norm(v2)
+        if norm_v1 == 0 or norm_v2 == 0:
+            continue # or handle as an invalid angle
+        cosine = abs(np.dot(v1, v2) / (norm_v1 * norm_v2))
+        max_cosine = max(max_cosine, cosine)
 
     return max_cosine < self.max_cosine_threshold
  • [ ] Apply / Chat
Suggestion importance[1-10]: 9

__

Why: The suggestion identifies a critical bug where non-quadrilateral contours are incorrectly validated as True, which would lead to incorrect page detection. The proposed fix is correct and essential for the function's logic.

High
Use absolute path for config file

To prevent FileNotFoundError, modify load_config to use an absolute path for the
default configuration file, constructed relative to the script's own location.

config/config_loader.py [4-13]

-def load_config(config_path="config/default_config.yaml"):
+def load_config(config_path=None):
     """
     Load YAML config file and return it as a dictionary.
     Raises an error if the config file is missing.
     """
+    if config_path is None:
+        # Construct path relative to this file's location
+        base_dir = os.path.dirname(os.path.abspath(__file__))
+        config_path = os.path.join(base_dir, "default_config.yaml")
+
     if not os.path.exists(config_path):
         raise FileNotFoundError(f"Config file not found at {config_path}")
 
     with open(config_path, "r") as file:
         return yaml.safe_load(file)
  • [ ] Apply / Chat
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a potential FileNotFoundError due to the use of a relative path and proposes a robust fix by creating an absolute path, significantly improving the function's reliability.

Medium
General
Handle mismatched marker image sizes

Enhance calculate_marker_similarity to handle markers of different sizes by
resizing one image to match the other before comparison, and simplify the score
calculation using np.mean().

src/processors/CropOnMarkers.py [63-74]

 def calculate_marker_similarity(self, marker1, marker2):
     """ Calculate similarity between markers (dummy implementation to be improved) """
     # Convert to grayscale for comparison
     m1 = cv2.cvtColor(marker1, cv2.COLOR_BGR2GRAY) if len(marker1.shape) == 3 else marker1
     m2 = cv2.cvtColor(marker2, cv2.COLOR_BGR2GRAY) if len(marker2.shape) == 3 else marker2
 
     if m1.shape != m2.shape:
-        return 0
+        # Resize m2 to match m1's shape for comparison
+        m2 = cv2.resize(m2, (m1.shape[1], m1.shape[0]), interpolation=cv2.INTER_AREA)
 
     diff = cv2.absdiff(m1, m2)
-    score = 1 - (np.sum(diff) / (m1.size * 255))
+    # The score is 1 minus the average difference per pixel
+    score = 1 - (np.mean(diff) / 255.0)
     return max(0, min(1, score))  # Clamp between 0 and 1
  • [ ] Apply / Chat
Suggestion importance[1-10]: 6

__

Why: The suggestion improves robustness by handling markers of different sizes through resizing, which is a reasonable enhancement. The simplification using np.mean() is a minor but good code quality improvement.

Low
  • [ ] More