
Bad Cross Hatched Redaction fails

Open flooie opened this issue 1 year ago • 2 comments

bad_cross_hatched_redactions.pdf

Also fails to identify bad redactions

flooie, Apr 06 '25 21:04

Looks like your attachment didn't work?

mlissner, Apr 07 '25 19:04

X‑ray appears to only recognize redactions when the overlapping area is a solid color—specifically, solid black, white, or (I guess) red. I noticed this over the weekend while inspecting redacted PDFs. The following screenshot shows an example:

Image

This behavior occurs because our current system uses a strict unicolor test (e.g. via pixmap.is_unicolor) to decide whether an area is redacted. For example, our tests correctly identify 639 characters in bad_cross_hatched_redactions.pdf as being obscured, yet when we run a full redaction inspection via xray.inspect, no redactions are flagged. Cross-hatch patterns fail the strict unicolor test even though the text beneath them is clearly redacted, which leads to false negatives.
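To make the failure mode concrete, here is a minimal sketch of a strict unicolor check on raw sample bytes. `is_strictly_unicolor` is a hypothetical stand-in written for illustration, not PyMuPDF's implementation of `is_unicolor`, and the pixel values are made up:

```python
def is_strictly_unicolor(samples: bytes, n: int) -> bool:
    """Return True only if every pixel has exactly the same channel values."""
    first = samples[:n]
    return all(samples[i:i + n] == first for i in range(0, len(samples), n))


# Hypothetical 4x1 RGB regions, written as raw sample bytes.
solid_black = bytes([0, 0, 0] * 4)
# Dark cross-hatch: black lines alternating with dark-gray gaps.
cross_hatch = bytes([0, 0, 0, 60, 60, 60] * 2)

print(is_strictly_unicolor(solid_black, 3))  # True
print(is_strictly_unicolor(cross_hatch, 3))  # False
```

A single off-color pixel is enough to fail the strict test, which is why a hatched box is treated the same as unredacted content.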

Proposed Improvement:

Instead of relying solely on a strict unicolor check, we should introduce adaptive thresholding for our pixmaps. The idea is to process the pixmap for a given redaction area to convert it into a binary (black/white) image. By doing so, we would detect redactions that use cross-hatch patterns by converting them into a uniform color (e.g., all black). At the same time, legitimate text (like white text on a black background) should remain unaffected.
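As a rough sketch of the idea, the snippet below reduces each pixel to a dark/light class using a fixed brightness cutoff. The `binarize` helper and the sample values are assumptions for illustration, and the cross-hatch only collapses to one class when its lines and gaps both fall on the same side of the threshold:

```python
def binarize(samples: bytes, n: int, threshold: int = 128) -> list[bool]:
    """Map each RGB pixel to True (dark) or False (light) by mean brightness."""
    return [
        (samples[i] + samples[i + 1] + samples[i + 2]) // 3 < threshold
        for i in range(0, len(samples), n)
    ]


# Hypothetical 4x1 RGB regions, written as raw sample bytes.
cross_hatch = bytes([0, 0, 0, 60, 60, 60] * 2)        # black and dark gray
white_on_black = bytes([0, 0, 0, 255, 255, 255] * 2)  # text-like contrast

print(len(set(binarize(cross_hatch, 3))))     # 1
print(len(set(binarize(white_on_black, 3))))  # 2
```

One class after binarization means the region looks uniformly covered; two classes means there is still high-contrast content such as text.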

The goal is to quickly assess if the region contains a mix of colors that would indicate it is not uniformly redacted. Since performance is a primary concern, we could optimize further with NumPy in the future, but for now, keeping the dependencies unchanged, I propose a function that checks the brightness of pixels and stops scanning as soon as it finds both black and white pixels.

def multi_colored(pixmap, threshold=128) -> bool:
    """
    Check if a pixmap contains both black and white pixels.
    
    We interpret a pixel as "dark" if its brightness (the average of R, G, B)
    is below the given threshold, and "light" otherwise. This function iterates 
    over the pixel data and stops as soon as it finds a pixel whose classification 
    differs from the first pixel.
    
    :param pixmap: The fitz.Pixmap to check.
    :param threshold: The brightness threshold (0-255) for classifying pixels.
    :return: True if the pixmap contains both dark and light pixels, False otherwise.
    """
    data = pixmap.samples
    channels = pixmap.n
    if not data:
        return False  # An empty pixmap has no pixels to compare.

    # Average the R, G, B samples, or use the single gray sample when the
    # pixmap has fewer than three channels (gray, or gray plus alpha).
    colors = 3 if channels >= 3 else 1

    # Determine the classification for the first pixel.
    first_is_dark = sum(data[0:colors]) // colors < threshold

    # Iterate over subsequent pixels.
    for i in range(channels, len(data), channels):
        is_dark = sum(data[i:i + colors]) // colors < threshold
        if is_dark != first_is_dark:
            return True  # Mixed colors detected.
    return False

Then all we need to do is switch out

    if not pixmap.is_unicolor:

for

    if multi_colored(pixmap, threshold=128):

This method quickly determines whether the redaction area is likely multi-colored (i.e., contains text) or a solid block, which now includes cross-hatch patterns.

Performance: By stopping as soon as both dark and light pixels are detected, we reduce unnecessary processing. I think it shouldn't be tremendously slower.
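One caveat worth spelling out: the early exit only helps when the region is mixed. A region that really is uniform has to be scanned in full. The sketch below counts how many pixels a first-mismatch scan reads; `pixels_scanned` is a hypothetical helper, it assumes RGB samples, and the inputs are made up:

```python
def pixels_scanned(samples: bytes, n: int, threshold: int = 128) -> int:
    """Count how many pixels a first-mismatch scan reads before it stops."""
    first_is_dark = sum(samples[0:3]) // 3 < threshold
    scanned = 1
    for i in range(n, len(samples), n):
        scanned += 1
        if (sum(samples[i:i + 3]) // 3 < threshold) != first_is_dark:
            break  # Found a pixel on the other side of the threshold.
    return scanned


# Hypothetical 1000-pixel RGB regions.
solid = bytes([0, 0, 0] * 1000)
text_like = bytes([0, 0, 0] * 10 + [255, 255, 255] + [0, 0, 0] * 989)

print(pixels_scanned(solid, 3))      # 1000
print(pixels_scanned(text_like, 3))  # 11
```

So the worst case is exactly the solid-block case we want to flag, which is where a NumPy version might pay off later.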

Accuracy: This approach should bridge the gap between our current system and real-world redaction implementations, potentially allowing us to correctly flag more redactions in our database.

Perhaps we could find a good number more redactions.

@mlissner

flooie, Apr 09 '25 19:04