
BUG DCTDecode Filter throws error

Open aa-dank opened this issue 1 year ago • 4 comments

**Describe the bug**
Loading a scanned PDF through `PDF.loads` with `OCRAsOptionalContentGroup` raises `AssertionError: Unknown /Filter DCTDecode`.

**To Reproduce**
I created a script for OCR-ing PDFs based on your blog. I'm very excited by this library but have yet to get it to work for me. This time it throws the following exception when trying to OCR this PDF file: https://ppc.files.com/f/64bfd2745c276cb6

AssertionError: Unknown /Filter DCTDecode
python-BaseException

This was an issue for PyPDF2 too: https://stackoverflow.com/questions/47730944/error-while-image-extraction-from-pdf-in-python


from pathlib import Path

from borb.toolkit.ocr.ocr_as_optional_content_group import OCRAsOptionalContentGroup
from borb.pdf.pdf import PDF


class OCRMachine:
    def __init__(self, tesseract_data_path):
        self.tesseract_data_dir = Path(tesseract_data_path)

    def build_ocr_pdf_copy(self, input_pdf_path, copy_destination_path):

        # Set up everything for OCR
        assert self.tesseract_data_dir.exists()
        l: OCRAsOptionalContentGroup = OCRAsOptionalContentGroup(self.tesseract_data_dir)

        # Read Document
        doc = None
        with open(input_pdf_path, "rb") as pdf_file_handle:
            doc = PDF.loads(pdf_file_handle, [l])

        assert doc is not None

        # Store Document
        with open(copy_destination_path, "wb") as pdf_file_handle:
            PDF.dumps(pdf_file_handle, doc)

scanned_pdf_example = r"C:\Users\adankert\Downloads\tess_example.pdf"
example_output_path = r"C:\Users\adankert\Downloads\ocr_example.pdf"
ocr_machine = OCRMachine(tesseract_data_path=r"C:\Program Files\Tesseract-OCR\tessdata")
ocr_machine.build_ocr_pdf_copy(input_pdf_path=scanned_pdf_example,
                               copy_destination_path=example_output_path)

**Expected behaviour**


**Screenshots**
```
Traceback (most recent call last):
  File "C:/Users/adankert/Google Drive/GitHub/file_server_management/ocr/ocr_pdf.py", line 21, in build_ocr_pdf_copy
    doc = PDF.loads(pdf_file_handle, [l])
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\pdf\pdf.py", line 54, in loads
    return ReadAnyObjectTransformer().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
    return super().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
    out = h.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\reference\xref_transformer.py", line 140, in transform
    trailer = self.get_root_transformer().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
    return super().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
    out = h.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\object\dictionary_transformer.py", line 48, in transform
    v = self.get_root_transformer().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
    return super().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
    out = h.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\reference\reference_transformer.py", line 105, in transform
    transformed_referenced_object = self.get_root_transformer().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
    return super().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
    out = h.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\page\root_dictionary_transformer.py", line 86, in transform
    transformed_root_dictionary = t.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\object\dictionary_transformer.py", line 48, in transform
    v = self.get_root_transformer().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
    return super().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
    out = h.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\reference\reference_transformer.py", line 105, in transform
    transformed_referenced_object = self.get_root_transformer().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
    return super().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
    out = h.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\object\dictionary_transformer.py", line 48, in transform
    v = self.get_root_transformer().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
    return super().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
    out = h.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\object\array_transformer.py", line 48, in transform
    object_to_transform[i] = self.get_root_transformer().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
    return super().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
    out = h.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\reference\reference_transformer.py", line 105, in transform
    transformed_referenced_object = self.get_root_transformer().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
    return super().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
    out = h.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\page\page_dictionary_transformer.py", line 64, in transform
    v = self.get_root_transformer().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
    return super().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
    out = h.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\object\dictionary_transformer.py", line 48, in transform
    v = self.get_root_transformer().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
    return super().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
    out = h.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\object\dictionary_transformer.py", line 48, in transform
    v = self.get_root_transformer().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
    return super().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
    out = h.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\reference\reference_transformer.py", line 105, in transform
    transformed_referenced_object = self.get_root_transformer().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
    return super().transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
    out = h.transform(
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\image\grayscale_image_transformer.py", line 76, in transform
    x for x in decode_stream(object_to_transform)["DecodedBytes"]
  File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\filter\stream_decode_util.py", line 78, in decode_stream
    assert False, "Unknown /Filter %s" % filter_name
AssertionError: Unknown /Filter DCTDecode
python-BaseException
```

**Desktop (please complete the following information):**
 - OS: Windows 10
 - borb version 2.0.31
 - Python 3.9.11
 - input PDF (if applicable): https://ppc.files.com/f/64bfd2745c276cb6

Aug 02 '22 17:08 aa-dank

veraPDF (online validator) reports the following errors for your PDF:

Specification: ISO 19005-1:2005, Clause: 6.7.11, Test number: 1 The PDF/A version and conformance level of a file shall be specified using the PDF/A Identification extension schema. Failed 1 occurrences

Specification: ISO 19005-1:2005, Clause: 6.2.3, Test number: 4 If an uncalibrated colour space is used in a file then that file shall contain a PDF/A-1 OutputIntent, as defined in 6.2.2 Failed 5 occurrences

Can you provide me with a valid PDF for which your code fails?

Kind regards, Joris Schellekens

Aug 02 '22 20:08 jorisschellekens

What about this one: https://ppc.files.com/f/278550e1316ca54f

I can't tell whether they are valid. (A validity check would be a nice feature for borb, especially if certain functionality cannot handle invalid PDF files.)

Aug 03 '22 17:08 aa-dank

Checking whether a PDF is valid is not an easy feature to implement. Usually, PDF libraries try to accept input even if it isn't entirely valid.

You can use https://demo.verapdf.org/ to verify whether a PDF is valid. I'll give your ticket some more attention this weekend.

Kind regards, Joris Schellekens

Aug 03 '22 18:08 jorisschellekens

The second file (regardless of whether it is valid or not) seems to throw this error because it forces borb to parse a grayscale image using FlateDecode and DCTDecode. When I implemented borb, I thought (mistakenly perhaps) that DCTDecode would only be used with JPEG images.
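To illustrate what a FlateDecode plus DCTDecode chain involves, here is a minimal standard-library sketch. It is not borb's code: `undo_filters` and `JPEG_SOI` are hypothetical names, and it assumes that the bytes left after undoing FlateDecode are ordinary JPEG data that an image library could then decode to pixels.

```python
import zlib

JPEG_SOI = b"\xff\xd8"  # JPEG data begins with this start-of-image marker


def undo_filters(data: bytes, filters: list) -> bytes:
    """Undo a PDF /Filter chain in order and return the JPEG payload.

    Hypothetical helper: only FlateDecode and DCTDecode are handled.
    """
    for name in filters:
        if name == "FlateDecode":
            # FlateDecode is zlib/deflate compression.
            data = zlib.decompress(data)
        elif name == "DCTDecode":
            # Assumption: what remains is JPEG data. Turning it into
            # pixels is left to an image library, so stop here.
            if not data.startswith(JPEG_SOI):
                raise ValueError("DCTDecode data does not start with a JPEG SOI marker")
            return data
        else:
            # An explicit exception, unlike `assert`, survives `python -O`.
            raise NotImplementedError("Unsupported /Filter %s" % name)
    return data
```

The returned bytes could be written to a `.jpg` file or handed to an image library. Whether a grayscale JPEG then needs further handling depends on the consumer.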

This is a known limitation of borb.
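Given that limitation, a rough pre-flight check can flag files that declare DCTDecode before they are handed to the OCR step. The sketch below is a heuristic under stated assumptions, not part of borb: `declared_filters` is a hypothetical helper that scans the raw bytes for `/Filter` entries. It can miss entries written as indirect references and can produce false hits if binary stream data happens to contain the same text.

```python
import re
from collections import Counter

# Matches "/Filter /Name" and "/Filter [/A /B]" in the raw file bytes.
_FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
_NAME_RE = re.compile(rb"/([A-Za-z0-9]+)")


def declared_filters(pdf_bytes: bytes) -> Counter:
    """Count filter names that appear after /Filter keys (heuristic)."""
    counts = Counter()
    for match in _FILTER_RE.finditer(pdf_bytes):
        # group(1) is either a single name or a bracketed array of names.
        for name in _NAME_RE.findall(match.group(1)):
            counts[name.decode("ascii")] += 1
    return counts
```

A caller could read a file with `Path(...).read_bytes()`, pass the result to `declared_filters`, and skip or pre-convert the file when `"DCTDecode"` appears in the returned counts.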

Aug 08 '22 19:08 jorisschellekens

Now I know the limitation.

Jan 06 '23 22:01 aa-dank