borb
BUG DCTDecode Filter throws error
**Describe the bug**

**To Reproduce**
I created a script for OCR-ing PDFs based on your blog. I am very excited by this library but have yet to get it to work for me. This time it throws the following exception when trying to OCR this PDF file: https://ppc.files.com/f/64bfd2745c276cb6
AssertionError: Unknown /Filter DCTDecode
python-BaseException
This was an issue for PyPDF2 too: https://stackoverflow.com/questions/47730944/error-while-image-extraction-from-pdf-in-python
```python
from pathlib import Path

from borb.toolkit.ocr.ocr_as_optional_content_group import OCRAsOptionalContentGroup
from borb.pdf.pdf import PDF


class OCRMachine:
    def __init__(self, tesseract_data_path):
        self.tesseract_data_dir = Path(tesseract_data_path)

    def build_ocr_pdf_copy(self, input_pdf_path, copy_destination_path):
        # Set up everything for OCR
        assert self.tesseract_data_dir.exists()
        l: OCRAsOptionalContentGroup = OCRAsOptionalContentGroup(self.tesseract_data_dir)

        # Read Document
        doc = None
        with open(input_pdf_path, "rb") as pdf_file_handle:
            doc = PDF.loads(pdf_file_handle, [l])
        assert doc is not None

        # Store Document
        with open(copy_destination_path, "wb") as pdf_file_handle:
            PDF.dumps(pdf_file_handle, doc)


scanned_pdf_example = r"C:\Users\adankert\Downloads\tess_example.pdf"
example_output_path = r"C:\Users\adankert\Downloads\ocr_example.pdf"
ocr_machine = OCRMachine(tesseract_data_path=r"C:\Program Files\Tesseract-OCR\tessdata")
ocr_machine.build_ocr_pdf_copy(input_pdf_path=scanned_pdf_example,
                               copy_destination_path=example_output_path)
```
**Expected behaviour**
**Screenshots**
```
Traceback (most recent call last):
File "C:/Users/adankert/Google Drive/GitHub/file_server_management/ocr/ocr_pdf.py", line 21, in build_ocr_pdf_copy
doc = PDF.loads(pdf_file_handle, [l])
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\pdf\pdf.py", line 54, in loads
return ReadAnyObjectTransformer().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
return super().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
out = h.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\reference\xref_transformer.py", line 140, in transform
trailer = self.get_root_transformer().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
return super().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
out = h.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\object\dictionary_transformer.py", line 48, in transform
v = self.get_root_transformer().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
return super().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
out = h.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\reference\reference_transformer.py", line 105, in transform
transformed_referenced_object = self.get_root_transformer().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
return super().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
out = h.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\page\root_dictionary_transformer.py", line 86, in transform
transformed_root_dictionary = t.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\object\dictionary_transformer.py", line 48, in transform
v = self.get_root_transformer().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
return super().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
out = h.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\reference\reference_transformer.py", line 105, in transform
transformed_referenced_object = self.get_root_transformer().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
return super().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
out = h.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\object\dictionary_transformer.py", line 48, in transform
v = self.get_root_transformer().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
return super().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
out = h.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\object\array_transformer.py", line 48, in transform
object_to_transform[i] = self.get_root_transformer().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
return super().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
out = h.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\reference\reference_transformer.py", line 105, in transform
transformed_referenced_object = self.get_root_transformer().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
return super().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
out = h.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\page\page_dictionary_transformer.py", line 64, in transform
v = self.get_root_transformer().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
return super().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
out = h.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\object\dictionary_transformer.py", line 48, in transform
v = self.get_root_transformer().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
return super().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
out = h.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\object\dictionary_transformer.py", line 48, in transform
v = self.get_root_transformer().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
return super().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
out = h.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\reference\reference_transformer.py", line 105, in transform
transformed_referenced_object = self.get_root_transformer().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\any_object_transformer.py", line 100, in transform
return super().transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\transformer.py", line 123, in transform
out = h.transform(
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\read\image\grayscale_image_transformer.py", line 76, in transform
x for x in decode_stream(object_to_transform)["DecodedBytes"]
File "C:\Users\adankert\.virtualenvs\file_server_management-Bevy2CbZ\lib\site-packages\borb\io\filter\stream_decode_util.py", line 78, in decode_stream
assert False, "Unknown /Filter %s" % filter_name
AssertionError: Unknown /Filter DCTDecode
python-BaseException
```
**Desktop (please complete the following information):**
- OS: Windows 10
- borb version 2.0.31
- Python 3.9.11
- input PDF (if applicable): https://ppc.files.com/f/64bfd2745c276cb6
veraPDF (online validator) reports the following errors for your PDF:
- Specification: ISO 19005-1:2005, Clause: 6.7.11, Test number: 1. The PDF/A version and conformance level of a file shall be specified using the PDF/A Identification extension schema. Failed 1 occurrence.
- Specification: ISO 19005-1:2005, Clause: 6.2.3, Test number: 4. If an uncalibrated colour space is used in a file then that file shall contain a PDF/A-1 OutputIntent, as defined in 6.2.2. Failed 5 occurrences.

Can you provide me with a valid PDF for which your code fails?
Kind regards, Joris Schellekens
What about this one: https://ppc.files.com/f/278550e1316ca54f
I can't tell whether they are valid (validation would be a nice feature for borb, especially if certain functionality cannot handle invalid PDF files).
Checking whether a PDF is valid or not is not an easy feature. Usually, PDF libraries try to accept input even if it isn't all that valid.
You can use https://demo.verapdf.org/ to verify whether a PDF is valid. I'll give your ticket some more attention this weekend.
Kind regards, Joris Schellekens
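
Separately from validation, it can help to see which stream filters a file declares before handing it to a parser. The following is only a rough sketch using the Python standard library, not part of borb: the function name `count_stream_filters` is hypothetical, and it scans the raw bytes with a regular expression, so it will miss filters declared inside compressed object streams or given as indirect references.

```python
import re
from collections import Counter

# Matches "/Filter /Name" and "/Filter [/Name1 /Name2]" in raw PDF bytes.
_FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
_NAME_RE = re.compile(rb"/([A-Za-z0-9]+)")


def count_stream_filters(pdf_bytes: bytes) -> Counter:
    """Count the filter names declared in the uncompressed parts of a PDF.

    Hypothetical helper: a heuristic scan, not a real PDF parser.
    """
    counts: Counter = Counter()
    for match in _FILTER_RE.finditer(pdf_bytes):
        for name in _NAME_RE.findall(match.group(1)):
            counts[name.decode("ascii")] += 1
    return counts
```

Reading a file with `Path(...).read_bytes()` and checking whether `"DCTDecode"` appears in the result gives a quick hint that the file may run into the error reported above.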
The second file (regardless of whether it is valid or not) seems to throw this error because it forces `borb` to parse a grayscale image using `FlateDecode` and `DCTDecode`. When I implemented `borb`, I thought (mistakenly, perhaps) that `DCTDecode` would only be used with JPEG images.
This is a known limitation of `borb`.
Now I know the limitation.
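
For readers wondering what a `FlateDecode` plus `DCTDecode` chain involves, here is a minimal sketch, assuming the filters are applied in the order they are listed. This is not borb's implementation: the names `apply_filter_chain` and `DECODERS` are hypothetical, and the `DCTDecode` step only checks for the JPEG start-of-image marker and passes the bytes through, leaving the actual JPEG decoding to an image library.

```python
import zlib
from typing import Callable, Dict, List


def _flate_decode(data: bytes) -> bytes:
    # FlateDecode streams are zlib/deflate compressed.
    return zlib.decompress(data)


def _dct_passthrough(data: bytes) -> bytes:
    # DCTDecode data is a JPEG stream. This sketch only verifies the
    # start-of-image marker and returns the JPEG bytes unchanged.
    if not data.startswith(b"\xff\xd8"):
        raise ValueError("DCTDecode data does not start with a JPEG SOI marker")
    return data


# Hypothetical registry mapping filter names to decoder functions.
DECODERS: Dict[str, Callable[[bytes], bytes]] = {
    "FlateDecode": _flate_decode,
    "DCTDecode": _dct_passthrough,
}


def apply_filter_chain(data: bytes, filter_names: List[str]) -> bytes:
    """Apply each named filter in order and return the resulting bytes."""
    for name in filter_names:
        decoder = DECODERS.get(name)
        if decoder is None:
            raise NotImplementedError(f"Unsupported /Filter {name}")
        data = decoder(data)
    return data
```

The bytes returned for a chain ending in `DCTDecode` are still JPEG-encoded, so turning them into pixel values would need a further step, for example opening them with an image library such as Pillow.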