docling
Special character for degrees: °
Bug
When processing an image file (.png) containing text such as "X degrees Celsius", I get unexpected behavior. The special character "°" is output variously as "P", "9", "'", etc. More specifically, when the file contains the text "-40°C til +120°C", I get the output "-40PC til +1209C". ...
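To make the symptom easier to spot in longer outputs, here is a minimal sketch that flags tokens shaped like "number, misread degree sign, C". It is not part of docling: the function name `find_suspect_temperatures` and the pattern are hypothetical, and the pattern only covers the three substitutes reported here ("P", "9", "'"). It can produce false positives, for example a genuine "+99C" would also be flagged.

```python
import re

# Assumption: only the substitutes reported in this issue ("P", "9", "'")
# are checked. Other misreadings of the degree sign are not detected.
SUSPECT_TEMPERATURE = re.compile(r"(?<![\d.])([+-]?\d{1,3})([P9'])C\b")


def find_suspect_temperatures(text: str) -> list[str]:
    """Return tokens that look like a temperature whose degree sign was misread."""
    return [match.group(0) for match in SUSPECT_TEMPERATURE.finditer(text)]


if __name__ == "__main__":
    # The OCR output quoted in this report.
    print(find_suspect_temperatures("-40PC til +1209C"))
    # The text as it appears in the source file.
    print(find_suspect_temperatures("-40°C til +120°C"))
```

Running the helper on the exported Markdown string gives a quick list of places where the degree sign was probably lost.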
Steps to reproduce
from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat

# Set the OCR language to "no" (Norwegian)
pipeline_options = PdfPipelineOptions()
pipeline_options.ocr_options.lang = ["no"]

# Apply the pipeline options to image inputs
converter = DocumentConverter(
    format_options={
        InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
    }
)

path_file_image = "<path to attached screen shot>"
result = converter.convert(path_file_image)
result.document.export_to_markdown()
...
Docling version
Docling version: 2.7.0
Docling Core version: 2.4.1
Docling IBM Models version: 2.0.6
Docling Parse version: 2.1.0
...
Python version
Python 3.11.10 ...
Note that the degree sign comes out correctly when I convert web pages instead, for example:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("https://www.britannica.com/technology/Celsius-temperature-scale")
result.document.export_to_markdown()
@sindre-sonat Can you provide a PDF which exposes the problem? Thanks.