
Special character for degrees: °

Open sindre-sonat opened this issue 1 year ago • 2 comments

Bug

When processing an image file (.png) containing "X degrees Celsius", I get unexpected behavior. The special character "°" is output variously as "P", "9", "'", etc. More specifically, when the PDF file contains the text "-40°C til +120°C", I get the output "-40PC til +1209C". ...

Steps to reproduce

from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat

pipeline_options = PdfPipelineOptions()

pipeline_options.ocr_options.lang = ["no"]

converter = DocumentConverter(
    format_options={
        InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
    }
)

path_file_image = "<path to attached screenshot>"
result = converter.convert(path_file_image)

result.document.export_to_markdown()

...
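To check whether a given conversion result is affected, the exported Markdown can be scanned for temperature-like tokens that end in "C" but have no degree sign. The snippet below is only a rough sketch of that check: `find_suspect_temperatures` and `SUSPECT_TEMPERATURE` are hypothetical names made up for illustration, not part of docling, and the pattern only covers the misreadings reported above (one stray character or digit in place of "°").

```python
import re

# Hypothetical heuristic (not part of docling): a number, optionally signed,
# followed by at most one stray non-digit character and then "C", with no
# degree sign in between. Covers misreadings such as "-40PC" and "+1209C".
SUSPECT_TEMPERATURE = re.compile(r"(?<!\w)[+-]?\d+[^\s\d°]?C\b")


def find_suspect_temperatures(text: str) -> list[str]:
    """Return tokens that look like Celsius values whose "°" was lost."""
    return SUSPECT_TEMPERATURE.findall(text)


if __name__ == "__main__":
    # In practice the text would come from result.document.export_to_markdown()
    print(find_suspect_temperatures("-40PC til +1209C"))
    print(find_suspect_temperatures("-40°C til +120°C"))
```

A non-empty result only flags candidates for manual review. The heuristic cannot tell whether "+1209C" was meant to be "+120°C", so it should not be used to rewrite the text automatically.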

Docling version

Docling version: 2.7.0
Docling Core version: 2.4.1
Docling IBM Models version: 2.0.6
Docling Parse version: 2.1.0
...

Python version

Python 3.11.10 ...

[Attached screenshot: page_1_box_2]

sindre-sonat, Nov 25 '24 11:11

Note that the degree sign is converted correctly when I convert web pages instead, for example:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()

result = converter.convert("https://www.britannica.com/technology/Celsius-temperature-scale")

result.document.export_to_markdown()

sindre-sonat, Nov 25 '24 11:11

@sindre-sonat Can you provide a PDF which exposes the problem? Thanks.

cau-git, May 21 '25 14:05