invalid string: control character U+0018

Open gadgetlabs opened this issue 1 year ago • 1 comments

Bug

The following exception is raised by the call parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no):

RuntimeError: [json.exception.parse_error.101] parse error at line 1246, column 42: syntax error while parsing value - invalid string: control character U+0018 (CAN) must be escaped to \u0018; last read: '"- <U+0018>'

I think the issue occurs because docling-parse does not handle the empty values when producing JSON.
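For context, the error text itself names an unescaped control character: JSON does not allow characters below U+0020 to appear raw inside a string. The sketch below uses Python's standard json module only as a stand-in (it is not docling-parse's JSON code) to show the same class of failure and the escaped form a serializer would normally emit.

```python
import json

CAN = "\x18"  # U+0018 (CANCEL), the character named in the error message

# A raw control character inside a JSON string is invalid JSON.
raw_json = '"- ' + CAN + '"'
try:
    json.loads(raw_json)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False

# A conforming serializer escapes the character, and the result round-trips.
escaped_json = json.dumps("- " + CAN)
round_trip = json.loads(escaped_json)
```

Here the raw string is rejected while the escaped one parses back to the original text, which suggests the problem sits on the JSON-producing side.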

Example PDF can be found here (attached also) https://etc.usf.edu/lit2go/pdf/passage/348/the-adventures-of-sherlock-holmes-004-adventure-4-the-boscombe-valley-mystery.pdf

the-adventures-of-sherlock-holmes-004-adventure-4-the-boscombe-valley-mystery.pdf

It is unclear to me whether docling should fail gracefully or ignore the character, so I am reporting this as a bug rather than submitting a fix.
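The two options can be sketched roughly as follows. This is an illustration only, again using Python's json module; the function names are hypothetical and nothing here reflects how docling or docling-parse actually handles the case.

```python
import json
import re

# Control characters that JSON forbids unescaped inside strings.
# Tab, newline and carriage return are left alone because they are
# legal whitespace between JSON tokens.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def parse_ignoring_control_chars(text):
    """Option 1 (ignore): drop the offending characters, then parse."""
    return json.loads(_CONTROL_CHARS.sub("", text))


def parse_or_none(text):
    """Option 2 (fail gracefully): return None instead of raising."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


bad_page = '{"text": "- \x18"}'
ignored = parse_ignoring_control_chars(bad_page)
graceful = parse_or_none(bad_page)
```

Ignoring keeps the page but silently alters its text, while failing gracefully keeps the text intact but loses the page, so the choice depends on which loss is more acceptable.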

gadgetlabs avatar Nov 26 '24 17:11 gadgetlabs

@gadgetlabs I can reproduce this error with the default docling settings. However, I can convert the file successfully by switching to the docling-parse-v2 backend:

docling --pdf-backend=dlparse_v2 the-adventures-of-sherlock-holmes-004-adventure-4-the-boscombe-valley-mystery.pdf

cau-git avatar Nov 27 '24 09:11 cau-git

Solved here: https://github.com/DS4SD/docling-parse/pull/73

PeterStaar-IBM avatar Dec 10 '24 15:12 PeterStaar-IBM