docling
Markdown document is converted incorrectly
Bug
Converting a Markdown document produces incorrect output. ...
Steps to reproduce
The original content of the Markdown document is something like:
# ABCDEFG
- abc:
  - abc123:
    - abc1234:
      - abc12345:
        - a.
        - b.
      - abcd1234:
        - abcd12345:
          - a.
          - b.
- def:
  - def1234:
    - def12345.
- ghijkl
Here's the conversion process:
$ docling --from md --to md -vv /data/doc/test2.md
DEBUG:docling.backend.md_backend:MD INIT!!!
DEBUG:docling.backend.md_backend:# ABCDEFG
- abc:
  - abc123:
    - abc1234:
      - abc12345:
        - a.
        - b.
      - abcd1234:
        - abcd12345:
          - a.
          - b.
- def:
  - def1234:
    - def12345.
- ghijkl
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document test2.md
DEBUG:docling.backend.md_backend:converting Markdown...
DEBUG:docling.backend.md_backend:Some other element: <Document children=[<Heading children=[<RawText children='ABCDEFG'>]>,
<BlankLine children=[]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abc:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abc123:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abc1234:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abc12345:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='abcd1234:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abcd12345:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>]>]>]>]>]>]>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='def:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='def1234:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='def12345.'>]>]>]>]>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='ghijkl'>]>]>]>]>
DEBUG:docling.backend.md_backend: - Heading level 1, content: ABCDEFG
DEBUG:docling.backend.md_backend:Some other element: <BlankLine children=[]>
DEBUG:docling.backend.md_backend: - List unordered
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
INFO:docling.document_converter:Finished converting document test2.md in 2.19 sec.
INFO:docling.cli.main:writing Markdown output to test2.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 2.19 seconds.
And here's the final result I got:
$ cat test2.md
# ABCDEFG
- abc:
- def:
- ghijkl
I also tried converting this document with the Python library, but I got the same output.
In the final result, most of the content is missing: only the three top-level list items are written, and every nested item is dropped. Did I do anything wrong?
PS: I know that converting Markdown to Markdown might seem unnecessary, but in my application scenario I can't be sure what format users will provide their content in. I need to be able to convert various content formats into Markdown.
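For anyone reproducing this from Python, here is a minimal sketch of the round trip and of a quick check for which list items went missing. `DocumentConverter` and `export_to_markdown` are docling's usual entry points, but the exact script used for this report is not shown, so treat `convert_with_docling` as an assumption. The helper names `list_items` and `dropped_items` are hypothetical, and the two-space indentation in `SOURCE` is reconstructed from the parser dump above rather than copied from the original file.

```python
import re
from collections import Counter

# Matches one bullet line at any indentation and captures the item text.
ITEM_RE = re.compile(r"^[ \t]*[-*+][ \t]+(.*\S)[ \t]*$", re.MULTILINE)


def list_items(markdown: str) -> list[str]:
    """Return the text of every bullet item, ignoring nesting depth."""
    return ITEM_RE.findall(markdown)


def dropped_items(source_md: str, output_md: str) -> list[str]:
    """Return items present in the source but missing from the output."""
    missing = Counter(list_items(source_md)) - Counter(list_items(output_md))
    return sorted(missing.elements())


def convert_with_docling(path: str) -> str:
    """Round-trip a file through docling (assumed call; needs docling installed)."""
    from docling.document_converter import DocumentConverter  # lazy import

    result = DocumentConverter().convert(path)
    return result.document.export_to_markdown()


# Input from this report (indentation width is an assumption).
SOURCE = """# ABCDEFG
- abc:
  - abc123:
    - abc1234:
      - abc12345:
        - a.
        - b.
      - abcd1234:
        - abcd12345:
          - a.
          - b.
- def:
  - def1234:
    - def12345.
- ghijkl
"""

# Output from this report, as printed by `cat test2.md`.
OUTPUT = """# ABCDEFG
- abc:
- def:
- ghijkl
"""

if __name__ == "__main__":
    missing = dropped_items(SOURCE, OUTPUT)
    print(f"{len(missing)} of {len(list_items(SOURCE))} items missing:")
    for item in missing:
        print("  ", item)
```

With docling installed, passing `convert_with_docling("/data/doc/test2.md")` as the second argument to `dropped_items` in place of `OUTPUT` runs the same check against a live conversion.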
Docling version
$ docling --version
Docling version: 2.14.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0
Python version
$ python --version
Python 3.11.10