Markdown document is converted incorrectly

Open kime541200 opened this issue 2 months ago • 3 comments

Bug

Converting a Markdown document produces incorrect output. ...

Steps to reproduce

The original content of the Markdown document looks something like this:

# ABCDEFG
- abc:
	- abc123:
		- abc1234:
			- abc12345:
				- a.
				- b.
		- abcd1234:
			- abcd12345:
				- a.
				- b.
- def:
	- def1234:
		- def12345。
- ghijkl

Here's the conversion process:

$ docling --from md --to md -vv /data/doc/test2.md
DEBUG:docling.backend.md_backend:MD INIT!!!
DEBUG:docling.backend.md_backend:# ABCDEFG

- abc:
  - abc123:
    - abc1234:
      - abc12345:
        - a.
        - b.
      - abcd1234:
        - abcd12345:
          - a.
          - b.
- def:
  - def1234:
    - def12345.
- ghijkl
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document test2.md
DEBUG:docling.backend.md_backend:converting Markdown...
DEBUG:docling.backend.md_backend:Some other element: <Document children=[<Heading children=[<RawText children='ABCDEFG'>]>,
 <BlankLine children=[]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc123:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc1234:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc12345:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='abcd1234:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abcd12345:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>]>]>]>]>]>]>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='def:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='def1234:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='def12345.'>]>]>]>]>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='ghijkl'>]>]>]>]>
DEBUG:docling.backend.md_backend: - Heading level 1, content: ABCDEFG
DEBUG:docling.backend.md_backend:Some other element: <BlankLine children=[]>
DEBUG:docling.backend.md_backend: - List unordered
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
INFO:docling.document_converter:Finished converting document test2.md in 2.19 sec.
INFO:docling.cli.main:writing Markdown output to test2.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 2.19 seconds.

And here's the final result I got:

$ cat test2.md
# ABCDEFG

- abc:
- def:
- ghijkl

I also tried converting this document with the Python library, but I got the same output.

In the final result, a lot of content is missing: only the three top-level list items are output, and every nested item is dropped. Did I do anything wrong?
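
To put a number on the loss, here is a minimal sketch that compares the list items in the input against the output shown above. The helper names (`list_items`, `dropped_items`, `convert_with_docling`) are made up for this illustration and are not part of docling. The `convert_with_docling` function assumes the usual `DocumentConverter` usage from the Python library and is not called here, so the check itself runs with the standard library only.

```python
import re

# Matches an unordered list item: optional indentation, a bullet, then the text.
_BULLET = re.compile(r"^[ \t]*[-*+][ \t]+(.*\S)[ \t]*$")


def list_items(markdown: str) -> list[str]:
    """Return the text of every unordered list item, in document order."""
    items = []
    for line in markdown.splitlines():
        match = _BULLET.match(line)
        if match:
            items.append(match.group(1))
    return items


def dropped_items(original: str, converted: str) -> list[str]:
    """Return list items present in `original` but missing from `converted`."""
    remaining = list_items(converted)
    dropped = []
    for item in list_items(original):
        if item in remaining:
            remaining.remove(item)  # consume one occurrence so duplicates count
        else:
            dropped.append(item)
    return dropped


def convert_with_docling(path: str) -> str:
    """Hypothetical helper: convert a file to Markdown with docling.

    Assumes docling is installed; imported lazily so the check above
    works without it.
    """
    from docling.document_converter import DocumentConverter

    return DocumentConverter().convert(path).document.export_to_markdown()


# The input and output from this report, inlined for the comparison.
ORIGINAL = (
    "# ABCDEFG\n"
    "- abc:\n"
    "\t- abc123:\n"
    "\t\t- abc1234:\n"
    "\t\t\t- abc12345:\n"
    "\t\t\t\t- a.\n"
    "\t\t\t\t- b.\n"
    "\t\t- abcd1234:\n"
    "\t\t\t- abcd12345:\n"
    "\t\t\t\t- a.\n"
    "\t\t\t\t- b.\n"
    "- def:\n"
    "\t- def1234:\n"
    "\t\t- def12345.\n"
    "- ghijkl\n"
)
CONVERTED = "# ABCDEFG\n\n- abc:\n- def:\n- ghijkl\n"

missing = dropped_items(ORIGINAL, CONVERTED)
print(f"{len(missing)} of {len(list_items(ORIGINAL))} list items are missing")
# prints: 11 of 14 list items are missing
```

This matches the debug log, which reports only three `List item` entries even though the parsed tree contains all of the nested lists.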

PS: I know that inputting and outputting Markdown might be unnecessary, but in my application scenario, I'm not sure in what format users will provide their content. I need to be able to convert various content formats into Markdown.

Docling version

$ docling --version
Docling version: 2.14.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

$ python --version
Python 3.11.10

kime541200, Dec 18 '24 13:12