
bug: docx does not work

Open dev4mobile opened this issue 8 months ago • 7 comments

Image

dev4mobile avatar Apr 23 '25 08:04 dev4mobile

It also can't parse doc files.

Image

dev4mobile avatar Apr 23 '25 08:04 dev4mobile

I want to contribute to this, can I?

MohabASHRAF-byte avatar Apr 23 '25 08:04 MohabASHRAF-byte

I'm having the same issue.

nguyenson0904 avatar Apr 23 '25 09:04 nguyenson0904

Can you provide the documents you tested with?

Abhijeet213 avatar Apr 23 '25 16:04 Abhijeet213

I just got markitdown and tried it with a file and got:

[rashino@archrailgun Downloads]$ markitdown Refined\ Homelab\ Service\ Metaplan_.docx 
Traceback (most recent call last):
  File "/usr/bin/markitdown", line 8, in <module>
    sys.exit(main())
             ~~~~^^
  File "/usr/lib/python3.13/site-packages/markitdown/__main__.py", line 197, in main
    result = markitdown.convert(
        args.filename, stream_info=stream_info, keep_data_uris=args.keep_data_uris
    )
  File "/usr/lib/python3.13/site-packages/markitdown/_markitdown.py", line 260, in convert
    return self.convert_local(source, stream_info=stream_info, **kwargs)
           ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.13/site-packages/markitdown/_markitdown.py", line 311, in convert_local
    guesses = self._get_stream_info_guesses(
        file_stream=fh, base_guess=base_guess
    )
  File "/usr/lib/python3.13/site-packages/markitdown/_markitdown.py", line 675, in _get_stream_info_guesses
    result = self._magika.identify_stream(file_stream)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Magika' object has no attribute 'identify_stream'. Did you mean: 'identify_bytes'?
[rashino@archrailgun Downloads]$ 

TheNoteTaker avatar Apr 26 '25 06:04 TheNoteTaker
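
The traceback above shows markitdown calling `identify_stream` on a `Magika` object that only offers `identify_bytes`, which suggests (this is an inference from the traceback, not something confirmed in the thread) that the installed magika package does not match what this markitdown build expects. A minimal sketch for checking the local environment; the helper names `installed_version` and `magika_has_identify_stream` are made up for illustration:

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def magika_has_identify_stream() -> bool:
    """Report whether the importable Magika class exposes identify_stream."""
    try:
        from magika import Magika
    except Exception:
        # Any import failure is treated as "not available".
        return False
    return hasattr(Magika, "identify_stream")


if __name__ == "__main__":
    print("markitdown version:", installed_version("markitdown"))
    print("magika version:", installed_version("magika"))
    print("identify_stream available:", magika_has_identify_stream())
```

If the last line prints `False` while markitdown is installed, the two packages are probably out of step, and comparing the printed versions against markitdown's declared dependencies would be the next thing to look at.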

It works on this doc: https://calibre-ebook.com/downloads/demos/demo.docx

Image

Abhijeet213 avatar Apr 26 '25 17:04 Abhijeet213

@dev4mobile - your error references w:ilvl, which looks like a numbered list that is not properly defined.

cdm-arm avatar May 12 '25 09:05 cdm-arm
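
One way to look for the kind of list problem described above is to inspect the document XML directly. The sketch below assumes one specific form of "not properly defined": a paragraph whose `w:numPr` has a `w:ilvl` (list level) but no `w:numId` (reference to a numbering definition). The actual error is only visible in the screenshot, so this may not be the exact cause; the function name `find_incomplete_list_items` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def find_incomplete_list_items(docx_source):
    """Return indexes of paragraphs that have w:ilvl but no w:numId.

    docx_source is a path or a binary file-like object holding a .docx file.
    """
    with zipfile.ZipFile(docx_source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    incomplete = []
    for index, para in enumerate(root.iter(f"{{{W}}}p")):
        num_pr = para.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # not a list paragraph
        has_level = num_pr.find("w:ilvl", NS) is not None
        has_num_id = num_pr.find("w:numId", NS) is not None
        if has_level and not has_num_id:
            incomplete.append(index)
    return incomplete
```

Calling `find_incomplete_list_items("your-file.docx")` returns a list of paragraph positions to inspect in Word; an empty list means this particular pattern was not found, not that the file is free of numbering problems.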

It also can't parse doc files.

Image

So, how do I fix it? I have the same problem.

linnnff avatar Jun 03 '25 09:06 linnnff