
UnicodeEncodeError: 'gbk' codec can't encode character '\u2009' in position 390: illegal multibyte sequence

Open ZhuPingFei opened this issue 1 year ago • 4 comments

UnicodeEncodeError: 'gbk' codec can't encode character '\u2009' in position 390: illegal multibyte sequence

ZhuPingFei avatar Dec 28 '24 18:12 ZhuPingFei

same issue

codeicu avatar Dec 31 '24 01:12 codeicu

It seems that issue https://github.com/microsoft/markitdown/issues/198 has solved this problem. If you still encounter it, here is my solution:

  1. Write a .py file (let's say it is converter.py)
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("your-file-name")
print(result.text_content)
  2. Redirect the output text into a Markdown file.
$ python converter.py > your-file-name.md

That worked for me.
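If the redirect still raises the same error (on Windows, Python typically encodes redirected output with the locale codec, such as gbk), one option is to write UTF-8 bytes to the stream directly instead of calling print. A minimal sketch, assuming the converted text is an ordinary str; `write_utf8` and `sample` are hypothetical names used only for illustration, and `sample` stands in for `result.text_content`:

```python
import io


def write_utf8(text: str, binary_stream) -> None:
    """Write text to a binary stream as UTF-8, bypassing the locale codec."""
    binary_stream.write(text.encode("utf-8"))


# Stand-in for result.text_content; it contains U+2009 (THIN SPACE),
# the character named in the error message.
sample = "10\u2009kg"

# Demonstrated on an in-memory buffer. In converter.py one would
# `import sys` and pass sys.stdout.buffer instead.
buffer = io.BytesIO()
write_utf8(sample, buffer)
```

Setting the environment variable PYTHONIOENCODING=utf-8 before running the script is another commonly used way to get the same effect without changing the code.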

SEU-zxj avatar Dec 31 '24 08:12 SEU-zxj

If the above code does not work, you can try the following:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("your-file-name.pptx")
text = result.text_content
    
# Save to a file
with open("your-file-name.md", "w", encoding="utf-8") as f:
    f.write(text)

# Print to the console
# print(text)
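The explicit encoding="utf-8" in the open() call is what sidesteps the error: without it, open() falls back to the locale encoding, which is gbk on the affected systems. A small sketch of the difference, assuming a sample string in place of `result.text_content`; `try_write` is a hypothetical helper, not part of markitdown:

```python
import os
import tempfile

# Stand-in for result.text_content; contains U+2009 (THIN SPACE).
sample = "10\u2009kg"


def try_write(text: str, encoding: str) -> bool:
    """Return True if text can be written to a temp file with this encoding."""
    fd, path = tempfile.mkstemp(suffix=".md")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        return True
    except UnicodeEncodeError:
        # Raised when the codec has no mapping for a character in text.
        return False
    finally:
        os.remove(path)


gbk_ok = try_write(sample, "gbk")     # gbk has no mapping for U+2009
utf8_ok = try_write(sample, "utf-8")  # UTF-8 can encode any code point
```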

SEU-zxj avatar Dec 31 '24 08:12 SEU-zxj

This works for me:

markitdown < 1.pdf

codeicu avatar Dec 31 '24 08:12 codeicu