Error when converting a PDF with Chinese content
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\cs-hg-293\AppData\Local\Programs\Python\Python312\Scripts\markitdown.exe\__main__.py", line 7, in <module>
  File "C:\Users\cs-hg-293\AppData\Local\Programs\Python\Python312\Lib\site-packages\markitdown\__main__.py", line 43, in main
    print(result.text_content)
UnicodeEncodeError: 'gbk' codec can't encode character '\xae' in position 62: illegal multibyte sequence
It seems the bug is related to the terminal you are using, and I was able to reproduce it. If you are using PowerShell or CMD on Windows, the default text encoding is not UTF-8, so it cannot handle certain characters (such as ®) or emoji properly, which causes an encoding mismatch when the output is written to a file. In this case, you can execute the following command:
[System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8
and try again.
Note from the Chinese version of this comment: the error appears because the redirection operator > is used, and the redirected output cannot represent characters outside the character set of the console encoding. Run the command above and then try again.
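The mismatch described above can be reproduced without markitdown at all. The following is a minimal sketch, assuming only that the console encoding is GBK as the traceback reports; `try_encode` and the sample string are hypothetical names made up for illustration.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or the error message if encoding fails.

    Hypothetical helper for illustration; not part of markitdown.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)


# "®" is U+00AE, the character '\xae' named in the traceback.
sample = "Acme® 中文"

# GBK has no mapping for U+00AE, so this yields the error message.
print(try_encode(sample, "gbk"))

# UTF-8 can represent every Unicode character, so this yields bytes.
print(try_encode(sample, "utf-8"))
```

The Chinese characters themselves encode fine in GBK; it is the ® sign that triggers the failure.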
Personally, I think fixing this at the code level would be more reliable. cmd and PowerShell do not set a suitable encoding on their own, and it is not realistic to ask every user to change it. Some users may not be able to change it at all.
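One way a code-level fix could look is sketched below. This is only an assumption about a possible approach, not the change made in any PR: `utf8_text_stream` and `force_utf8_stdout` are hypothetical helpers, while `io.TextIOWrapper` and its `reconfigure` method (Python 3.7+) are standard library.

```python
import io
import sys


def utf8_text_stream(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8.

    Hypothetical helper for illustration; not part of markitdown.
    """
    return io.TextIOWrapper(
        binary_stream, encoding="utf-8", errors="replace", newline=""
    )


def force_utf8_stdout():
    """Switch sys.stdout to UTF-8 when the stream supports reconfigure()."""
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if callable(reconfigure):
        # Unencodable characters (e.g. lone surrogates) are replaced
        # instead of raising UnicodeEncodeError.
        reconfigure(encoding="utf-8", errors="replace")
```

A command-line entry point could call `force_utf8_stdout()` once before printing. Whether the terminal then displays the bytes correctly still depends on the terminal's own settings.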
The encoding layer cannot fully accommodate this case; otherwise, it is something users have to watch out for themselves.
Hi @liguobao, I created a PR that should solve these encoding issues; it is waiting for review ...
I used the command you gave, but it didn't work.
Use -o instead of > and the output will be normal.
It seems that using -o instead of > works.
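A plausible reason -o behaves differently from > is that the program then writes the file itself and can choose the encoding, instead of going through the console's encoding. How markitdown implements -o is not shown in this thread, so the sketch below is an assumption; `write_markdown` is a hypothetical name.

```python
from pathlib import Path


def write_markdown(text, output_path):
    """Write text to output_path as UTF-8, independent of the console encoding.

    Hypothetical helper for illustration; not markitdown's actual code.
    """
    Path(output_path).write_text(text, encoding="utf-8")
```

Because the encoding is passed explicitly, the result does not depend on the code page of the shell that launched the program.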
Converting txt and docx files with markitdown works for me, but pdf, jpg, and png files do not. There is no error, but the output is empty.