unstructured bug/Cannot partition doc files with multi-byte names

Describe the bug

Calling unstructured.partition.doc.partition_doc on a .doc file whose name contains multi-byte characters (I checked 文章.doc and 风格.doc) fails with an error.

To Reproduce

Create an empty *.doc file using Word and name it 文章.doc or 风格.doc. I have attached those files below: empty_docs.zip
Run the code below:

from unstructured.partition.doc import partition_doc

doc_path = "文章.doc"
# doc_path = "风格.doc"
elements = partition_doc(filename=doc_path)

It will throw:

Traceback (most recent call last):
  File "<home>\utf8error.py", line 4, in <module>
    elements = partition_doc(filename=doc_path)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<python>\site-packages\unstructured\documents\elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "<python>\site-packages\unstructured\file_utils\filetype.py", line 731, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "<python>\site-packages\unstructured\file_utils\filetype.py", line 687, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "<python>\site-packages\unstructured\chunking\dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File "<python>\site-packages\unstructured\partition\doc.py", line 89, in partition_doc
    convert_office_doc(
  File "<python>\site-packages\unstructured\partition\common.py", line 429, in convert_office_doc
    message = output.stdout.decode().strip()
              ^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8a in position 32: invalid start byte

(I replaced my private directory with <home> and the Python installation directory with <python>.)

Expected behavior

The code should not throw an exception.

Environment Info

Windows 11 Home
Python 3.11.9
unstructured: 0.15.13

Additional context

An English (alphabet-only) filename does not cause an exception.

Sep 21 '24 15:09 Snowman-s

@Snowman-s I don't believe this is directly related to the filename. You can test that theory by making a copy of the file and changing its name to something like document.doc. I believe you'll see the same results.

You can see where this error occurs; the code is capturing the soffice command output for logging purposes: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/common/common.py#L309

Are you running on Windows? There are some possible problems with the encoding of the terminal output not being utf-8.

Sep 21 '24 18:09 scanny

@scanny Thanks for the reply!

I don't believe this is directly related to the filename. You can test that theory by making a copy of the file and changing its name to something like document.doc. I believe you'll see the same results.

I have re-tested the English filename and reconfirmed that it does not produce any errors. During that verification, I noticed that it fails not only if the file name contains multi-byte characters, but also if the path to the file contains multi-byte characters.

Are you running on Windows?

Yes. I'm using Windows 11.

Sep 22 '24 00:09 Snowman-s

Ahh, interesting. That leads me to believe that the input filename is echoed on stdout somewhere and that's where it's failing (and why it's not failing until we try to read stdout). In any case, it appears the encoding used on stdout on your machine is not UTF-8.
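A minimal sketch of that failure mode, for illustration only. It assumes the subprocess echoes the filename in a legacy Windows code page; cp932 is picked arbitrarily here, since the reporter's actual code page is not stated in this thread, and the "convert ..." message text is made up.

```python
# Assumption: the subprocess writes its output in a legacy Windows code page.
# cp932 and the message text are illustrative, not taken from soffice.
echoed = "convert 文章.doc".encode("cp932")

try:
    echoed.decode()  # bytes.decode() defaults to UTF-8
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(f"decode failed: {exc.reason}")

# An ASCII-only name encodes to the same bytes in cp932 and UTF-8,
# so it decodes without error.
ascii_ok = "convert document.doc".encode("cp932").decode()
```

This would be consistent with the report that ASCII-only paths work while paths containing multi-byte characters fail.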

Some useful detail on the underlying problem here: https://github.com/python/cpython/issues/105312

@Snowman-s what happens if you set PYTHONIOENCODING=utf-8 before running your code? https://stackoverflow.com/a/7865013/1902513

Sep 22 '24 19:09 scanny

Engineering note: one plausible solution to this is to avoid attempts to decode the captured stdout bytes and simply use str(output.stdout) instead. Rationale:

This should work for all encodings.
For the common case, the output will be predominantly ASCII characters and will still be readable.
This output is for logging purposes, so perhaps perfect rendering is not absolutely required.
Could also take this approach as a fallback in a try/except UnicodeDecodeError block.
Avoids an approach with broad potential impact like changing all IO encoding to utf-8 for the entire running Python process and subprocesses.
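The try/except fallback described above could look roughly like the sketch below. The helper name decode_for_logging is hypothetical and is not part of unstructured; only the decode-then-fall-back idea comes from this note.

```python
def decode_for_logging(raw: bytes) -> str:
    """Hypothetical helper: turn captured stdout bytes into a log message."""
    try:
        # Same call that currently raises in convert_office_doc.
        return raw.decode().strip()
    except UnicodeDecodeError:
        # str() on bytes returns its repr, e.g. "b'abc\\x8a'". ASCII parts
        # stay readable and non-ASCII bytes show up as \xNN escapes.
        return str(raw)
```

The fallback never raises, at the cost of showing escape sequences instead of the original characters.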

Sep 22 '24 19:09 scanny

@scanny

@Snowman-s what happens if you set PYTHONIOENCODING=utf-8 before running your code? https://stackoverflow.com/a/7865013/1902513

I ran $env:PYTHONIOENCODING="utf-8:surrogateescape"; python <code>.py, and it still throws the same exception. Same for sys.stdout.reconfigure(encoding='utf-8').

I forgot to mention it until now, but here is the soffice version (the Windows 64-bit build):

> soffice --version
LibreOffice 24.8.1.2 87fa9aec1a63e70835390b81c40bb8993f1d4ff6

Sep 23 '24 11:09 Snowman-s

@Snowman-s that's good to know. That narrows down the possible solutions.

Engineering note: I think this rules out us being able to affect how LibreOffice encodes messages it writes to stdout. The options I can think of are these:

1. Use a try/except block as I mentioned above, falling back to str(bytes_from_stdout) on UnicodeDecodeError which would be mostly readable.
2. Detect Windows and use the locale encoding to decode stdout bytes in that case.
3. Use a try/except and use chardet as a backup to auto-detect encoding.

My vote is option 1 since this is for logging, not for UI.
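For comparison, a rough sketch of the locale-based option using only the standard library. The helper name decode_subprocess_output is hypothetical, and whether soffice actually writes in the encoding reported by locale.getpreferredencoding is an assumption that has not been verified here. The errors="replace" argument is an extra safeguard added in this sketch, not part of the proposal.

```python
import locale
import sys


def decode_subprocess_output(raw: bytes) -> str:
    """Hypothetical helper: decode stdout bytes using the locale on Windows."""
    if sys.platform == "win32":
        # Assumption: the subprocess used the user's preferred (ANSI) encoding.
        encoding = locale.getpreferredencoding(False)
    else:
        encoding = "utf-8"
    # errors="replace" substitutes U+FFFD for undecodable bytes, so a wrong
    # guess degrades the log text instead of raising.
    return raw.decode(encoding, errors="replace").strip()
```

Compared with the str() fallback, this keeps non-ASCII characters readable when the guess is right, but it depends on guessing the subprocess encoding correctly.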

Sep 23 '24 18:09 scanny