unstructured
UnicodeDecodeError in logger.info during the execution of partition_doc
Hi, I'm using version 0.11.8. I use the following code to run partition_doc:
from unstructured.partition.doc import partition_doc
filename = "D:\\Testcase\\test.doc"
elements = partition_doc(filename=filename)
However, I get a UnicodeDecodeError inside convert_office_doc, at the call logger.info(output.decode().strip()).
The error message is as follows:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 38: invalid start byte
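For context: bytes.decode() defaults to UTF-8, so any non-UTF-8 byte in the converter's output raises this error. One plausible cause (an assumption, not confirmed in this thread) is a Windows console code page that is not UTF-8. A minimal sketch of a more tolerant decode follows; decode_process_output is a hypothetical helper for illustration, not part of unstructured:

```python
import locale


def decode_process_output(raw: bytes) -> str:
    """Decode subprocess output without raising on non-UTF-8 bytes.

    Hypothetical helper: tries UTF-8 first, then falls back to the
    locale's preferred encoding, replacing any bytes that still
    cannot be decoded.
    """
    try:
        return raw.decode("utf-8").strip()
    except UnicodeDecodeError:
        # Assumption: the tool wrote its message in the system encoding.
        encoding = locale.getpreferredencoding(False)
        return raw.decode(encoding, errors="replace").strip()


# A strict UTF-8 decode of a stray 0xaa byte fails, as in the report above.
try:
    b"message \xaa".decode()
except UnicodeDecodeError as exc:
    strict_error = exc

# The tolerant helper returns a string instead of raising.
tolerant_text = decode_process_output(b"message \xaa")
```

With errors="replace", undecodable bytes become replacement characters, so the log line may look odd but the call no longer aborts the conversion.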
@MinnyKuan - Any chance you can share the file that causes the error?
Sure. Because I cannot upload a doc file here, I compressed it. This is a zip archive containing the doc document: test.zip
Thanks @MinnyKuan ! We'll look into this as soon as we can
Hi @MinnyKuan - I was able to process the example doc using partition_doc on a Mac. Are you able to try this on Mac or Linux? From your example code it looks like you're on Windows.
I tried this on Linux. The doc file produces the following error message, but the docx file does not.
Error: source file could not be loaded
PackageNotFoundError Traceback (most recent call last)
PackageNotFoundError: Package not found at '/tmp/tmp138uvpct/test.docx'
I also tried using the UnstructuredFileLoader method, but the result was the same.
From my understanding, convert_office_doc converts the doc file to a docx file and then reads it. However, after the conversion, the docx file cannot be found. Could it be related to the parameters used in the saveAs function during the conversion from doc to docx?
I am currently still searching for a solution.
ps. I have already installed LibreOffice and LibreOffice Writer
@MinnyKuan is this still an issue for you? I was able to process the example doc using partition_doc on Linux.
@scanny Yes, but currently I am converting the doc files to docx format locally before reading them. This allows me to continue using them.
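A rough sketch of that workaround, assuming LibreOffice's soffice executable is on PATH. The helper names (build_convert_command, expected_docx_path, convert_doc_to_docx) are made up for illustration and are not part of unstructured. It also checks that the converted file really exists, since a missing output file is what leads to the PackageNotFoundError above:

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(doc_path: str, out_dir: str) -> List[str]:
    """Build the LibreOffice command line for a headless doc -> docx conversion."""
    return ["soffice", "--headless", "--convert-to", "docx",
            "--outdir", out_dir, doc_path]


def expected_docx_path(doc_path: str, out_dir: str) -> Path:
    """LibreOffice writes <stem>.docx into the output directory."""
    return Path(out_dir) / (Path(doc_path).stem + ".docx")


def convert_doc_to_docx(doc_path: str, out_dir: str) -> Path:
    """Convert a .doc file and verify the .docx was actually produced."""
    if shutil.which("soffice") is None:
        raise FileNotFoundError("soffice not found on PATH")
    result = subprocess.run(build_convert_command(doc_path, out_dir),
                            capture_output=True, check=False)
    docx_path = expected_docx_path(doc_path, out_dir)
    if not docx_path.exists():
        # Decode defensively so a non-UTF-8 message cannot raise here.
        detail = result.stderr.decode("utf-8", errors="replace").strip()
        raise RuntimeError(f"conversion produced no file: {detail}")
    return docx_path


# Usage (requires LibreOffice and unstructured to be installed):
#   from unstructured.partition.docx import partition_docx
#   docx_path = convert_doc_to_docx("test.doc", "converted")
#   elements = partition_docx(filename=str(docx_path))
```

Note that the converted file goes to partition_docx, not partition_doc.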
@MinnyKuan A couple things:
- If you have already converted a document from .doc to .docx, you will need to call partition_docx(...) rather than partition_doc(...).
- Please post a minimal reproducible example that produces this problem in your environment: https://stackoverflow.com/help/minimal-reproducible-example
- Post the entire stack trace you receive, not just the error message on the last line. This will allow us to pinpoint where the error is occurring.
If you can provide those I'll take a look :)
Closing as cannot reproduce, assumed resolved.