UnicodeDecodeError in logger.info during the execution of partition_doc

Open MinnyKuan opened this issue 1 year ago • 5 comments

Hi, I'm using version 0.11.8 and run partition_doc with the following code:


from unstructured.partition.doc import partition_doc

filename = "D:\\Testcase\\test.doc"
elements = partition_doc(filename=filename)

However, a UnicodeDecodeError is raised inside convert_office_doc, at the call logger.info(output.decode().strip()). The error message is as follows: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 38: invalid start byte
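For context, this kind of failure can be reproduced without the library. The following is a minimal sketch, assuming the converter's console output is encoded in something other than UTF-8 (for example a legacy Windows code page). The sample bytes and the helper name decode_for_logging are made up for illustration; they are not captured from LibreOffice and are not part of unstructured.

```python
# Hypothetical bytes standing in for the converter's console output.
# 0xaa is a UTF-8 continuation byte, so it cannot start a character.
raw_output = b"convert D:\\Testcase\\test.doc \xaa\xa9 done\n"


def decode_for_logging(data: bytes, encoding: str = "utf-8") -> str:
    """Decode subprocess output without raising on undecodable bytes."""
    # errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
    return data.decode(encoding, errors="replace").strip()


# A strict decode, like output.decode().strip(), raises on these bytes.
try:
    raw_output.decode().strip()
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print(f"strict decode failed: {exc}")

# The tolerant decode keeps the readable part of the message.
safe_text = decode_for_logging(raw_output)
print(safe_text)
```

Replacing undecodable bytes loses some characters in the log message, but it keeps a logging call from aborting the conversion.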

MinnyKuan avatar Jan 24 '24 02:01 MinnyKuan

@MinnyKuan - Any chance you can share the file that causes the error?

MthwRobinson avatar Jan 24 '24 13:01 MthwRobinson

Sure. Since I cannot upload a doc file here, I compressed it. This is the compressed doc document: test.zip

MinnyKuan avatar Jan 25 '24 01:01 MinnyKuan

Thanks @MinnyKuan ! We'll look into this as soon as we can

MthwRobinson avatar Jan 25 '24 13:01 MthwRobinson

Hi @MinnyKuan - I was able to process the example doc using partition_doc on a Mac. Are you able to try this on Mac or Linux? From your example code it looks like you're on Windows.

MthwRobinson avatar Feb 06 '24 18:02 MthwRobinson

I tried this on Linux. The doc file produces the following error message, but the docx file does not.

Error: source file could not be loaded
PackageNotFoundError                      Traceback (most recent call last)
PackageNotFoundError: Package not found at '/tmp/tmp138uvpct/test.docx'

I also tried the UnstructuredFileLoader method, but the result was the same. As I understand it, convert_office_doc converts the doc file to a docx file and then reads it. However, after the conversion, the docx file cannot be found. Could this be related to the parameters used in the saveAs function during the conversion from doc to docx? I am still searching for a solution.

P.S. I have already installed LibreOffice and LibreOffice Writer.
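One way to check whether the conversion step itself fails is to run LibreOffice directly and see whether the .docx file appears. The following is a minimal sketch, assuming the soffice executable is on PATH. The helper names are hypothetical, and this is not necessarily how convert_office_doc invokes LibreOffice.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(src: str, outdir: str, soffice: str = "soffice") -> List[str]:
    """Build a headless LibreOffice command that converts src to docx."""
    return [soffice, "--headless", "--convert-to", "docx", "--outdir", outdir, src]


def expected_output_path(src: str, outdir: str) -> Path:
    """LibreOffice writes <input stem>.docx into the output directory."""
    return Path(outdir) / (Path(src).stem + ".docx")


def convert_doc_to_docx(src: str, outdir: str) -> Path:
    """Run the conversion and fail loudly if no output file was written."""
    result = subprocess.run(
        build_convert_command(src, outdir), capture_output=True, check=False
    )
    # Decode tolerantly so non-UTF-8 console output cannot raise here.
    print(result.stdout.decode(errors="replace").strip())
    print(result.stderr.decode(errors="replace").strip())
    target = expected_output_path(src, outdir)
    if not target.exists():
        raise FileNotFoundError(f"conversion produced no file at {target}")
    return target
```

If convert_doc_to_docx raises FileNotFoundError, the problem lies in the LibreOffice conversion rather than in reading the docx afterwards.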

MinnyKuan avatar Feb 23 '24 06:02 MinnyKuan

@MinnyKuan is this still an issue for you? I was able to process the example doc using partition_doc on my Linux machine.

srjchauhan avatar Apr 20 '24 07:04 srjchauhan

@scanny Yes, but currently I am converting the doc files to docx format locally before reading them. This allows me to continue using them.

MinnyKuan avatar Apr 22 '24 05:04 MinnyKuan

@MinnyKuan A couple things:

  1. If you have already converted a document from .doc to .docx, you will need to call partition_docx(...) rather than partition_doc(...).
  2. Please post a minimum-reproducible-example that produces this problem in your environment: https://stackoverflow.com/help/minimal-reproducible-example
  3. Post the entire stack trace you receive, not just the error message on the last line. This will allow us to pinpoint where the error is occurring.

If you can provide those I'll take a look :)
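As an illustration of the first point, a small dispatcher can pick the partitioner from the file extension. This is only a sketch: pick_partitioner_name and partition_word_file are hypothetical helpers, and it assumes partition_docx is importable from unstructured.partition.docx in the same way partition_doc is imported from unstructured.partition.doc.

```python
from pathlib import Path


def pick_partitioner_name(filename: str) -> str:
    """Return the name of the partition function matching the extension."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".doc":
        return "partition_doc"
    if suffix == ".docx":
        return "partition_docx"
    raise ValueError(f"unsupported Word extension: {suffix!r}")


def partition_word_file(filename: str):
    """Partition a .doc or .docx file with the matching function."""
    # Imports are deferred so the helper above works without the library.
    if pick_partitioner_name(filename) == "partition_doc":
        from unstructured.partition.doc import partition_doc
        return partition_doc(filename=filename)
    from unstructured.partition.docx import partition_docx
    return partition_docx(filename=filename)
```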

scanny avatar Apr 22 '24 17:04 scanny

Closing as cannot reproduce, assumed resolved.

scanny avatar Apr 27 '24 00:04 scanny