bug/partition_msg halts for attachments with UNK type

Open S1M0N38 opened this issue 1 year ago • 7 comments

Currently, the attachment_partitioner is hardcoded to the partition function in the following file:

https://github.com/Unstructured-IO/unstructured/blob/50d75c47d346413de8a6dbbcf72009871f7ecd56/unstructured/partition/msg.py#L259-L277

However, according to the official documentation, the partition_msg function accepts attachment_partitioner as an argument.

S1M0N38 avatar Sep 27 '24 18:09 S1M0N38

#3605 PR (Draft)

S1M0N38 avatar Sep 27 '24 18:09 S1M0N38

@S1M0N38 that was removed on purpose. @Paul-Cornell can you remove that parameter from the docs for us?

scanny avatar Sep 27 '24 19:09 scanny

When a .msg file contains an attachment of an unsupported type (UNK), the partition_msg function halts. I've implemented a custom attachment_partitioner to filter out unsupported types.

Is there another way to process the supported types and ignore the unsupported ones?

S1M0N38 avatar Sep 27 '24 19:09 S1M0N38

@scanny I'm not quite sure how to "remove" that parameter from the docs. Searching for attachment_partitioner across the docs returns four results, some in text, and some in code. I don't have enough context to know exactly what needs to be removed here. Could you please either advise, or create a PR in the docs repo? Thanks!

Paul-Cornell avatar Sep 27 '24 22:09 Paul-Cornell

@S1M0N38 Ahh, okay, so that's a bug then. You shouldn't have to provide a custom partitioner for that :)

Shall we make this into a bug report for that or do you want to open a new one?

The correct behavior would be for partition_msg() to simply skip any attachments it doesn't know how to partition.
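A minimal sketch of that skip-on-unknown behavior, not the actual partition_msg() implementation. The names UnsupportedFileTypeError, partition_attachments, and toy_partitioner are hypothetical and exist only for illustration; the real library may signal an unknown type differently.

```python
from typing import Callable, Iterable, List


class UnsupportedFileTypeError(Exception):
    """Hypothetical stand-in for the error raised on an unknown (UNK) attachment type."""


def partition_attachments(
    attachments: Iterable[bytes],
    partitioner: Callable[[bytes], List[str]],
) -> List[str]:
    """Partition each attachment, skipping any the partitioner cannot handle."""
    elements: List[str] = []
    for attachment in attachments:
        try:
            elements.extend(partitioner(attachment))
        except UnsupportedFileTypeError:
            # Unknown type: skip this attachment and keep processing the rest.
            continue
    return elements


def toy_partitioner(data: bytes) -> List[str]:
    """Toy partitioner (assumption): accepts only payloads starting with b'TXT:'."""
    if not data.startswith(b"TXT:"):
        raise UnsupportedFileTypeError("unknown attachment type")
    return [data[4:].decode("utf-8")]


result = partition_attachments(
    [b"TXT:hello", b"\x00\x01binary", b"TXT:world"], toy_partitioner
)
print(result)
```

Catching only the specific "unsupported type" error, rather than every exception, keeps genuine parsing failures visible while still letting the supported attachments through.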

scanny avatar Sep 28 '24 01:09 scanny

@Paul-Cornell it looks like the text is about the same for partition_email() and partition_msg() on that page:


You can optionally partition e-mail attachments by setting process_attachments=True. If you set process_attachments=True, you’ll also need to pass in a partitioning function to attachment_partitioner. The following is an example of what the workflow looks like:

from unstructured.partition.auto import partition
from unstructured.partition.email import partition_email

filename = "example-docs/eml/fake-email-attachment.eml"
elements = partition_email(
  filename=filename, process_attachments=True, attachment_partitioner=partition
)

You can change it like this in each case:

You can optionally partition e-mail attachments by setting process_attachments=True. The following is an example of what the workflow looks like:

from unstructured.partition.email import partition_email

filename = "example-docs/eml/fake-email-attachment.eml"
elements = partition_email(filename=filename, process_attachments=True)

So:

  • get rid of the middle sentence
  • remove the import of partition
  • remove the attachment_partitioner=partition argument in the partition_email() call.

scanny avatar Sep 28 '24 05:09 scanny

> @S1M0N38 Ahh, okay, so that's a bug then. You shouldn't have to provide a custom partitioner for that :)
>
> Shall we make this into a bug report for that or do you want to open a new one?
>
> The correct behavior would be for partition_msg() to simply skip any attachments it doesn't know how to partition.

Let's make this issue into a bug report and modify the title accordingly.

S1M0N38 avatar Sep 28 '24 07:09 S1M0N38