bug/partition_msg halts for attachments with UNK type
Currently, the attachment_partitioner is hardcoded to partition in the following file:
https://github.com/Unstructured-IO/unstructured/blob/50d75c47d346413de8a6dbbcf72009871f7ecd56/unstructured/partition/msg.py#L259-L277
However, according to the official documentation, the partition_msg function accepts attachment_partitioner as an argument.
#3605 PR (Draft)
@S1M0N38 that was removed on purpose. @Paul-Cornell can you remove that parameter from the docs for us?
When a .msg file contains an attachment of an unsupported type (UNK), the partition_msg function halts. I've implemented a custom attachment_partitioner to filter out unsupported types.
Is there another way to process the supported types and ignore the unsupported ones?
@scanny I'm not quite sure how to "remove" that parameter from the docs. Searching for attachment_partitioner across the docs returns four results, some in text, and some in code. I don't have enough context to know exactly what needs to be removed here. Could you please either advise, or create a PR in the docs repo? Thanks!
@S1M0N38 Ahh, okay, so that's a bug then. You shouldn't have to provide a custom partitioner for that :)
Shall we make this into a bug report for that or do you want to open a new one?
The correct behavior would be for partition_msg() to simply skip any attachments it doesn't know how to partition.
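To make the intended skip behavior concrete, here is a minimal sketch. It does not use the real internals of unstructured: `UnsupportedAttachmentError`, `partition_supported`, and `toy_partitioner` are hypothetical names invented for illustration, and attachments are modeled as plain filenames.

```python
from typing import Callable, Iterable, List


class UnsupportedAttachmentError(Exception):
    """Hypothetical error raised when no partitioner exists for a file type."""


def partition_supported(
    attachments: Iterable[str],
    partitioner: Callable[[str], List[str]],
) -> List[str]:
    """Partition each attachment, skipping those the partitioner rejects."""
    elements: List[str] = []
    for name in attachments:
        try:
            elements.extend(partitioner(name))
        except UnsupportedAttachmentError:
            # Unknown (UNK) type: skip this attachment and keep processing the rest.
            continue
    return elements


def toy_partitioner(name: str) -> List[str]:
    """Stand-in partitioner that only understands .txt and .pdf names."""
    if not name.endswith((".txt", ".pdf")):
        raise UnsupportedAttachmentError(name)
    return [f"element from {name}"]


if __name__ == "__main__":
    print(partition_supported(["a.txt", "b.bin", "c.pdf"], toy_partitioner))
```

The point of the sketch is only that one unsupported attachment should not stop the remaining attachments from being partitioned; how the library actually detects an unsupported type is not shown here.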
@Paul-Cornell it looks like the text is about the same for partition_email() and partition_msg() on that page:
You can optionally partition e-mail attachments by setting process_attachments=True. If you set process_attachments=True, you’ll also need to pass in a partitioning function to attachment_partitioner. The following is an example of what the workflow looks like:
from unstructured.partition.auto import partition
from unstructured.partition.email import partition_email
filename = "example-docs/eml/fake-email-attachment.eml"
elements = partition_email(
filename=filename, process_attachments=True, attachment_partitioner=partition
)
You can change it like this in each case:
You can optionally partition e-mail attachments by setting process_attachments=True. The following is an example of what the workflow looks like:
from unstructured.partition.email import partition_email
filename = "example-docs/eml/fake-email-attachment.eml"
elements = partition_email(filename=filename, process_attachments=True)
So:
- get rid of the middle sentence
- remove the import of partition
- remove the attachment_partitioner=partition argument in the partition_email() call.
@S1M0N38 Ahh, okay, so that's a bug then. You shouldn't have to provide a custom partitioner for that :)
Shall we make this into a bug report for that or do you want to open a new one?
The correct behavior would be for partition_msg() to simply skip any attachments it doesn't know how to partition.
Let's make this issue into a bug report and modify the title accordingly.