partition_email with process_attachments=True fails for nested eml
Description
When trying to process attachments in partition_email, the following error is raised:
Traceback (most recent call last):
File "eml_to_elements.py", line 74, in <module>
process_email(eml_file)
File ".../scripts/preprocessing/eml_to_elements.py", line 58, in process_email
elements = partition_email(
^^^^^^^^^^^^^^^^
File ".../unstructured/unstructured/documents/elements.py", line 605, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File ".../unstructured/unstructured/file_utils/filetype.py", line 731, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File ".../unstructured/unstructured/file_utils/filetype.py", line 687, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File ".../unstructured/unstructured/chunking/dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File ".../unstructured/unstructured/partition/email.py", line 527, in partition_email
extract_attachment_info(msg, tmpdir)
File ".../unstructured/unstructured/partition/email.py", line 232, in extract_attachment_info
f.write(attachment["payload"])
TypeError: a bytes-like object is required, not 'NoneType'
Reproduce
Try to partition an email containing other emails using partition_email, e.g.
import functools
from pathlib import Path

from unstructured.partition.auto import partition
from unstructured.partition.email import partition_email

# Partitioner applied to each attachment found in the email.
attachment_partitioner = functools.partial(
    partition,
    max_partition=None,
    include_headers=True,
    process_attachments=True,
    strategy="hi_res",
    hi_res_model_name="yolox",
)

def process_email(document_path: Path):
    print("document_path", document_path)
    elements = partition_email(
        filename=str(document_path),
        max_partition=None,
        process_attachments=True,
        attachment_partitioner=attachment_partitioner,
    )
Debug
The cause of this bug is that the method get_payload returns None if the part is multipart and decode=True. The function then tries to save this None value.
https://github.com/Unstructured-IO/unstructured/blob/c0604670182fb6e4b27268fe264fadda3388f06d/unstructured/partition/email.py#L201-L223
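To illustrate the behavior described above, here is a minimal sketch using only the stdlib email package. The two messages are synthetic (built in memory, not taken from the affected emails), and the variable names are made up for this example. It lists the parts for which get_payload(decode=True) returns None and shows that writing None to a binary stream raises a TypeError.

```python
import io
from email.message import EmailMessage

# Synthetic inner email (stands in for the nested .eml attachment).
inner = EmailMessage()
inner["Subject"] = "inner"
inner.set_content("inner body")

# Synthetic outer email that carries the inner email as an attachment.
outer = EmailMessage()
outer["Subject"] = "outer"
outer.set_content("outer body")
outer.add_attachment(inner, filename="inner.eml")

# Content types of the parts whose decoded payload is None.
none_types = [
    part.get_content_type()
    for part in outer.walk()
    if part.get_payload(decode=True) is None
]
print(none_types)

# Writing such a payload to a binary stream fails with a TypeError.
try:
    io.BytesIO().write(None)
    write_error = None
except TypeError as exc:
    write_error = exc
print(repr(write_error))
```

Both container parts (the outer multipart/mixed and the message/rfc822 wrapper around the nested email) report is_multipart() as True, so neither has a decodable payload.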
Solution
Filter out multipart parts. The method message.walk() will still walk inside the inner emails.
I think the fix can be as simple as adding
...
    if part.is_multipart():
        continue
...
after line https://github.com/Unstructured-IO/unstructured/blob/c0604670182fb6e4b27268fe264fadda3388f06d/unstructured/partition/email.py#L207
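A rough sketch of what the proposed filter does, written against the stdlib only. This is not the code in unstructured: save_leaf_attachments is a hypothetical helper, and the nested email is synthetic. It assumes attachments are the parts with a Content-Disposition of "attachment".

```python
import tempfile
from email.message import EmailMessage, Message
from pathlib import Path


def save_leaf_attachments(msg: Message, output_dir: Path) -> list[Path]:
    """Hypothetical helper: write every non-multipart attachment to output_dir."""
    saved = []
    for part in msg.walk():
        # Containers (multipart/*, message/rfc822) have no decodable payload.
        if part.is_multipart():
            continue
        if part.get_content_disposition() != "attachment":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Keep only the base name so a crafted filename cannot escape output_dir.
        name = Path(part.get_filename() or f"attachment-{len(saved)}").name
        target = output_dir / name
        target.write_bytes(payload)
        saved.append(target)
    return saved


# Synthetic nested email: the outer email attaches an inner email with a file.
inner = EmailMessage()
inner["Subject"] = "inner"
inner.set_content("inner body")
inner.add_attachment(
    b"%PDF-placeholder", maintype="application", subtype="pdf", filename="doc.pdf"
)

outer = EmailMessage()
outer["Subject"] = "outer"
outer.set_content("outer body")
outer.add_attachment(inner, filename="inner.eml")

with tempfile.TemporaryDirectory() as tmpdir:
    saved = save_leaf_attachments(outer, Path(tmpdir))
    saved_names = [path.name for path in saved]
    saved_bytes = [path.read_bytes() for path in saved]

print(saved_names)
```

Because walk() descends into the nested message, the file attached to the inner email is still found, while the inner email itself is skipped as a container.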
Hi @S1M0N38 can you provide an email that demonstrates this (mis-)behavior?
I'm just finishing up a broad refactor of partition_email() that takes a different approach overall. I expect this problem disappears in the process but I'd like to have a test to defend that behavior against regression.
I'm sorry, but the emails that I'm working on contain sensitive information. I'm not able to strip out such information, nor am I able to generate a new email.
They appear to be nested e-mail files, i.e. when I try to open the e-mail, one attachment is another e-mail with the actual attachments (e.g. pdf, PDF, txt, …).
How about running this email.iterators._structure() stdlib function on the message to give us a firm view of the MIME-part hierarchy:
import email
import email.iterators

MSG_FILE_PATH = "path/to/email.msg"

with open(MSG_FILE_PATH, "rb") as f:
    msg = email.message_from_binary_file(f)

# _structure() prints the hierarchy itself and returns None.
email.iterators._structure(msg)
which should print something like what is shown here: https://docs.python.org/3.12/library/email.iterators.html#email.iterators._structure
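For reference, a small sketch of what that output could look like for a nested email of the kind described above. The message here is synthetic and only assumed to resemble the real ones; the fp argument of _structure() is used to capture the output in a string.

```python
import io
import email.iterators
from email.message import EmailMessage

# Synthetic inner email with one file attachment.
inner = EmailMessage()
inner["Subject"] = "inner"
inner.set_content("inner body")
inner.add_attachment(
    b"%PDF-placeholder", maintype="application", subtype="pdf", filename="doc.pdf"
)

# Synthetic outer email that attaches the inner email.
outer = EmailMessage()
outer["Subject"] = "outer"
outer.set_content("outer body")
outer.add_attachment(inner, filename="inner.eml")

# Capture the MIME-part hierarchy instead of printing it to stdout.
buffer = io.StringIO()
email.iterators._structure(outer, fp=buffer)
structure = buffer.getvalue()
print(structure)
```

Each level of nesting is indented by four spaces, so the message/rfc822 part and the parts of the email it wraps appear as a sub-tree under the outer multipart/mixed.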
Yeah, that could be an option.
The bug is that the original implementation of the function calls get_payload(decode=True), which returns None for a multipart section, as stated in the Python docs:
... If the message is a multipart and the decode flag is True, then None is returned. ...
This will raise the error later when trying to partition a None payload.
My initial solution was to keep using walk to loop over attachments and just skip the multipart parts (if part.is_multipart(): continue). This still loops over all proper attachments, treating them as attached to the "root" email. (This effectively flattens the nested email hierarchy, but for me that was not a problem.)
All of this is just to say: check that the attachment you are trying to partition is a part with a non-None payload.
Right now, I've committed to the old implementation for email parsing and I cannot re-partition emails with the new version of unstructured due to project constraints. However, I can provide feedback if I start a new project involving information extraction from email files.
k, np. Closing for now then.