Partition_email with process_attachments=True for nested eml

Open S1M0N38 opened this issue 1 year ago • 1 comments

Description When trying to process attachments in partition_email, the following error is raised:

Traceback (most recent call last):
  File "eml_to_elements.py", line 74, in <module>
    process_email(eml_file)
  File ".../scripts/preprocessing/eml_to_elements.py", line 58, in process_email
    elements = partition_email(
               ^^^^^^^^^^^^^^^^
  File ".../unstructured/unstructured/documents/elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File ".../unstructured/unstructured/file_utils/filetype.py", line 731, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File ".../unstructured/unstructured/file_utils/filetype.py", line 687, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File ".../unstructured/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
  File ".../unstructured/unstructured/partition/email.py", line 527, in partition_email
    extract_attachment_info(msg, tmpdir)
  File ".../unstructured/unstructured/partition/email.py", line 232, in extract_attachment_info
    f.write(attachment["payload"])
TypeError: a bytes-like object is required, not 'NoneType'

Reproduce Try to partition an email containing other emails using partition_email, e.g.

import functools
from pathlib import Path

from unstructured.partition.auto import partition
from unstructured.partition.email import partition_email

# Partitioner applied to each attachment found in the email.
attachment_partitioner = functools.partial(
    partition,
    max_partition=None,
    include_headers=True,
    process_attachments=True,
    strategy="hi_res",
    hi_res_model_name="yolox",
)

def process_email(document_path: Path):
    print("document_path", document_path)
    # Raises the TypeError shown above when the email contains a nested email.
    elements = partition_email(
        filename=str(document_path),
        max_partition=None,
        process_attachments=True,
        attachment_partitioner=attachment_partitioner,
    )

Debug The cause of this bug is that get_payload() returns None when the part is multipart and decode=True. The function then tries to save this None value to a file.

https://github.com/Unstructured-IO/unstructured/blob/c0604670182fb6e4b27268fe264fadda3388f06d/unstructured/partition/email.py#L201-L223
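The behavior described above can be reproduced with the standard library alone. The following is a minimal sketch, assuming a synthetic nested email built in memory (the subjects, bodies, and file names are placeholders, not taken from the real messages). It lists, for every part visited by walk(), what get_payload(decode=True) returns; the multipart containers and the message/rfc822 wrapper should come back as None.

```python
from email.message import EmailMessage

# Hypothetical inner email carrying the "real" attachment.
inner = EmailMessage()
inner["Subject"] = "inner"
inner.set_content("inner body")
inner.add_attachment(
    b"%PDF-1.4 placeholder",
    maintype="application",
    subtype="pdf",
    filename="report.pdf",
)

# Outer email that carries the inner email as a message/rfc822 attachment.
outer = EmailMessage()
outer["Subject"] = "outer"
outer.set_content("outer body")
outer.add_attachment(inner, filename="inner.eml")

# Record the content type and decoded payload of every part.
rows = [(part.get_content_type(), part.get_payload(decode=True)) for part in outer.walk()]

for content_type, payload in rows:
    print(f"{content_type:20} payload type: {type(payload).__name__}")
```

Any part whose payload type prints as NoneType would trigger the TypeError if it were passed to f.write().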

Solution Filter out multipart parts. message.walk() will still descend into the inner emails. I think the fix can be as simple as adding

...
        if part.is_multipart():
            continue
...

after line https://github.com/Unstructured-IO/unstructured/blob/c0604670182fb6e4b27268fe264fadda3388f06d/unstructured/partition/email.py#L207
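To illustrate the idea outside the library, here is a rough sketch of an attachment collector that skips multipart parts. The function name collect_attachments and the dictionary layout are hypothetical and are not the actual code in unstructured's email.py; the test email is synthetic.

```python
from email.message import EmailMessage, Message


def collect_attachments(msg: Message) -> list:
    """Hypothetical helper: gather attachments, skipping multipart containers."""
    attachments = []
    for part in msg.walk():
        # Multipart parts (including message/rfc822 wrappers) have no
        # decodable payload; walk() still visits their children.
        if part.is_multipart():
            continue
        if part.get_content_disposition() != "attachment":
            continue
        attachments.append(
            {
                "filename": part.get_filename(),
                "payload": part.get_payload(decode=True),
            }
        )
    return attachments


# Synthetic nested email: outer -> inner email -> PDF attachment.
inner = EmailMessage()
inner.set_content("inner body")
inner.add_attachment(
    b"%PDF-1.4 placeholder",
    maintype="application",
    subtype="pdf",
    filename="report.pdf",
)
outer = EmailMessage()
outer.set_content("outer body")
outer.add_attachment(inner, filename="inner.eml")

found = collect_attachments(outer)
print([item["filename"] for item in found])
```

With this approach the nested email itself is not reported as an attachment; only the leaf attachment inside it is, as if it were attached to the outer email.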

S1M0N38 avatar Sep 07 '24 10:09 S1M0N38

Hi @S1M0N38 can you provide an email that demonstrates this (mis-)behavior?

I'm just finishing up a broad refactor of partition_email() that takes a different approach overall. I expect this problem disappears in the process but I'd like to have a test to defend that behavior against regression.

scanny avatar Oct 03 '24 18:10 scanny

I'm sorry, but the emails I'm working on contain sensitive information. I'm not able to strip out that information, nor am I able to generate a new email.

They appear to be nested e-mail files, i.e. when I try to open the e-mail, one attachment is another e-mail with the actual attachments (e.g. pdf, PDF, txt, …).

S1M0N38 avatar Oct 04 '24 16:10 S1M0N38

How about running this email.iterators._structure() stdlib function on the message to give us a firm view of the MIME-part hierarchy:

import email
import email.iterators

MSG_FILE_PATH = "path/to/email.msg"

with open(MSG_FILE_PATH, "rb") as f:
    msg = email.message_from_binary_file(f)

# _structure() prints the tree itself and returns None, so no print() is needed.
email.iterators._structure(msg)

which should print something like what is shown here: https://docs.python.org/3.12/library/email.iterators.html#email.iterators._structure
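For a message that cannot be shared, the same check can be tried on a synthetic stand-in. This is a small sketch, assuming a hand-built nested email (all names and contents are placeholders); it passes a StringIO buffer through the fp parameter of _structure() so the tree can be captured as a string instead of going straight to stdout.

```python
import io
from email.iterators import _structure
from email.message import EmailMessage

# Synthetic inner email with one PDF attachment.
inner = EmailMessage()
inner.set_content("inner body")
inner.add_attachment(
    b"%PDF-1.4 placeholder",
    maintype="application",
    subtype="pdf",
    filename="report.pdf",
)

# Outer email carrying the inner email as an attachment.
outer = EmailMessage()
outer.set_content("outer body")
outer.add_attachment(inner, filename="inner.eml")

# Capture the MIME tree in a buffer rather than printing it directly.
buffer = io.StringIO()
_structure(outer, fp=buffer)
tree = buffer.getvalue()
print(tree)
```

A message/rfc822 line with a multipart child indented beneath it is the signature of a nested email in this output.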

scanny avatar Oct 04 '24 18:10 scanny

Yeah, that could be an option.

The bug is that the original implementation of the function

part.get_payload(decode=True)

will return None for a multipart section as stated in the Python docs

... If the message is a multipart and the decode flag is True, then None is returned. ...

This raises the error later, when the None payload is written to a file.


My initial solution was to keep using walk() to loop over the attachments and just skip the multipart parts (if part.is_multipart(): continue). This still loops over all the proper attachments, treating them as attached to the "root" email. (It effectively flattens the nested email hierarchy, but that was not a problem for me.)

All of this is just to say: check that the attachment you are trying to partition is a part whose payload is not None.

Right now, I've committed to the old implementation for email parsing, and I cannot re-partition the emails with the new version of unstructured due to project constraints. However, I can provide feedback if I start a new project involving information extraction from email files.

S1M0N38 avatar Oct 05 '24 09:10 S1M0N38

k, np. Closing for now then.

scanny avatar Oct 05 '24 17:10 scanny