unstructured icon indicating copy to clipboard operation
unstructured copied to clipboard

Added 'inline' to content disposition check

Open taylorn-ai opened this issue 1 year ago • 13 comments

See discussion here. This adds inline to the allowed list of Content Disposition values when parsing EML files.

taylorn-ai avatar Aug 07 '24 04:08 taylorn-ai

CC @MthwRobinson :)

cragwolfe avatar Aug 07 '24 05:08 cragwolfe

@scanny @cragwolfe - one of the tests is failing but unrelated to my change. I’ll leave it for you guys to fix?

taylorn-ai avatar Aug 07 '24 06:08 taylorn-ai

@TaylorN15 - Is there a test you could add for this?

MthwRobinson avatar Aug 07 '24 14:08 MthwRobinson

@TaylorN15 - Is there a test you could add for this?

I checked and there is already a test EML file with inline and None for content disposition within the example docs that gets run during the test 😃

example-docs/eml/email-inline-content-disposition.eml

taylorn-ai avatar Aug 07 '24 17:08 taylorn-ai

Awesome! Would we have expected that test to fail prior to the fix?

MthwRobinson avatar Aug 07 '24 17:08 MthwRobinson

No, it would have passed except the inline content would not have been returned. I thought that was a bit strange that it was in the test file but not catered for in the code.

¯_(ツ)_/¯

taylorn-ai avatar Aug 07 '24 17:08 taylorn-ai

In that case, could we add a test that asserts that the inline content is included now?

MthwRobinson avatar Aug 07 '24 20:08 MthwRobinson

In that case, could we add a test that asserts that the inline content is included now?

It seems the test already has an assertion for content disposition of inline and None and the test passes in the current and this PR version of the code...

taylorn-ai avatar Aug 07 '24 21:08 taylorn-ai

@TaylorN15 here's what we need: A test that fails before your change and passes afterward.

If the current test can be modified to suit, that's fine. I can see the assertions there are pretty weak, maybe you can add something there.

But what I would do is this:

  • take your original problematic email with the MSIP labels, make a copy of it and anonymize it as necessary but keep all the parts. Add that to example-docs/.
  • write a test that does what you were trying to do and fails when run on main, just like the failure you were seeing, like no text content or whatever it was.
  • add that test in this PR and demonstrate that it passes (passing CI is this demonstration).

Basic idea is:

  • if we don't have a test that fails when we take out your change, how do we know your change fixed it?
  • what will defend against regression if no test fails when your change is accidentally removed?
  • Also it gives us a basis for reasoning about whether your change fixed the general-case problem or only an isolated use case.

scanny avatar Aug 07 '24 23:08 scanny

@scanny I understand, I'm just not so great at writing tests :) I need to work out what the difference is between the EML files to understand why one works when the other does not. I am guessing it's something to do with setting process_attachments as I don't do that with my version, but the code that processes attachments includes content disposition of inline and attachment

taylorn-ai avatar Aug 08 '24 00:08 taylorn-ai

@TaylorN15 why don't you post an anonymized version of the problematic message and I'll take a look. I expect I can whip up an appropriate test in a minute or two :)

scanny avatar Aug 08 '24 00:08 scanny

@TaylorN15 why don't you post an anonymized version of the problematic message and I'll take a look. I expect I can whip up an appropriate test in a minute or two :)

The original email I linked should suffice. I have just confirmed though that the issue is whether process_attachments is set or not. If true, all the content with a content disposition (including inline) is processed, however in my application I am not processing attachments which is how I first discovered this. The text in my email with the content disposition of inline isn't an attachment though, so I think the assumption that all message body/text has no content disposition is false.

taylorn-ai avatar Aug 08 '24 00:08 taylorn-ai

k, this test should be plenty:

def test_partition_reads_message_part_with_inline_content_disposition():
    elements = partition_email(
        example_doc_path("eml/text-part-marked-inline.eml"), process_attachments=False
    )

    assert len(elements) == 1
    e = elements[0]
    assert e.text.startswith("Hi   Please find attached a project proposal.")
    assert e.text.endswith("Kind regards User  ")

Just name your original file text-part-marked-inline.eml and add it to example-docs/eml/.

You can add this test right after the one test_partition_email_inline_content_disposition().

scanny avatar Aug 08 '24 02:08 scanny