Create new Python Bug to Header Parsing Issue

Open malvidin opened this issue 4 years ago • 7 comments

In test_headeremail2list_2, it mentions Python bug 27257. However, bug 27257 appears to be about empty groups in the header, not about problems with the obsolete period. With Python 3.7, I do not have any issues with the decoded value, unless eml_parser is expected to include address groups.

https://github.com/GOVCERT-LU/eml_parser/blob/f98980a77d9c7d914d97525a62294075c0ce42d9/tests/test_emlparser.py#L131

From the bug:

To: unlisted-recipients: ;, ""@pop.kundenserver.de (no To-header on input)

The current output below appears to be the expected output.

'to': ['@pop.kundenserver.de']

From the RFC:

To: A Group:Ed Jones [email protected],[email protected],John [email protected];

Again, the current output below appears to be the expected output.

'to': ['[email protected]', '[email protected]', '[email protected]']

I have not found a related issue in the Python bug tracker, but perhaps something like the following in _header_value_parser.py would be appropriate to prevent the exception:

malvidin avatar Feb 26 '20 18:02 malvidin
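For reference, here is a minimal sketch of how the default policy handles group syntax like the RFC example quoted above. The example.com addresses are placeholders chosen for illustration; they are not the addresses from the RFC or from the repository's sample.

```python
import email
import email.policy

# Placeholder addresses (assumptions for illustration only).
raw = (
    "To: A Group:Ed Jones <ed@example.com>,joe@example.com,John <jdoe@example.com>;\n"
    "\n"
    "body\n"
)
msg = email.message_from_string(raw, policy=email.policy.default)
to_header = msg["to"]

# .addresses flattens every group into its individual mailboxes.
flat_addresses = [addr.addr_spec for addr in to_header.addresses]
# .groups preserves the group structure, including the group's display name.
group_names = [group.display_name for group in to_header.groups]

print(flat_addresses)
print(group_names)
```

If eml_parser were meant to report groups as well, the `groups` attribute would be the place to look; the flattened list matches the 'to' output shown above.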

Thanks for your analysis. I agree that 27257 does not seem to be related. I unfortunately don't recall this exactly, but I probably meant another one instead.

Regarding the workaround, this is still necessary though, same on 3.7 as on 3.8. I just retested it with the problematic sample included in the samples folder of this repo.

Regarding your suggestion, _header_value_parser is private, so I can't include that one. I haven't tested it, but from looking at that function I don't think it would solve the issue I am trying to work around, "Test.[email protected]". Did you test this? Would you be interested in making a pull request?

sim0nx avatar Feb 28 '20 08:02 sim0nx
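As an illustration of the kind of workaround under discussion, the sketch below tries the structured header parser first and falls back to the legacy `email.utils.getaddresses` parser if it raises. The helper name `addresses_with_fallback` and the example.com address are hypothetical; this is not eml_parser's actual code.

```python
import email
import email.policy
import email.utils


def addresses_with_fallback(msg, header_name):
    """Return addr-specs for a header, using the legacy parser if parsing fails.

    Hypothetical helper for illustration; not taken from eml_parser.
    """
    try:
        values = msg.get_all(header_name, [])
        return [addr.addr_spec for value in values for addr in value.addresses]
    except Exception:
        # Structured parsing failed, so read the unparsed header text instead.
        wanted = header_name.lower()
        raw_values = [v for k, v in msg.raw_items() if k.lower() == wanted]
        return [addr for _name, addr in email.utils.getaddresses(raw_values) if addr]


# Placeholder address; the display name ends in a period, as in the report.
sample = email.message_from_string(
    "From: John Doe.<jd@example.com>\n\nbody\n", policy=email.policy.default
)
from_addresses = addresses_with_fallback(sample, "from")
print(from_addresses)
```

Either code path should yield the same address here, which is why a fallback of this shape can stay in place across Python versions.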

With the modification to the Python 3.7 email._header_value_parser.py, the following is my output. This causes test_headeremail2list_2 to fail, as intended, because the default Python header parser succeeds.

I created pull request 18687 to address this issue. https://github.com/python/cpython/pull/18687

>>> msg_test = email.message_from_string("""From:         John Doe.<[email protected]>

Test e-mail. with a https://www.google.com:5000?test
""", policy=email.policy.default)

>>> msg_from = msg_test.get_all('from')
>>> print(msg_from[0].addresses[0].display_name, msg_from[0].addresses[0].addr_spec)
John Doe. [email protected]

>>> print(json.dumps(eml_parser.eml_parser.parse_email(msg_test), indent=2, default=json_serial))
{
  "body": [
    {
      "uri_hash": [
        "ac6bb669e40e44a8d9f8f0c94dfc63734049dcf6219aac77f02edf94b9162c09"
      ],
      "content_header": {},
      "hash": "a46645c9d7598af7036fc173380b1bce4fe6a4e16313523e29e31cbee6eec6e2"
    }
  ],
  "header": {
    "subject": "",
    "from": "[email protected]",
    "to": [],
    "date": "1970-01-01T00:00:00+00:00",
    "received": [],
    "header": {
      "from": [
        "\"John Doe.\" <[email protected]>"
      ]
    }
  }
}

malvidin avatar Feb 28 '20 17:02 malvidin

> With the modification to the Python 3.7 email._header_value_parser.py, the following is my output. This causes test_headeremail2list_2 to fail, as intended, because the default Python header parser succeeds.
>
> I created pull request 18687 to address this issue. python/cpython#18687

Great! Thank you!

sim0nx avatar Mar 03 '20 10:03 sim0nx

This appears to be related to the Python issue linked below. The pull request I made addresses only one case; I'll look at addressing the other later this week. https://bugs.python.org/issue30988

malvidin avatar Mar 03 '20 13:03 malvidin

This pull addresses the issue more completely, so I closed my pull request. https://github.com/python/cpython/pull/15600

The following can be used to apply the changes from that pull request to email._header_value_parser at runtime:

import inspect
import email
import email.policy

display_name_source = inspect.getsource(email._header_value_parser)
header_parser_15600 = [
    ("if res[0][0].token_type == 'cfws':", 
     "if isinstance(res[0], TokenList) and res[0][0].token_type == 'cfws':"),
    ("if res[-1][-1].token_type == 'cfws':", 
     "if isinstance(res[-1], TokenList) and res[-1][-1].token_type == 'cfws':"),
    ('''
        if leader is not None:
            token[0][:0] = [leader]
            leader = None
        name_addr.append(token)
''', '''
        if leader is not None:
            if isinstance(token[0], TokenList):
                token[0][:0] = [leader]
            else:
                token[:0] = [leader]
            leader = None
        name_addr.append(token)
''')
]

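# Apply each (original, patched) snippet pair to the module source.
display_name_source_new = display_name_source
for prev, fix in header_parser_15600:
    display_name_source_new = display_name_source_new.replace(prev, fix)

# Re-execute the patched source in the module's own namespace so the
# fixed functions replace the originals.
exec(display_name_source_new, email._header_value_parser.__dict__)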


# Calling items() forces every header to be parsed.
print(email.message_from_string("""From:         John Doe.<[email protected]>
To: . Doe <[email protected]>

Test e-mail body.
""", policy=email.policy.default).items())

malvidin avatar Mar 04 '20 10:03 malvidin

An upstream fix should have been deployed by now; I'll try to find time to check this week.

https://github.com/python/cpython/pull/15600

malvidin avatar Apr 29 '24 21:04 malvidin

It is fixed in Python 3.13.0b1, so it should make it into Python 3.13 this fall.

tests\test_emlparser.py:250 (TestEMLParser.test_headeremail2list_2)
self = <tests.test_emlparser.TestEMLParser object at 0x00000000050F50F0>

    def test_headeremail2list_2(self) -> None:
        """Here we test the headeremail2list function using an input which should trigger
        a email library bug 27257
        """
        with pathlib.Path(samples_dir, 'sample_bug27257.eml').open('rb') as fhdl:
            raw_email = fhdl.read()
    
        msg = email.message_from_bytes(raw_email, policy=email.policy.default)
    
        # just to be sure we still hit bug 27257 (else there is no more need for the workaround)
>       with pytest.raises(AttributeError):
E       Failed: DID NOT RAISE <class 'AttributeError'>

test_emlparser.py:261: Failed

malvidin avatar May 15 '24 10:05 malvidin
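Since the test above now fails on fixed interpreters, one option is to probe the parser at runtime instead of asserting that the exception is raised. This is only a sketch: the function name and the example.com address are hypothetical, and it checks just the trailing-period case discussed in this thread.

```python
import email
import email.policy


def header_parser_raises_on_trailing_period():
    """Return True if a display name ending in '.' still makes the parser raise."""
    probe = email.message_from_string(
        "From: John Doe.<jd@example.com>\n\nbody\n", policy=email.policy.default
    )
    try:
        # Accessing the header triggers the structured parse.
        probe["from"].addresses
    except Exception:
        return True
    return False


needs_workaround = header_parser_raises_on_trailing_period()
print(needs_workaround)
```

A test could then skip the `pytest.raises(AttributeError)` check when `needs_workaround` is false rather than failing outright.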