poetry-core Improve parsing of author information

Instead of relying on regular expressions, this patch leverages Python’s builtin email.utils.parseaddr() functionality to parse an RFC-822-compliant email address string into its name and address parts.

This should also resolve issues with special characters in the name part; see for example:

https://github.com/sdispater/poetry/issues/370
https://github.com/sdispater/poetry/issues/798

This is a followup to python-poetry/poetry PR #1040, as advised by @Secrus’ comment.

If this PR gets accepted, there will be two additional steps:

Refactor poetry.console.commands.init.InitCommand._validate_author() to use the new parse_author() helper, see https://github.com/python-poetry/poetry/blob/master/src/poetry/console/commands/init.py#L441
Remove AUTHOR_REGEX from poetry.core.packages.package.

[X] Added tests for changed code.
[ ] Updated documentation for changed code.

Nov 12 '22 21:11 yggi49

It seems it will be hard (if not impossible) to make this backward compatible, given the existing tests for valid and invalid author names. test_package_authors_valid() creates a few combinations that contradict RFC-822; on the other hand, test_package_author_names_invalid() has a couple of cases that would actually be valid names.

What’s the general concept in Poetry for author and maintainer strings? Judging from the test suite, it appears that a name is mandatory, whereas an email address is optional. Then again, the ValueError raised in poetry.core.packages.package.Package._get_author() explicitly says that author strings “Must be in the format: John Smith <[email protected]>”, indicating that an email address must be provided as well. (The same also goes for poetry.core.packages.package.Package._get_maintainer().)

For reference: According to the PyPA specification (derived from PEP-621), author and maintainer information can be either name-only, email-only, or name and email:

If only name is provided, the value goes in Author or Maintainer as appropriate.

If only email is provided, the value goes in Author-email or Maintainer-email as appropriate.

If both email and name are provided, the value goes in Author-email or Maintainer-email as appropriate, with the format {name} <{email}>.

Multiple values should be separated by commas.

Furthermore, RFC-822-compliance for names is mentioned as well:

The name value MUST be a valid email name (i.e. whatever can be put as a name, before an email, in RFC 822) and not contain commas. The email value MUST be a valid email address.

TL;DR: I think it needs to be decided whether to adopt the PyPA spec (at least in this regard, which would be a breaking change), or if Poetry wants to continue having its own concept of what constitutes valid author/maintainer strings. In the latter case, this PR can probably be closed without further ado.

Nov 14 '22 11:11 yggi49

Hello @yggi49,

thanks for these details :+1: . IMO the goal must be PyPA spec compliance. But maybe in two steps.

Within the current scope of your PR we should switch from the regex implementation to email.utils, but keep the author name mandatory and the email optional. This results in less strict rules for the names - which is backwards compatible. Tests must be adopted accordingly.

In a second PR we can than make name optional as well (unless other maintainers have a veto) and also make sure that name/email goes into the correct metadata fields as described in https://packaging.python.org/en/latest/specifications/declaring-project-metadata/#authors-maintainers.

fin swimmer

Nov 23 '22 08:11 finswimmer

Thank you for providing guidance, @finswimmer. I am going to tweak the logic accordingly (name mandatory, email address optional), and adjust the tests as needed.

Nov 23 '22 09:11 yggi49

I'm not a fan of how the tests are structured in this PR; I would like something much more thorough and incorporated to the existing tests (see the tests/additions in #517).

Nov 23 '22 19:11 neersighted

@finswimmer, I adapted the logic towards PyPA-compliance, with a mandatory name and an optional email address. Check out the test cases for the new parse_author() helper on what is supported and what is disallowed; I hope this matches your expectations. The non-RFC-conformant cases were added due to them being discussed in python-poetry/poetry PR #1040.

@neersighted, my primary goal was to factor out author parsing into a single separate helper method and provide a comprehensive list of test cases for it, to showcase the expected behavior. Initially, I didn’t want to mess around too much with what was there already, but merely adapted the cases of existing tests accordingly. Personally, I think that the tests in test_package.py—test_package_authors_invalid(), test_package_authors_valid(), and test_package_author_names_invalid()—might no longer be needed and could be removed, as everything should already be covered by the tests for parse_author().

Nov 24 '22 08:11 yggi49

Thanks for making progress on that @yggi49 :+1:

I'm still a bit irritated by the behavior of email.utils.parseaddr() in some cases, e.g.:

email.utils.parseaddr("Me [Some Department] <[email protected]>")
('', 'Me')

I would expect ('Me [Some Department] ', '[email protected]').

email.utils.parseaddr("Me <[email protected]")
('Me', '[email protected]')

I would expected either a parsing error (preferred) or ('Me <[email protected], '').

I don't understand why it behaves like this. Expected? Bug? How do we want to handle this?

We could check if we can reconstruct the address that goes into parseaddr, with the returned value and throw an error if not. Bu that would mean, that the example above is really invalid. Where are the specs for this? :thinking:

fin swimmer

Nov 24 '22 08:11 finswimmer

If brackets should be part of the name, the name must be quoted—i.e., it should read "Me [Some Department]" <[email protected]>. The same goes for white space and other special characters: ()<>@,;:\"..

Python’s address parsing is implemented in email._parseaddr.AddrlistClass. I didn’t really dive into it, and neither did I do a deep dive into the RFC, so I cannot tell what the unquoted Me [Some Department] <[email protected]> actually means. (Email address strings can get quite complicated, covering a couple of arcane use cases, and I don’t think it’s worth reinventing the wheel here.)

I did implement a poor-man’s reverse check into the parse_author() helper, though, and calling parse_author("Me [Some Department] <[email protected]>") will raise a ValueError.

Nov 24 '22 10:11 yggi49

It's a rabbit hole ... :rabbit2:

I asked at the PyPA discord server about the behavior of email.utils.parseaddr(). And as you've already find out, quoting (double quotes only!) is the key. They came up with another solution for parsing name and email that do also a validation:

>>> from email.headerregistry import AddressHeader
>>> parsed = AddressHeader.value_parser("Me [Some Department] <[email protected]>")
>>> len(parsed.all_defects) > 0
True

>>> parsed = AddressHeader.value_parser('"Me [Some Department]" <[email protected]>')
>>> parsed.all_defects
[]
>>> parsed.mailboxes[0].addr_spec
'[email protected]'
>>> parsed.mailboxes[0].display_name
'Me [Some Department]'

So if we detect an @ in the address we might go this way to parse out name and email. But ...

I'm not sure if the current supported values for a name would already require to quote the name. If yes, this would be indeed a breaking change if we now say: please quote your name.
Require to quote the name in pyproject.toml is ugly. Maybe it should be the responsible of the backend to quote the name if needed when writing to Author-email metadata. But if the name is not quoted in the pyproject.toml, we need some kind of pre-processing to add some, before we can parse out name and email as described above.

This all leads to the question: Is it all worth it? I don't know and does not have a strong opinion about it. Maybe some of the other maintainers?

Nov 25 '22 08:11 finswimmer

I'm not sure if the current supported values for a name would already require to quote the name. If yes, this would be indeed a breaking change if we now say: please quote your name.

@finswimmer, I already mentioned that this is going to be the case, since the existing regular expression for the name part (?P<name>[- .,\w\d'’\"():&]+) allows several special characters—specifically (),.:", plus space—that the RFC requires to be quoted.

Maybe it’s easier to approach this entire topic from the other side, and make a basic strategic decision for this project first: is PyPA compatiblity the goal, or does Poetry want to keep doing it its own way?

In case of the former, this is inevitably going to introduce a breaking change. Plus, you won’t have to worry whether quoted names in pyproject.toml are ugly or not, since that decision has already been made by someone else, as quoted above.

I don’t think, though, that the breaking change would hurt too much, as the majority of cases probably still could go unquoted. After all, only names that contain special characters require quoting.
If Poetry wants to pursue its own rules, it should then be specified what is supported and what isn’t by defining the test cases for the parse_author() helper, and adapt the implementation accordingly.

Nov 25 '22 16:11 yggi49

Kudos, SonarCloud Quality Gate passed!