wordpress-develop icon indicating copy to clipboard operation
wordpress-develop copied to clipboard

Support rfc6530

Open arnt opened this issue 2 years ago • 10 comments

This adds support for unicode email addresses in is_email and sanitise_email. It is intended to be sufficient to accept unicode email addresses in contact forms etc.,

it does not aspire to unicode support everywhere, in particular it doesn't deal with unicode in logins or passwords.

It also does not add a tests/formatting/sanitizeEmail.php; I wouldn't mind adding one, but don't quite understand your testing policies. It looks like you want to keep the tests fast by avoiding unnecessary tests, I'd say, but I don't understand what you consider necessary and unnecessary. Anyway, if you want a file I'm happy to add one.

Trac ticket: https://core.trac.wordpress.org/ticket/31992

arnt avatar Sep 18 '23 07:09 arnt

I don't understand the test failures. They appear to be unrelated to these code changes. Could you take a look, @ironprogrammer ? Thanks.

arnt avatar Sep 22 '23 10:09 arnt

In response to @arnt:

It also does not add a tests/formatting/sanitizeEmail.php; I wouldn't mind adding one...

Absolutely, please do add unit tests (or updates to existing tests) for sanitize_email(). While Core test code coverage is pretty low today 😓, it is a project policy that new Core merges include appropriate tests.

I don't understand the test failures. They appear to be unrelated to these code changes. Could you take a look, @ironprogrammer ? Thanks.

It's not uncommon for E2E test timeouts to occur in this repo's automated jobs. These can be ignored.

ironprogrammer avatar Sep 27 '23 00:09 ironprogrammer

Thanks for the updated PR! The updates to sanitize_email() appear to have resolved the issue with storing the correct email in the database 🎉.

This isn't a formal test report (there are lots of scenarios needed), but I ran through a few smoke tests so as to get back to you quickly on what could be addressed next. It would be great to get additional reviewers/testers involved to cover what I've missed:

  • ✅ Add New User: email with unicode characters is accepted and saved to the database correctly.
  • ✅ Profile (edit user): email can be changed with unicode characters, and when saved, persists correctly.
  • ❓ Do new user creation, email change confirmation, or password change emails go to the correct address?
  • ❓ Do links in these emails work correctly?
  • ❓ Does the forgotten password workflow work correctly with a unicode address?
  • ✅ Login: logging in with a unicode email works as expected.
  • ✅ Add post comment: email with unicode is accepted for post comments.
  • ❌ Comments (approval list): emails with unicode are not displayed correctly (Figure 1).
  • ✅ In the Comments list, the Email field under Quick Edit does appear correctly.
  • ❓ Site setup wizard: can the initial admin use a unicode address, and do the emails work correctly?
  • ❓ Do various author dropdowns display unicode addresses as expected?
  • ❓ Are unicode emails in feeds presented accurately?

Figure 1: Display of commenter unicode email addresses. comments approval screen with incorrectly displayed unicode email addresses

ironprogrammer avatar Sep 27 '23 00:09 ironprogrammer

Sigh. I see what you mean.

My own focus is/was quite different: I wanted Contact Form 7 and similar to accept unicode email addresses, which requires some utility functions in Wordpress to do the right thing. I didn't aspire to complete support in Wordpress. The thinking behind your response seems to be that either Wordpress supports it properly or not at all, there's no inbetween.

And, sadly, I think you're right and I was wrong.

I'll extend the PR. It'll take a few weeks. I'll also find some testers.

arnt avatar Sep 27 '23 07:09 arnt

Hi,

you asked for wider testing. Around 20 sites have now run the patch in production, the longest-running one for more than six months already. Nothing has broken. The sites are real (University of Somewhere, not someone's private five-visitor site).

If you want, I can send you the email addresses of some testers. When I pinged them last week, around half said it was okay to give you their addresses, I'm sure most will say the same if I ask a few times ;)

arnt avatar Sep 16 '24 06:09 arnt

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Unlinked Accounts

The following contributors have not linked their GitHub and WordPress.org accounts: @peteresnick.

Contributors, please read how to link your accounts to ensure your work is properly credited in WordPress releases.

Core Committers: Use this line as a base for the props when committing in SVN:

Props agulbra, justlevine, sirlouen, mukesh27, tusharbharti, dmsnell, ironprogrammer, sergeybiryukov, mdawaffe.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

github-actions[bot] avatar Sep 16 '24 06:09 github-actions[bot]

I liked the accounts earlier today.

arnt avatar Sep 16 '24 07:09 arnt

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

github-actions[bot] avatar Sep 23 '24 12:09 github-actions[bot]

Thanks for the update, @arnt, and for seeking wider testing! There are still some outstanding questions and one previous failure to consider, but it's great to see this moving forward! 🙌🏻

A good next step would be to rebase this PR on the latest trunk, and to resolve the coding standards issues to tidy things up for an updated review and testing. When the CI checks are in a happy state, then it will help attract more attention.

ironprogrammer avatar Sep 23 '24 16:09 ironprogrammer

Finding testers who were willing and able to run a patch on sites that mattered to them was hard work. That took a good long while. Do you want their names? If so, please send me mail.

Rebasing is not a problem, of course. The other stuff was a little difficult. I only recently found a good workaround for IntlChar's lack of support for UAX24.

Can you suggest a model end-to-end test that I could emulate and write automatic tests for the desired features? Manual testing is possible, but I'd much prefer to have complete automatic test coverage.

arnt avatar Sep 23 '24 19:09 arnt

Hi,

the remaining problems, as far as I can tell:

  1. The code now requires PHP 7.4, which I thought was OK but now I see that 7.4 is merely recommended, not required. Oops. Not clear to me how to do mb_str_split in 7.2.
  2. I've clicked around, but don't feel sure that I have all of the "various dropdowns". The filters in default-filters.php are okay, but I don't feel confident that every third-party plugin will be okay, or that I've looked at everything in the UI. Wordpress is large, there are nooks and corners…

arnt avatar Dec 03 '24 14:12 arnt

I added some code to prevent PHP 7.2.24-7.3.x from choking on mb_str_split().

A bit on testing:

  • ✅ Add New User: email with unicode characters is accepted and saved to the database correctly.
  • ✅ Profile (edit user): email can be changed with unicode characters, and when saved, persists correctly.
  • ✅ New user creation, email change confirmation, and password change emails go to the correct address.
  • ✅ Links in all messages work correctly AFAICT (plugins can add new kinds of course…).
  • ✅ The forgotten password workflow works correctly with a unicode address.
  • ✅ Login: logging in with a unicode email works as expected.
  • ✅ Add post comment: email with unicode is accepted for post comments.
  • ✅ Comments (approval list): emails with unicode are displayed correctly.
  • ✅ In the Comments list, the Email field under Quick Edit does appear correctly.
  • ✅ WordPress works on an IPv6-only server (oh, you weren't wondering?)

And… oh drat… Must retest the site setup wizard, too much code has changed.

Two questions remain.

  1. What author dropdowns display email addresses? I can't find anything. Lots of user names displayed but not email addresses. Maybe I misunderstand. I checked for usage of email in the code that reads the database and use of that code in display code.
  2. How can email addresses ever appear in feeds? The feed generators don't seem to reveal authors etc.

arnt avatar Dec 04 '24 09:12 arnt

Just set up a new site from scratch using 'rød@grå.org' as admin address; it works.

arnt avatar Dec 04 '24 10:12 arnt

Hi @ironprogrammer,

would you mind having a look?

Also, on an unrelated matter, I'm with you. I've worked as a full-time maintainer on a widely used open source project and got a lot of flak. It hurt. Some people can be really shitty. Don't let it grind you down.

arnt avatar Jan 03 '25 11:01 arnt

Thanks a lot, @arnt, and apologies for not being involved with this the past several months 🙏🏻. I've got some other work priorities at this time, but hope to be able to dive back in soon!

In the meantime, for code review purposes, I would suggest raising this Trac ticket during #core Dev Chat, or dropping in a comment on the next agenda post to add it to open floor discussion. This will help spread some awareness of the initiative. If you're available on Slack and sync during that time to answer questions, that would be ideal.

Glad you could wrangle additional testing for the updates! If this info can be made public through test reports in Trac, that would be amazing -- and also prerequisite to merging as Core committers begin taking a look at the ticket.

You could get additional visibility by sharing the ticket in the #core-test channel. There may also be contributors willing to help put test instructions together based on what you've got so far. This update covers a lot of territory, so the more detailed the tests for each use case, the better.

ironprogrammer avatar Jan 13 '25 22:01 ironprogrammer

Hi @arnt, I was curious if this PR would also handle things related to rfc 5322. As I have a port of this ready in JavaScript ( Initially for @wordpress/url's isEmail ).

If we are fine with this, I can help with that in form of reviews. Regards.

USERSATOSHI avatar Jul 21 '25 18:07 USERSATOSHI

I know 5322 very well, and also its replacement and am not aware of any related problems in Wordpress or PHPMailer. (My name is in both documents.) I'd be very happy to fix any related problems, if you'll point out anything specific.

I'm also more than happy to look at your javascript code.

(CORRECTION: No, my name isn't in 5322bis. The editor removed the names, the list grew too long.)

arnt avatar Jul 21 '25 21:07 arnt

I know 5322 very well, and also its replacement and am not aware of any related problems in Wordpress or PHPMailer. (My name is in both documents.) I'd be very happy to fix any related problems, if you'll point out anything specific.

(CORRECTION: No, my name isn't in 5322bis. The editor removed the names, the list grew too long.)

Ah my bad then, I was just doing tests using the current patch and saw some failing tests for comments, quotes and short domains ( I think that's what they called ? ), hence I wanted to add some improvements on them. 😄

I'm also more than happy to look at your javascript code.

Sure! here is the code Link

main focus points would be: #L161, #L175, #L189-L241

and then I guess supporting ip addresses.

This was what I was able to support from my JavaScript version

Screenshot

Current Patch Converage

image

My isEmail Converage

image

edit: updated code link cause file changed from JavaScript to TypeScript

cc: @arnt

USERSATOSHI avatar Jul 21 '25 22:07 USERSATOSHI

That's admirably thorough testing!

By coinidence, Pete Resnick (the author of RFC 5322) and I were in the same room when I saw your table. We went over the cases together and agree about most of it.

First, a bit about the email address syntaxes. There are four:

  • 5322 defines a very permissive syntax for interacting with legacy software (to be used e.g. when you parse old mail archives)
  • 5322 defines a stricter syntax for generating mail today
  • 5321 defines an even stricter syntax for addresses used with SMTP
  • HTML defines an even stricter syntax for using addresses in HTML forms

In general it's best to use addresses that meet the requirements of all four. Addresses that are valid according to only some rulesets tend to cause problems for users.

Now, to the addresses, with my opinions for how to handle each:

  • [email protected] and [email protected] are invalid according to RFC 5321, therefore Wordpress cannot send mail to them. Rejecting these is a good idea.

  • (to)[email protected] and [email protected](to) will be mangled during transmission by 5321-compliant software. Both addresses will arrive as [email protected]. Rejecting these is a good idea (while allowing [email protected] of course).

  • cow@[dead::beef], 1@23456789 and a few others use address literals, which have been used for attacks. The same attacks apply to Wordpress (you could use this to probe for IP addresses that the Wordpress host can reach, but which are firewalled away from the general network). Rejecting these is a good idea.

  • toto@co and admin@mailserver1 are local addresses, not world-reachable. (They could also be addresses within a TLD, but none of the TLDs allow that right now.) You could ssh to a Wordpress server and run 'ping co', because ping allows a local hostname. However, it's almost certainly bad to let users enter a local address in a Wordpress contact form (an attacker could use that to probe the local networkl). Rejecting these addresses makes sense.

  • 1234567890123456789012345678901234567890123456789012345678901234+x@example.com is longer than one of the RFC limits. Most software supports longer localparts today. However, I surveyed a corpus of about 500,000 addresses, and only software used longer localparts, no users. issue-1234567890123456789012345678901234567890123456789012345678901234@support.example.com is a good example. If Wordpress allows longer localparts, bots will use that to spam the contact forms, and it will not help people. My opinion is that it's probably best to reject these. Pete doesn't have an opinion on this address.

  • user@[IPv6:2001:db8::1]:8080 passes, and I think it should not. It's not deliverable via SMTP on the public internet. However, it's also not a unicode matter. I'm uncertain whether this PR should remove support for it.

Thanks for this testing! I'll add some of the addresses to the unit tests you used to the unit tests and push a new revision. I may change the handling of the :8080 address too.

arnt avatar Jul 22 '25 10:07 arnt

Thanks for sharing such a detailed explanation!.

(to)[email protected] and [email protected](to) will be mangled during transmission by 5321-compliant software. Both addresses will arrive as [email protected]. Rejecting these is a good idea (while allowing [email protected] of course).

O.O, I did think that () being treated as comment would cause issue like same address on arrival, but I thought as we are only doing looks like an email ( actual description for functionality of isEmail), accepting them would have been good Idea, but yeah rejecting them would make a lot of sense from a practical point of view.

cow@[dead::beef], 1@23456789 and a few others use address literals, which have been used for attacks. The same attacks apply to Wordpress (you could use this to probe for IP addresses that the Wordpress host can reach, but which are firewalled away from the general network). Rejecting these is a good idea.

toto@co and admin@mailserver1 are local addresses, not world-reachable. (They could also be addresses within a TLD, but none of the TLDs allow that right now.) You could ssh to a Wordpress server and run 'ping co', because ping allows a local hostname. However, it's almost certainly bad to let users enter a local address in a Wordpress contact form (an attacker could use that to probe the local networkl). Rejecting these addresses makes sense.

Oh, then that rejecting them make more sense, I was thinking of them as from local perspective like using tailscale vpn which allows shortdomains.

1234567890123456789012345678901234567890123456789012345678901234+x@example.com is longer than one of the RFC limits. Most software supports longer localparts today. However, I surveyed a corpus of about 500,000 addresses, and only software used longer localparts, no users. issue-1234567890123456789012345678901234567890123456789012345678901234@support.example.com is a good example. If Wordpress allows longer localparts, bots will use that to spam the contact forms, and it will not help people. My opinion is that it's probably best to reject these. Pete doesn't have an opinion on this address.

From what i have seen while searching and going through rfc 5321 & 2821, 64 seems to be the size that the local part of email address are allowed and I think that's still a bit far for from practically so ig sticking to 64 size looks better to me.

user@[IPv6:2001:db8::1]:8080 passes, and I think it should not. It's not deliverable via SMTP on the public internet. However, it's also not a unicode matter. I'm uncertain whether this PR should remove support for it.

Uh bit confused on this one, is this supposed to be valid email address as you said It's not deliverable via SMTP on the public internet. and right now both implementation is counting it as invalid.


Thanks for sharing the details again, I will update my implementation to reflect these. 😃

USERSATOSHI avatar Jul 22 '25 10:07 USERSATOSHI

I was confused about user@[IPv6:2001:db8::1]:8080. Looked at the table, switched tabs, wrote several paragraphs and by then I had forgotten what the table line says. That address is handled correctly IMO.

I'm not sure whether I get to do anything more on this today. But those addresses definitely need to go into a unit test, with comments as above.

arnt avatar Jul 22 '25 10:07 arnt

Just two quick comments:

Also invalid according to 5322 section 3 (the more restrictive) syntax.

While OK according to 5322, you'd have to strip the comments before sending to an SMTP (5321) server, and I suspect software that is getting these address from a WordPress form does not do that all the time, so I would avoid.

peteresnick avatar Jul 22 '25 10:07 peteresnick

Hi so I was checking on how to support quoted local emails and IPs on php side and updated my tests to follow the suggestions that were made during the discussion.

Did some updates on the patch and was able to make it work. If it is fine to support quoted local and IPs then I can add review comments.

image

PS: updated the table format so it is easier to read now 😅

Edit: Reading a bit more on WHATWG syntax not allowing quoted local-parts & IPv6 addresses, I am not sure anymore for supporting quoted locals and IP addresses. would need a feedback on this one.

cc: @arnt

USERSATOSHI avatar Jul 23 '25 19:07 USERSATOSHI

@arnt I sent you a PR with some improvements. Also good to see that @dmsnell has self-assigned this. It will get through soon.

SirLouen avatar Nov 05 '25 22:11 SirLouen

I'm grateful for the nitpicks. Seriously. I also wish this could be shipped; that would make life simpler for me.

I'd like to have a talk about growing/shrinking it. I'll keep an eye on Slack tomorrow. Have several meetings though, including one which I chair, so I might be pressed for time, depending on your timezone.

I decided not to rebase/push now. @SirLouen's comment change clearly should go in but it's minor, and I'd rather like to improve that cast but don't see how to write a unit test. I think we can continue talking without a new push.

arnt avatar Nov 07 '25 00:11 arnt

I decided not to rebase/push now. @SirLouen's comment change clearly should go in but it's minor, and I'd rather like to improve that cast but don't see how to write a unit test. I think we can continue talking without a new push.

Btw, also while I was reviewing this patch, I sent you a little PR with some little extra formatting tweaks for the tests part. Nothing big, just improving CS a bit and setting a wider group to test this ticket at once.

SirLouen avatar Nov 07 '25 00:11 SirLouen

@SirLouen I'll go and have some food now, and merge your PR either when I return or tomorrow morning. I really don't want to break anything due to jetlag and lack of focus.

arnt avatar Nov 07 '25 01:11 arnt

jetlag

having just jumped ahead by nine hours, I feel something similar @arnt — hope you recover quickly.


here are some things I reviewed yesterday:

  • WordPress appears to call sanitize_user() everywhere before detecting duplicate usernames, so changing that function should not accidentally allow duplicating user logins.
  • the schema.php indicates a 60-character login and 100-character email. for utf8mb4 configurations that means people will still be able to enter emails with “100 characters”.

I was left with a few more questions in general though:

  • for what reason are we adjusting sanitize_user() if this is about Unicode email addresses? I wouldn’t assume we have to do anything to the user objects.
  • what should we be doing if the database table cannot store the email addresses? for instance, if a table is configured as latin1 we have a problem. while that allows us to store the raw bytes just fine, unless we ensure that PHP and MySQL agree on what’s transferring through the query we will end up with corruption.
  • even if we ensure that we store non-ASCII characters properly, we end up with a length restriction for some email addresses based on the fact that VARCHAR(100) means 100 bytes for a latin1 table but up to 400 bytes for a utf8mb4 table. this is something to consider otherwise people might end up with some confusing error messages when submitting a valid email that looks like it’s shorter than WordPress and the database count it.
CREATE TABLE test_l1 (t varchar(100) character set latin1 collate latin1_swedish_ci) DEFAULT CHARSET=latin1;
$wpdb->query( "INSERT INTO test_latin1 (t) VALUES ('a\u{1f170}b\u{1f171}')" );
$wpdb->query( "INSERT INTO test_latin1 (t) VALUES (_latin1'a\u{1f170}b\u{1f171}')" );
MariaDB [wordpress]> SELECT * FROM test_l1;
+----------------------+
| t                    |
+----------------------+
| a?b?                 |
| a🅰b🅱           |
+----------------------+
2 rows in set (0.000 sec)

In the first case, we sent UTF-8 data to the database, which knows that the table is storing latin1, so it transparently re-encoded from UTF-8 to latin1, replacing unrepresentable characters as ?. In the second case we told the database that the UTF-8 bytes should be explicitly interpreted as latin1 and so it did store the raw bytes, however…

php > var_dump( $wpdb->get_col( 'SELECT t FROM test_l1' ) );
array(2) {
  [0]=>
  string(4) "a?b?"
  [1]=>
  string(20) "a🅰b🅱"
}
php > var_dump( $wpdb->get_col( 'SELECT CAST(t as binary) FROM test_l1' ) );
array(2) {
  [0]=>
  string(4) "a?b?"
  [1]=>
  string(10) "a🅰b🅱"
}

…unless we also do something on the read query side then the database will perform the same transparent re-encoding on read turning those UTF-8 bytes back into UTF-8 and thus double-encoding them.

This seems problematic to me in a new way because after this change we are allowing people to create accounts with Unicode login names and emails, but these database queries won’t fail; they will corrupt the data and someone might lose access to the account they just created, or to an account for which they just changed their email address. If we want to ensure this doesn’t happen we would need to trace the data through all places that access these and ensure we properly manage the character encoding in the SQL session, something that doesn’t tend to go well in WordPress due to the distributed nature of code contributions and plugins.

To me this highlights the need for deprecating non-UTF-8 support (Core-62172), though I guarantee we will be doing whatever we need to to support Unicode emails before we do that.


talk about growing/shrinking it

As named this PR seems very grand in purpose: what does it even mean to support RFC 6530? I feel like as a Core tracking ticket that is appropriate, but the work is likely to involve many independent PRs and changes, including some big semantic changes around usernames and emails, possibly involving some database schema changes.

For now, however, if we assume that this PR should move forward, here are some things I propose, pending further security review:

  • Drop the detection for cross-script spoofing. I would love to see some evidence that the risk is substantial enough that it’s worth the complexity, but I suspect that having rules which change based on the content of the input is going to be confusing for users and developers.
  • Missing from all of the changes are checks that the input is valid UTF-8. If we start by rejecting invalid UTF-8 then we can make a lot of assumptions and drop safety-checks without increasing risk. From my reading of the RFC I don’t see any allowance for addresses with invalid UTF-8.
  • Maintain existing behavior in the antispambot function. That is, let’s not skip non-ASCII characters just because we don’t see them as being present as threats. I think we can probably deprecate that function entirely and skip this, but we will create another discussion for that.

Further, I think we need to talk about what email characters are allowable. Our current set of rejected ASCII character includes octets which could be allowable email addresses. If our intention is to be restrictive, say to reject non-printable characters, or ambiguous whitespace, then we probably ought to see if we can replicate that experience for the upper Unicode planes too and not just turn away from those goals because it’s not ASCII.

Please enjoy your conference. There’s no rush; as this work will probably take a lot of cooperation. We’ll bring it up next week in the developer chat.

dmsnell avatar Nov 07 '25 21:11 dmsnell

@arnt

I added some code to prevent PHP 7.2.24-7.3.x from choking on mb_str_split().

At a glance, would a polyfill for mb_str_split() be helpful here, so that we could remove the function_exists() checks and make the behavior consistent across all supported PHP versions?

SergeyBiryukov avatar Nov 12 '25 16:11 SergeyBiryukov

@SergeyBiryukov I would say that a polyfill makes sense if WP is to support PHP 7.2-3 for a while and a polyfill is simple. The support table makes it look as if support might be dropped in WP 7.0?

@dmsnell that's a long and very welcome message. Do you think I should attend to that now, or leave you in peace until 6.9 is released? I'd really like to merge this soon, for several reasons, but I imagine the 6.9 release needs developer focus.

arnt avatar Nov 12 '25 16:11 arnt