Parsing dates using Windows timezone names and other non-standard timezones
A few of the emails I processed with your library have issues parsing the date field. I try to get the DateTime object with DateHeader::getDateTime, but it gives me null. I assume the library has problems processing invalid or non-abbreviated time zones.
These are the date fields that have issues:
- Wed, 24 Jul 2019 03:05:23 Pacific Daylight Time
- 21 Mar 2020 10:38:02 UT
- Fri, 26 Jul 2019 12:58:55 Pacific Daylight Time
- Thu, 13 Feb 2020 05:36:50 Pacific Standard Time
Hi @Pantalaim0n --
For the most part we're relying on PHP's createFromFormat with either the DateTime::RFC822 or the DateTime::RFC2822 format... actually, looking into this, that seems mostly redundant: the two formats differ only in the year (two digits in RFC822, four in RFC2822).
For reference, here's a link to the RFC. https://www.ietf.org/rfc/rfc2822.txt -- the official format supports only +/- 4 DIGITS, e.g. "+0200"
Anyway, so far we have built a couple of exceptions to cover some small issues (accepting "UT" instead of "UTC", for example).
Looking at the examples you provided though, it looks like there are just two variations. The first looks like RFC822 but incorrectly uses a timezone name. From investigating, that looks like a Windows standard; PHP doesn't seem to support those names and uses a different standard (IANA) for timezone names.
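For illustration, the parsing approach described above could be sketched roughly as follows. This is only a minimal sketch, not the library's actual code: `parseMailDate` is a hypothetical helper name, and both the "UT" rewrite and the fallback format without a day of week are assumptions about how such exceptions might be written.

```php
<?php
// Hypothetical helper for illustration; not part of the library's API.
function parseMailDate(string $value): ?DateTimeImmutable
{
    // Assumed workaround: rewrite a trailing obsolete "UT" zone as "+0000".
    $normalized = preg_replace('/\s+UT$/', ' +0000', trim($value));

    // The leading '!' resets every field that the format does not parse.
    // The first format mirrors DateTime::RFC2822; the second drops the
    // day of week.
    $formats = ['!D, d M Y H:i:s O', '!d M Y H:i:s O'];

    foreach ($formats as $format) {
        $date = DateTimeImmutable::createFromFormat($format, $normalized);
        if ($date !== false) {
            return $date;
        }
    }

    // Nothing matched, e.g. a value ending in "Pacific Daylight Time".
    return null;
}
```

With this sketch, a value ending in a Windows-style name such as "Pacific Daylight Time" still falls through to `null`, which matches the behaviour reported in this issue.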
The 2nd is just missing the day of week. The 2nd one would be easier to support, but I have a few questions:
- What system is producing these date formats? Is it just an erroneous system one individual built, or something produced by a commonly-used mail client or software library?
- And based on that, should we go out of our way to support them?
The issue of course being that we need to have a reasonable balance between supporting widespread formats (even erroneous ones) and not having a hundred conditions to try and support every single implementation of a mail sending function.
It seems that the examples with the full timezone names are all from hotmail.com and live.com accounts, so there's a pattern. The one without the DOW is from a local webshop; I couldn't get much info from the source.
When I was still using my own library, I tried to overcome the timezone name issue by abbreviating the full timezone name before passing it to PHP's datetime, which worked OK. My lib had problems in other areas so I prefer to use yours :)
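The abbreviating idea could look something like the sketch below. It is a guess at the approach, not the code from either library: `abbreviateZoneName` is a hypothetical name, and the regular expression assumes the zone is a run of capitalized words ending in "Time" placed right after the time of day.

```php
<?php
// Hypothetical helper: turns "Pacific Daylight Time" into "PDT".
function abbreviateZoneName(string $value): string
{
    $result = preg_replace_callback(
        // Time of day, then capitalized words ending in "Time".
        '/(\d{2}:\d{2}:\d{2}) ((?:[A-Z][a-z]+ )+Time)$/',
        function (array $m): string {
            $initials = '';
            foreach (explode(' ', $m[2]) as $word) {
                $initials .= $word[0]; // first letter of each word
            }
            return $m[1] . ' ' . $initials;
        },
        $value
    );

    // preg_replace_callback returns null on a regex error; keep the input then.
    return $result ?? $value;
}

// Usage: abbreviate first, then let PHP parse the abbreviation with 'T'.
$raw = 'Wed, 24 Jul 2019 03:05:23 Pacific Daylight Time';
$date = DateTimeImmutable::createFromFormat(
    '!D, d M Y H:i:s T',
    abbreviateZoneName($raw)
);
```

This only helps where the initials happen to form an abbreviation PHP recognizes. Names containing punctuation (for example "W. Europe Standard Time") are left untouched by this pattern, and some initials may be ambiguous, so an explicit lookup table might be the safer choice.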
@Pantalaim0n
That's really interesting. I don't have a live.com address, but aren't they both 'hotmail' nowadays anyway? Logging in to hotmail.com takes me to outlook.live.com. My point is that I don't have an @live.com email address to test with.
Anyway, I tested hotmail.com and it seems correct. This is the Date header I got: Date: Tue, 9 Jun 2020 16:25:26 +0000
Also tested Outlook 2016, and it was the same: Date: Tue, 9 Jun 2020 16:36:59 +0000.
If you know how such an email was generated, could you send me one to zb.github at mailbox.org? Please let me know what client/web-client/system was used as well. I'd also like to see how Thunderbird and other mail clients handle mail like that so I can follow suit.
I see what you're saying about abbreviating those timezone names and using that. I wonder if it covers all cases though.