incubator-ponymail
Bug: emails allocated to wrong month
Emails which are sent within a few hours of the end of a month may be incorrectly allocated to the subsequent month.
For example, the earliest emails in the following mbox should be in the May 2015 chunk.
https://lists.apache.org/[email protected]:2015-6
The displayed timestamp shows the date 2015-05-31 (assuming that the local TZ is no further east than GMT+1)
These mails were uploaded from a file, if that makes a difference.
It looks like the mails were allocated to a month based on a local timezone at least 2 hours ahead of GMT. It does not make sense to use the local timezone for this. The database should work in UTC only. If necessary, the display can show times using the local timezone, but the underlying data should only be stored in UTC.
I found one place where localtime is used in the backend code:
https://github.com/apache/incubator-ponymail/blob/master/tools/archiver.py#L274
This would probably cause the upload issue.
See also https://issues.apache.org/jira/browse/INFRA-12079
I also found a non-uploaded mail that has been allocated to the wrong month:
https://lists.apache.org/thread.html/79e2e6a0df70efc206e8e0124bd52d0302c52b50775d5aaa2cff108d@1464733997@%3Cuser.commons.apache.org%3E
The date in the email is
Date: Tue, 31 May 2016 22:33:17 +0000
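For illustration, here is a minimal Python sketch of how a Date header can be mapped to an epoch and a month chunk using UTC only. The helper name utc_month_bucket and the "YYYY-M" chunk label (modelled on the "2015-6" suffix in the list URL above) are assumptions made for this example; this is not the code in archiver.py.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def utc_month_bucket(date_header):
    """Return (epoch, 'YYYY-M') for an RFC 2822 Date header, evaluated in UTC.

    Hypothetical helper for illustration only.
    """
    dt = parsedate_to_datetime(date_header)
    if dt.tzinfo is None:
        # A "-0000" zone is parsed as a naive datetime; treat it as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    dt_utc = dt.astimezone(timezone.utc)
    return int(dt_utc.timestamp()), f"{dt_utc.year}-{dt_utc.month}"


epoch, bucket = utc_month_bucket("Tue, 31 May 2016 22:33:17 +0000")
print(epoch, bucket)  # 1464733997 2016-5
```

The epoch computed this way matches the 1464733997 embedded in the thread URL above, and the UTC month is May 2016.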
The archiver code now uses UTC
Example of an early imported mail.
The source [1] has the following date:
Date: Sun, 31 May 2015 22:19:41 -0000
The Permalink page [2] has the following info:
Date: 2015-05-31 23:19 (-0000)
This is clearly wrong, but may be a GUI-only issue [Later: yes, the problem is that the GUI was converting the time to a local time; this has been fixed]
The summary info [3] shows the following:
"mid": "fba1fa838d345c3b30b3db543425419a85ffde5f89ed2278063cf0c6@1433110781@<notifications.commons.apache.org>", "date": "2015/06/01 00:19:41", "epoch": 1433110781,
The epoch value corresponds to 2015-05-31 22:19:41 UTC
So the epoch agrees with the source mail. The date in the mbox record is two hours adrift, and is the reason why the message appears in the wrong month. [Later: this implies that the local TZ on the importing box was 2 hours different from UTC at the time]
[1] https://lists.apache.org/api/source.lua/fba1fa838d345c3b30b3db543425419a85ffde5f89ed2278063cf0c6@1433110781@%3Cnotifications.commons.apache.org%3E
[2] https://lists.apache.org/thread.html/fba1fa838d345c3b30b3db543425419a85ffde5f89ed2278063cf0c6@1433110781@%3Cnotifications.commons.apache.org%3E
[3] https://lists.apache.org/api/thread.lua?id=fba1fa838d345c3b30b3db543425419a85ffde5f89ed2278063cf0c6@1433110781@%3Cnotifications.commons.apache.org%3E
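The two-hour discrepancy described above can be checked with a short Python sketch. It formats the epoch from the summary info once in UTC and once with a fixed +2 hour offset. The offset stands in for whatever local timezone the importing box had, which is an inference from the data, not something recorded in it. The helper name format_epoch is made up for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = 1433110781         # "epoch" value from the summary info [3]
FMT = "%Y/%m/%d %H:%M:%S"  # same layout as the stored "date" field


def format_epoch(epoch, offset_hours=0):
    """Format an epoch using a fixed UTC offset (0 means UTC)."""
    tz = timezone(timedelta(hours=offset_hours))
    return datetime.fromtimestamp(epoch, tz=tz).strftime(FMT)


utc_date = format_epoch(EPOCH)
shifted_date = format_epoch(EPOCH, offset_hours=2)
print(utc_date)      # 2015/05/31 22:19:41
print(shifted_date)  # 2015/06/01 00:19:41
```

Only the +2 hour rendering reproduces the stored "date" value, and it falls in June instead of May.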
AFAIK, you can tell ES a timezone offset to correct this when querying for email.
That won't help, because the TZ which was used for the date field is not included in the string (if it were, this would be a non-issue). Also the TZ used to load the original mails is not the same as the TZ which is used now. Unless one knows the TZ one cannot tell ES what offset to use.
I think the code should use the epoch instead. Hopefully that always used UTC, but that needs to be checked.
However there remains the issue that the date fields in the mbox records use different TZs depending on when they were created. One solution might be to ignore them completely.
It looks as though the problem is fixed in the current code, because importing the same message generates the correct UTC date, i.e. "2015/05/31 22:19:41".
Note: it's not easy to use the 'epoch' field instead of the 'date' field, because the code makes extensive use of the relative date syntax supported by ES, e.g. +1m, -100d etc. This would be hard to match exactly in Lua. Also the problem now only exists for database entries that were created before the code was fixed.
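For the simple case of selecting one whole calendar month, an epoch-based filter could look roughly like the Python sketch below. It does not reproduce the ES relative date syntax (+1m, -100d etc.), so it is not a drop-in replacement. The function names are hypothetical, and the query body assumes that 'epoch' is mapped as a numeric field in the index.

```python
import calendar


def month_epoch_range(year, month):
    """Return (start, end) epoch seconds for a calendar month in UTC.

    start is inclusive, end is exclusive.
    """
    start = calendar.timegm((year, month, 1, 0, 0, 0))
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    end = calendar.timegm((next_year, next_month, 1, 0, 0, 0))
    return start, end


def epoch_range_query(year, month):
    """Build an ES range filter on the 'epoch' field (assumed numeric)."""
    start, end = month_epoch_range(year, month)
    return {"range": {"epoch": {"gte": start, "lt": end}}}


start, end = month_epoch_range(2015, 5)
print(start <= 1433110781 < end)  # True: the example mail falls in May 2015
```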