bigbang
Invalid timestamps prevent Archive initialization
archive.py handles null dates by dropping them, but not malformed dates.
I got an uncaught exception
pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 100-01-31 12:12:25
when trying to get an archive of a python.org mailing list.
archive.py line that threw it:
self.data['Date'] = pd.to_datetime(self.data['Date'], utc=True)
Workaround: I caught the exception and set Date to None, so entries with malformed date fields are treated the same as entries without a date field (dropped).
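A minimal sketch of what a per-entry version of this workaround could look like. The helper name `parse_date_or_none` and the sample frame are hypothetical and not taken from archive.py; the actual fix may catch the exception at a different level.

```python
import pandas as pd


def parse_date_or_none(value):
    """Parse a single date value, returning None when pandas rejects it."""
    if value is None:
        return None
    try:
        return pd.to_datetime(value, utc=True)
    except ValueError:
        # OutOfBoundsDatetime and ordinary parse errors are both ValueErrors.
        return None


# Hypothetical sample standing in for self.data.
data = pd.DataFrame({
    "Subject": ["valid", "garbage", "missing"],
    "Date": ["2015-01-31 12:12:25", "not a date", None],
})
data["Date"] = data["Date"].apply(parse_date_or_none)

# Malformed and missing dates are now both null, so one dropna removes them.
data = data.dropna(subset=["Date"])
```

Parsing one value at a time is slower than a single vectorized call, so this is only meant to show the behaviour, not to be the preferred implementation.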
Is this issue worth a PR with my fix? Or is the exception preferred so people know the archive has wonky dates?
Thanks so much for catching this!
A PR with your fix would be great! Though you raise a good question about what to do with wonky dates.
I think an ideal solution might be a "justworks" argument that, when set to True, catches exceptions and does something reasonable.
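One way such a flag might look, as a rough sketch. The function name `to_utc_dates` is made up for illustration, and "something reasonable" is assumed here to mean warning the user and turning unparseable values into NaT via the `errors="coerce"` option of `pd.to_datetime`.

```python
import warnings

import pandas as pd


def to_utc_dates(dates, justworks=False):
    """Convert dates to UTC timestamps.

    With justworks=False (the default) a bad date raises, so the user
    learns that the archive has wonky dates. With justworks=True the bad
    values become NaT and a warning is emitted instead.
    """
    try:
        return pd.to_datetime(dates, utc=True)
    except ValueError as exc:
        if not justworks:
            raise
        warnings.warn(f"Some dates could not be parsed: {exc}")
        return pd.to_datetime(dates, utc=True, errors="coerce")
```

Keeping the strict behaviour as the default preserves the exception for people who want to know about bad data, while `justworks=True` gives everyone else a usable archive.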
I'm also running into this problem. I thought we could fix it by setting the errors='coerce' option of pd.to_datetime (which produces NaT for every value that can't be parsed as a datetime), but I'm struggling a bit with my implementation.
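For reference, a small sketch of how the errors='coerce' approach could work on a frame shaped like the archive data. The column names and sample rows are assumptions for illustration; only the `pd.to_datetime` and `dropna` calls are real pandas API.

```python
import pandas as pd

# Hypothetical rows standing in for self.data.
data = pd.DataFrame({
    "Subject": ["valid", "out of range", "garbage", "missing"],
    "Date": [
        "2015-01-31 12:12:25",
        "100-01-31 12:12:25",  # the timestamp from the report above
        "not a date",
        None,
    ],
})

# errors="coerce" replaces values pandas cannot convert with NaT instead of
# raising. Whether the year-100 row counts as unconvertible may depend on
# the pandas version and the timestamp resolution it uses.
data["Date"] = pd.to_datetime(data["Date"], utc=True, errors="coerce")

# NaT counts as missing, so the existing null-date handling can drop it.
cleaned = data.dropna(subset=["Date"])
```

Because NaT is treated as a null value, any code that already drops entries without a date should pick up the coerced entries with no further changes.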