feedparser
published_parsed is wrong sometimes
In [27]: f['entries'][0]["published"]
Out[27]: u'2016/6/29 15:07:41'
In [28]: f['entries'][0]["published_parsed"]
Out[28]: time.struct_time(tm_year=2016, tm_mon=6, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=153, tm_isdst=0)
In [29]: datetime.datetime.fromtimestamp(time.mktime(f['entries'][0]["published_parsed"]))
Out[29]: datetime.datetime(2016, 6, 1, 0, 0)
As we can see, feedparser parsed "2016/6/29 15:07:41" into datetime.datetime(2016, 6, 1, 0, 0).
After a quick read of the related code, I found that _parse_date_iso8601 was used to parse the date. The problem is that 2016/6/29 15:07:41 is not in ISO 8601 format.
The regex pattern used in _parse_date_iso8601 and the result it returns:
In [20]: m = re.match("(?P<year>\d{4})(T?(?P<hour>\d{2}):(?P<minute>\d{2})(:(?P<second>\d{2}))?(\.(?P<fracsecond>\d+))?(?P<tz>[+-](?P<tzhour>\d{2})(:(?P<tzmin>\d{2}))?|Z)?)?", '2016/6/29 13:15:50')
In [22]: params = m.groupdict()
In [23]: params
Out[23]:
{'fracsecond': None,
'hour': None,
'minute': None,
'second': None,
'tz': None,
'tzhour': None,
'tzmin': None,
'year': '2016'}
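Only `year` is filled in because `re.match` succeeds as soon as a prefix of the string matches. The sketch below reuses the pattern from the session above and shows how much of the input it actually consumes. Comparing against `re.fullmatch` is just one way to expose the partial match; it is an illustration, not a description of what feedparser does internally. The names `YEAR_PATTERN`, `consumed`, and `leftover` are made up for this example.

```python
import re

# Pattern copied from the session above, split across lines for readability.
YEAR_PATTERN = (
    r"(?P<year>\d{4})"
    r"(T?(?P<hour>\d{2}):(?P<minute>\d{2})(:(?P<second>\d{2}))?"
    r"(\.(?P<fracsecond>\d+))?"
    r"(?P<tz>[+-](?P<tzhour>\d{2})(:(?P<tzmin>\d{2}))?|Z)?)?"
)

raw = "2016/6/29 13:15:50"

# re.match only anchors at the start, so a match on "2016" counts as success.
prefix_match = re.match(YEAR_PATTERN, raw)
consumed = prefix_match.group(0)          # the part the pattern matched
leftover = raw[prefix_match.end():]       # the part that was silently ignored

# re.fullmatch requires the whole string to match and rejects this input.
full_match = re.fullmatch(YEAR_PATTERN, raw)

print("consumed:", repr(consumed))
print("leftover:", repr(leftover))
print("fullmatch:", full_match)
```

Everything after the year is left unmatched, so the month, day, and time in the original string never reach the parser.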
The pattern used here can only extract the year, and the code then makes assumptions about the month and day. Why make such assumptions? Does this mean published_parsed cannot be trusted?
I solved it by doing this to meet my needs.
Did you open a PR for your modification?
@noMICROSOFTbuhtz No, I didn't. My way of solving the problem may not be appropriate.
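For anyone hitting the same problem, one possible workaround is to parse the raw `published` string yourself instead of relying on `published_parsed`. The following is only a minimal sketch and is not the modification mentioned in this thread: `parse_published` and `FALLBACK_FORMATS` are hypothetical names, and the list of formats is an assumption based on the single example date shown in this issue.

```python
from datetime import datetime

# Assumed non-ISO layouts; only the first one is taken from this issue.
FALLBACK_FORMATS = (
    "%Y/%m/%d %H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
)


def parse_published(raw):
    """Return a naive datetime for raw, or None if no known format fits."""
    text = raw.strip()
    for fmt in FALLBACK_FORMATS:
        try:
            # strptime accepts months and days without a leading zero.
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


print(parse_published("2016/6/29 15:07:41"))
```

With a parsed feed `f`, this would be called as `parse_published(f['entries'][0]["published"])`. The result carries no time zone, because the source string does not state one.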