feedparser published_parsed is wrong sometimes

Open thecrackofdawn opened this issue 9 years ago • 3 comments

In [27]: f['entries'][0]["published"]
Out[27]: u'2016/6/29 15:07:41'

In [28]: f['entries'][0]["published_parsed"]
Out[28]: time.struct_time(tm_year=2016, tm_mon=6, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=153, tm_isdst=0)

In [29]: datetime.datetime.fromtimestamp(time.mktime(f['entries'][0]["published_parsed"]))
Out[29]: datetime.datetime(2016, 6, 1, 0, 0)

As we can see, feedparser parsed "2016/6/29 15:07:41" into datetime.datetime(2016, 6, 1, 0, 0).
After a quick read of the related code, I found that _parse_date_iso8601 is used to parse the date. The problem is that 2016/6/29 15:07:41 is not in ISO 8601 format. Here is the regex pattern used in _parse_date_iso8601 and the result it returns for this input:

In [20]: m = re.match("(?P<year>\d{4})(T?(?P<hour>\d{2}):(?P<minute>\d{2})(:(?P<second>\d{2}))?(\.(?P<fracsecond>\d+))?(?P<tz>[+-](?P<tzhour>\d{2})(:(?P<tzmin>\d{2}))?|Z)?)?", '2016/6/29 13:15:50')
In [22]: params = m.groupdict()
In [23]: params
Out[23]:
{'fracsecond': None,
 'hour': None,
 'minute': None,
 'second': None,
 'tz': None,
 'tzhour': None,
 'tzmin': None,
 'year': '2016'}

The regex pattern used here can only extract the year, and the code then makes assumptions about the month and day. Why make assumptions at all? Does this mean published_parsed cannot be trusted?
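One possible workaround, sketched here under assumptions rather than taken from feedparser itself, is to parse the raw published string directly when it is in the slash-separated form shown above. The function name parse_slash_date and the constant SLASH_FORMAT are hypothetical, and the sketch assumes the feed's timestamps can be treated as UTC because the string carries no timezone.

```python
from datetime import datetime

# Assumed layout of the feed's non-ISO dates, e.g. "2016/6/29 15:07:41".
SLASH_FORMAT = "%Y/%m/%d %H:%M:%S"


def parse_slash_date(date_string):
    """Parse "YYYY/M/D HH:MM:SS" into a struct_time, or return None.

    The input has no timezone, so the fields are kept as-is
    (assumption: the feed publishes UTC).
    """
    try:
        parsed = datetime.strptime(date_string.strip(), SLASH_FORMAT)
    except ValueError:
        # Not in the assumed format; let the caller fall back to something else.
        return None
    return parsed.utctimetuple()


published = parse_slash_date("2016/6/29 15:07:41")

# feedparser documents a registerDateHandler hook for extra date formats.
# Whether registering this function fixes published_parsed in a given
# feedparser version is an assumption, so the call is left commented out:
# import feedparser
# feedparser.registerDateHandler(parse_slash_date)
```

Returning None for anything that does not match keeps the function from guessing missing fields, which is the behaviour questioned above.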

Jun 29 '16 08:06 thecrackofdawn

I solved it by doing this to meet my needs.

Jun 29 '16 09:06 thecrackofdawn

Did you open a PR for your modification?

Jun 26 '18 20:06 buhtz

@noMICROSOFTbuhtz No, I didn't. It may not be appropriate to solve the problem my way.

Jun 27 '18 08:06 thecrackofdawn
