feedparser

published_parsed is wrong sometimes

Open thecrackofdawn opened this issue 9 years ago • 3 comments

In [27]: f['entries'][0]["published"]
Out[27]: u'2016/6/29 15:07:41'

In [28]: f['entries'][0]["published_parsed"]
Out[28]: time.struct_time(tm_year=2016, tm_mon=6, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=153, tm_isdst=0)

In [29]: datetime.datetime.fromtimestamp(time.mktime(f['entries'][0]["published_parsed"]))
Out[29]: datetime.datetime(2016, 6, 1, 0, 0)

As we can see, feedparser parsed "2016/6/29 15:07:41" into datetime.datetime(2016, 6, 1, 0, 0).
After a quick read of the related code, I found that _parse_date_iso8601 is used to parse the date. The problem is that 2016/6/29 15:07:41 is not in ISO 8601 format. Here is the re pattern used in _parse_date_iso8601 and the result it returns:

In [20]: m = re.match("(?P<year>\d{4})(T?(?P<hour>\d{2}):(?P<minute>\d{2})(:(?P<second>\d{2}))?(\.(?P<fracsecond>\d+))?(?P<tz>[+-](?P<tzhour>\d{2})(:(?P<tzmin>\d{2}))?|Z)?)?", '2016/6/29 13:15:50')
In [22]: params = m.groupdict()
In [23]: params
Out[23]:
{'fracsecond': None,
 'hour': None,
 'minute': None,
 'second': None,
 'tz': None,
 'tzhour': None,
 'tzmin': None,
 'year': '2016'}

The re pattern used here can only extract the year, and the code then makes assumptions about the month and day. Why make assumptions at all? Does this mean published_parsed cannot be trusted?
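To make the partial match explicit, here is a minimal sketch using only the standard library. It reuses the pattern from the session above, checks whether the match consumed the whole string, and falls back to a strict parse of the raw `published` value. The helper names `is_full_match` and `parse_slash_date` are hypothetical, and the format string assumes the feed always emits `YYYY/M/D H:M:S` with no timezone.

```python
import re
from datetime import datetime

# Pattern copied from the session above (raw strings avoid escape warnings).
ISO_YEAR_PATTERN = re.compile(
    r"(?P<year>\d{4})(T?(?P<hour>\d{2}):(?P<minute>\d{2})(:(?P<second>\d{2}))?"
    r"(\.(?P<fracsecond>\d+))?(?P<tz>[+-](?P<tzhour>\d{2})(:(?P<tzmin>\d{2}))?|Z)?)?"
)


def is_full_match(pattern, text):
    """Hypothetical helper: True only if the pattern consumed the entire string."""
    m = pattern.match(text)
    return m is not None and m.end() == len(text)


def parse_slash_date(text):
    """Hypothetical helper: strictly parse 'YYYY/M/D H:M:S'.

    Returns a time.struct_time, or None if the string does not fit.
    Assumption: the value carries no timezone, so the result is naive.
    """
    try:
        return datetime.strptime(text.strip(), "%Y/%m/%d %H:%M:%S").timetuple()
    except ValueError:
        return None


raw = "2016/6/29 15:07:41"

partial = ISO_YEAR_PATTERN.match(raw)
print(partial.group(0))                      # the part the pattern matched
print(is_full_match(ISO_YEAR_PATTERN, raw))  # whether the whole string matched

parsed = parse_slash_date(raw)
print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday, parsed.tm_hour)
```

Until the library handles this format, reading the raw `published` string and parsing it yourself, as above, avoids relying on `published_parsed`. feedparser also documents a `registerDateHandler` hook for custom date formats; whether a handler like this one is appropriate for your feeds is an assumption you would need to check.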

thecrackofdawn, Jun 29 '16 08:06

I solved it by doing this, which meets my needs.

thecrackofdawn, Jun 29 '16 09:06

Did you open a PR for your modification?

buhtz, Jun 26 '18 20:06

@noMICROSOFTbuhtz No, I didn't. It may not be appropriate to solve the problem my way.

thecrackofdawn, Jun 27 '18 08:06