open-semantic-etl
ETL Web: Parse last modification date from webserver
After the upgrade to Python 3 with urllib, there is a problem parsing the last modification date sent by the web server, for example:
Wed, 21 Jun 2017 11:35:20 +0000
The dateutil parser now in use does not seem to be able to parse it.
The old Python 2 library urllib2 could parse it and return a structured time via headers.getdate() ...
Is there a lib that can handle different web server timestamp formats? Using time.strptime() for a special format would be limited to only this special format.
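One standard-library candidate, offered as a sketch rather than a tested fix for this project: Python 3's email.utils.parsedate_to_datetime parses RFC 2822 style dates, which covers both the "+0000" and the "GMT" variants quoted in this thread. The helper name parse_http_date is hypothetical, and treating a missing offset as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Parse an RFC 2822 style date header; return None if it cannot be parsed.

    parse_http_date is a hypothetical helper name, not part of open-semantic-etl.
    """
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Depending on the Python 3 version, bad input raises TypeError or ValueError
        return None
    if parsed.tzinfo is None:
        # Assumption: a header without an offset is treated as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


last_modified = parse_http_date('Wed, 21 Jun 2017 11:35:20 +0000')
date_header = parse_http_date('Wed, 21 Jun 2017 11:35:20 GMT')
```

Both calls should yield the same timezone-aware datetime, so the "date" and "last-modified" headers could go through one code path.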
The HTTP header "date" with the format Wed, 21 Jun 2017 11:35:20 GMT can be parsed without problems by dateutil parser.
So this affects only the HTTP header "last-modified".
Will try https://github.com/scrapinghub/dateparser
How about (python 3.6.4):
from datetime import datetime
ddate = datetime.strptime('Wed, 21 Jun 2017 11:35:20 +0000', '%a, %d %b %Y %H:%M:%S %z')
The problem is not parsing one special format, but all different possible formats.
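A possible way to cover several formats without a third-party dependency is a fallback chain: try a general parser first, then a list of explicit strptime formats. The sketch below assumes the formats of interest are the three date layouts HTTP permits (the RFC 1123 style, the obsolete RFC 850 style and the asctime style). The names FALLBACK_FORMATS and parse_last_modified are hypothetical, and the format list would need extending for other timestamps seen in practice.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumed fallback formats, tried only if the general parser fails
FALLBACK_FORMATS = (
    '%A, %d-%b-%y %H:%M:%S GMT',  # RFC 850 style: Sunday, 06-Nov-94 08:49:37 GMT
    '%a %b %d %H:%M:%S %Y',       # asctime style: Sun Nov  6 08:49:37 1994
)


def parse_last_modified(value):
    """Return a timezone-aware datetime, or None if no parser accepts value."""
    parsed = None
    try:
        # First attempt: the RFC 2822 parser from the standard library
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Second attempt: walk the explicit formats in order
        for fmt in FALLBACK_FORMATS:
            try:
                parsed = datetime.strptime(value.strip(), fmt)
                break
            except ValueError:
                continue
    if parsed is None:
        return None
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


samples = [
    'Sun, 06 Nov 1994 08:49:37 GMT',
    'Sunday, 06-Nov-94 08:49:37 GMT',
    'Sun Nov  6 08:49:37 1994',
]
results = [parse_last_modified(s) for s in samples]
```

Returning None instead of raising lets the caller decide whether to skip the document or fall back to a heuristic library.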
Seems this tool could provide good heuristic results or solutions:
https://github.com/adbar/htmldate
Another lib to evaluate: https://github.com/akoumjian/datefinder