ETL Web: Parse last modification date from webserver

Open opensemanticsearch opened this issue 8 years ago • 5 comments

After the upgrade to Python 3 and urllib, parsing the last modification date sent by the web server fails for values like

Wed, 21 Jun 2017 11:35:20 +0000

The dateutil parser that is used now does not seem to be able to parse it.

The old Python 2 library urllib2 was able to parse it and return a structured time via headers.getdate() ...

Is there a library that can handle the different timestamp formats that web servers send? Using time.strptime() with one fixed format would only cover that single format.
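One candidate from the standard library is email.utils.parsedate_to_datetime(), which parses RFC 2822 style dates, the same family of formats the old rfc822-based header parsing in Python 2 dealt with. The following is only a minimal sketch; the function name last_modified_to_datetime is a hypothetical helper for illustration and not part of open-semantic-etl.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def last_modified_to_datetime(header_value):
    """Parse an RFC 2822 style date header value into a datetime.

    Hypothetical helper for illustration only. Values with a numeric
    offset such as "+0000" or with "GMT" come back timezone-aware.
    """
    return parsedate_to_datetime(header_value)


# The value from this issue, with a numeric UTC offset
parsed = last_modified_to_datetime('Wed, 21 Jun 2017 11:35:20 +0000')
print(parsed.isoformat())
```

Note that this covers only the e-mail/HTTP style of date; it is not a general parser for arbitrary timestamp formats.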

opensemanticsearch avatar Jun 21 '17 11:06 opensemanticsearch

The HTTP header "date" with the format Wed, 21 Jun 2017 11:35:20 GMT can be parsed without problems by dateutil parser.

So this only affects the HTTP header "last-modified".

Mandalka avatar Jun 21 '17 11:06 Mandalka

Will try https://github.com/scrapinghub/dateparser

Mandalka avatar Apr 24 '18 18:04 Mandalka

How about (python 3.6.4):

from datetime import datetime
ddate = datetime.strptime('Wed, 21 Jun 2017 11:35:20 +0000', '%a, %d %b %Y %H:%M:%S %z')

clamor avatar Apr 30 '18 13:04 clamor

The problem is not parsing one special format, but all different possible formats.

This tool seems like it could provide good heuristic results or solutions:

https://github.com/adbar/htmldate
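Independent of which library is chosen, handling several possible formats can be sketched as a fallback chain: try the standard library's RFC 2822 parser first, then a list of explicit strptime() formats, and return None if nothing matches. The helper name parse_http_date and the FALLBACK_FORMATS list are assumptions made for this illustration; the list would need to be extended with whatever formats actually show up in practice.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical list of extra formats to try; extend as needed.
FALLBACK_FORMATS = [
    '%Y-%m-%dT%H:%M:%S%z',   # e.g. 2017-06-21T11:35:20+0000
    '%Y-%m-%d %H:%M:%S',     # e.g. 2017-06-21 11:35:20 (no timezone)
]


def parse_http_date(value):
    """Try several parsers in order; return a datetime or None."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError on unparsable input,
        # newer ones raise ValueError.
        pass
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None


for candidate in ('Wed, 21 Jun 2017 11:35:20 +0000',
                  '2017-06-21T11:35:20+0000',
                  'not a date'):
    print(repr(candidate), '->', parse_http_date(candidate))
```

Returning None instead of raising lets the caller decide whether to skip the date or hand the string to a heuristic parser as a last resort.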

Mandalka avatar Jan 16 '20 18:01 Mandalka

Another lib to evaluate: https://github.com/akoumjian/datefinder

opensemanticsearch avatar Jan 30 '20 14:01 opensemanticsearch