Support Datetime other than those specified as 14-digits
The WARC 1.1 spec allows for more precise datetimes. These should be supported in the replay system. Does any tool exist that will generate these yet? If not, some sample data can be fabricated.
The further precision does not appear to be present in the link above. What's the BnF link?
Also see https://github.com/iipc/warc-specifications/pull/21.
First line WARC/1.1 causes an exception in the iterator we currently reuse from pywb to quickly invalidate the WARC and not proceed with processing.
Per the WARC/1.1 spec and https://github.com/iipc/warc-specifications/pull/21, date strings like 2014-01 are legal but currently break the indexer with:
Traceback (most recent call last):
File "/usr/local/bin/ipwb", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python2.7/site-packages/ipwb/__main__.py", line 17, in main
args = checkArgs(sys.argv)
File "/usr/local/lib/python2.7/site-packages/ipwb/__main__.py", line 151, in checkArgs
results.func(results)
File "/usr/local/lib/python2.7/site-packages/ipwb/__main__.py", line 32, in checkArgs_index
debug=args.debug)
File "/usr/local/lib/python2.7/site-packages/ipwb/indexer.py", line 141, in indexFileAt
warcFileFullPath, **encryptionAndCompressionSetting)
File "/usr/local/lib/python2.7/site-packages/ipwb/indexer.py", line 179, in getCDXJLinesFromFile
for i in iterForCounting(fhForCounting):
File "/usr/local/lib/python2.7/site-packages/pywb/warc/archiveiterator.py", line 543, in __call__
for entry in entry_iter:
File "/usr/local/lib/python2.7/site-packages/pywb/warc/archiveiterator.py", line 379, in create_record_iter
entry = self.parse_warc_record(record)
File "/usr/local/lib/python2.7/site-packages/pywb/warc/archiveiterator.py", line 465, in parse_warc_record
get_header('WARC-Date'))
File "/usr/local/lib/python2.7/site-packages/pywb/utils/timeutils.py", line 122, in iso_date_to_timestamp
return datetime_to_timestamp(iso_date_to_datetime(string))
File "/usr/local/lib/python2.7/site-packages/pywb/utils/timeutils.py", line 40, in iso_date_to_datetime
the_datetime = datetime.datetime(*map(int, nums))
TypeError: Required argument 'day' (pos 3) not found
...based on 6d219f5df29311c1d5b3efce006145b54a9cb0d8.
Added a sample (variableSizedDates) WARC that I believe conforms to the 1.1 standard with variable length datetime strings.
Encountered this again in testing, current master (73f136f48334ed9ca09c50413a2a9f0f51d251b0):
% ipwb index samples/warcs/variableSizedDates.warc
Traceback (most recent call last):
File "/usr/local/bin/ipwb", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/site-packages/ipwb/__main__.py", line 19, in main
args = checkArgs(sys.argv)
File "/usr/local/lib/python3.7/site-packages/ipwb/__main__.py", line 167, in checkArgs
results.func(results)
File "/usr/local/lib/python3.7/site-packages/ipwb/__main__.py", line 34, in checkArgs_index
debug=args.debug)
File "/usr/local/lib/python3.7/site-packages/ipwb/indexer.py", line 174, in indexFileAt
warcFileFullPath, **encryptionAndCompressionSetting)
File "/usr/local/lib/python3.7/site-packages/ipwb/indexer.py", line 291, in getCDXJLinesFromFile
record.rec_headers.get_header('WARC-Date'))
File "/usr/local/lib/python3.7/site-packages/ipwb/util.py", line 165, in iso8601ToDigits14
"%Y-%m-%dT%H:%M:%SZ")
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_strptime.py", line 577, in _strptime_datetime
tt, fraction, gmtoff_fraction = _strptime(data_string, format)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/_strptime.py", line 359, in _strptime
(data_string, format))
WARC 1.0 mandates a 14-digit date for the WARC-Date field:
A 14-digit UTC timestamp formatted according to YYYY-MM-DDThh:mm:ssZ, described in the W3C profile of ISO8601 [W3CDTF]. The timestamp shall represent the instant that data capture for record creation began. Multiple records written as part of a single capture event (see section 5.7) shall use the same WARC-Date, even though the times of their writing will not be exactly synchronized.
WARC-Date = "WARC-Date" ":" w3c-iso8601
w3c-iso8601 = YYYY-MM-DDThh:mm:ssZ
All records shall have a WARC-Date field.
WARC 1.1 allows for other variants, e.g., 2014-01:
The WARC-Date is a UTC timestamp as described in the W3C profile of ISO 8601:1988 [W3CDTF], for example YYYY-MM-DDThh:mm:ssZ. The timestamp shall represent the instant that data capture for record creation began. Multiple records written as part of a single capture event (see section 5.7) shall use the same WARC-Date, even though the times of their writing will not be exactly synchronized.
WARC-Date may be specified at any of the levels of granularity described in [W3CDTF]. If WARC-Date includes a decimal fraction of a second, the decimal fraction of a second shall have a minimum of 1 digit and a maximum of 9 digits. WARC-Date should be specified with as much precision as is accurately known. This document recommends no particular algorithm for access software to choose a record by date when an exact match is not available.
WARC-Date = "WARC-Date" ":" w3c-iso8601
w3c-iso8601 = <a UTC timestamp formatted according to [W3CDTF]>
All records shall have a WARC-Date field.
See Annex A for examples on usage of WARC-Date fields.
It seems more flexible to simply read and interpret the date rather than branch on which version of the spec the WARC claims to follow. As of now, iso8601ToDigits14() in util.py assumes the full YYYY-MM-DDThh:mm:ssZ form, hence WARC 1.0.
Given the rationale for conversion is from ISO8601 to 14-digit datetime, some options:
- Assume undefined aspects of the datetime, e.g., 2014-01 to 20140101000000
- Adapt to allow for fuzziness.
The former seems more straightforward but perhaps instills unintended assumptions. Fuzziness is inherent in datetimes, as time is continuous (e.g., the millisecond discussion for WARCs). If we read a fuzzy datetime from a WARC and go with the latter option, will it be compatible with storing this value in a CDXJ record with no assumptions about the datetime beyond what is specified?
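For illustration, the first option could look roughly like the following. This is a minimal sketch, not the ipwb implementation: `pad_to_digits14` is a hypothetical helper that strips the non-digit characters from a UTC W3CDTF date and fills any missing trailing fields with their earliest values (month and day 01, time 00:00:00).

```python
import re

# Earliest value for each field of a 14-digit timestamp (YYYYMMDDhhmmss).
# The first four characters are a placeholder for the year and are never used.
_EARLIEST = "00000101000000"


def pad_to_digits14(warc_date):
    """Hypothetical helper: reduce a W3CDTF date to digits, padded to 14.

    Assumes a UTC value such as '2014-01' or '2014-02-10T00:00:01Z'.
    Values with fractional seconds come back longer than 14 digits.
    """
    digits = re.sub(r"\D", "", warc_date)
    if len(digits) < 4:
        raise ValueError(f"no four-digit year in {warc_date!r}")
    # Slicing past the end of _EARLIEST yields '', so long values are unchanged.
    return digits + _EARLIEST[len(digits):]


print(pad_to_digits14("2014-01"))               # 20140101000000
print(pad_to_digits14("2014-02-10T00:00:01Z"))  # 20140210000001
```

The padding step is exactly where the unintended assumption enters: after conversion, 2014-01 can no longer be told apart from 2014-01-01T00:00:00Z.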
@ibnesayeed, can you provide some insight/feedback/commentary for this?
The key here is ISO8601 with "as much precision as is accurately known."
I cannot locate a module that accomplishes this, but one approach is a series of tests (e.g., regex) running from the highest level of granularity (9 digits following the second) all the way down to just the year. Starting from that end might seem wasteful, given that the most common ISO8601 form goes only up to seconds.
For Python:
%Y-%m-%dT%H:%M:%SZ
%Y-%m-%dT%H:%MZ
%Y-%m-%dT%HZ
%Y-%m-%d
%Y-%m
%Y
%Y-%m-%dT%H:%M:%S.[0-9]{1-9}Z
The last one is not quite correct (but you, future person, hopefully get the gist).
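The cascade of tests described above might be sketched as below. The names `WARC_DATE_FORMATS` and `parse_warc_date` are hypothetical, and the fractional-second variant is deliberately left out because it needs separate handling.

```python
import datetime

# Candidate formats: the common full-seconds form first, then coarser ones.
WARC_DATE_FORMATS = (
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%dT%H:%MZ",
    "%Y-%m-%dT%HZ",
    "%Y-%m-%d",
    "%Y-%m",
    "%Y",
)


def parse_warc_date(warc_date):
    """Hypothetical helper: try each format until one matches."""
    for fmt in WARC_DATE_FORMATS:
        try:
            return datetime.datetime.strptime(warc_date, fmt)
        except ValueError:
            continue  # Not this granularity; try the next format.
    raise ValueError(f"unsupported WARC-Date: {warc_date!r}")


print(parse_warc_date("2014-01"))               # 2014-01-01 00:00:00
print(parse_warc_date("2014-02-10T00:00:01Z"))  # 2014-02-10 00:00:01
```

Trying the full-seconds form first means the common case matches on the first attempt, which softens the concern about wasted checks.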
9cd23ba addresses some of this but I have yet to match the fraction-of-a-second example in that WARC:
import datetime
datetime.datetime.strptime('%Y-%m-%dT%H:%M:%S.%fZ','2014-02-10T00:00:01.000000002Z')
ValueError: time data '%Y-%m-%dT%H:%M:%S.%fZ' does not match format '2014-02-10T00:00:01.000000002Z'
There could be two possible approaches here:
- Identify all the potential datetime formats that are allowed and try to normalize them in one canonical form that is not lossy and is easier for lookup
- Gradually recognize and accommodate more formats that are in use, a canonical form for internal use will be helpful in this approach as well
We also need to figure out what URI format we want to support in the replay.
The parameters above are backward; the format string should be second. This works:
datetime.datetime.strptime('2014-02-10T00:00:01.000000Z','%Y-%m-%dT%H:%M:%S.%fZ')
datetime.datetime(2014, 2, 10, 0, 0, 1)
Note, however, that %f reads at most six digits (zero-padded on the right). The WARC/1.1 spec says:
the decimal fraction of a second shall have a minimum of 1 digit and a maximum of 9 digits.
This conflicts with the sub-second %f portion of the format string.
W3CDTF says:
s = one or more digits representing a decimal fraction of a second.
%f might be insufficient, as it accepts one to six digits and WARC-Dates can have 1-9 digits. Is there a format portion (akin to %f) that allows for this specification?
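A quick check of how %f behaves in strptime, shown only as an illustration (`FMT` and `accepts` are names made up for this example):

```python
import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%fZ"


def accepts(warc_date):
    """Return True if strptime can parse warc_date with FMT."""
    try:
        datetime.datetime.strptime(warc_date, FMT)
        return True
    except ValueError:
        return False


print(accepts("2014-02-10T00:00:01.5Z"))          # True
print(accepts("2014-02-10T00:00:01.123456Z"))     # True
print(accepts("2014-02-10T00:00:01.000000002Z"))  # False
```

As far as I can tell, the standard library has no directive covering seven to nine fractional digits, so those values need handling outside of strptime.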
This level of precision is unlikely but allowable per WARC/1.1, so we need a special case for compliance. One option is to first check compliance with:
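import datetime
dt = '2014-02-10T00:00:01.123456789Z'
# Keep at most six fractional digits (all that %f accepts), then re-append the trailing 'Z'
dt_f = f'{dt[:26]}{dt[-1:]}'
datetime.datetime.strptime(dt_f, '%Y-%m-%dT%H:%M:%S.%fZ')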
...then parse out dt[26:-1], append it to dt[20:26], check it is all digits, and if so, assign it to the final value of the datetime object.
datetime.datetime.microsecond is not write-able, so the more precise value cannot simply be set after parsing.
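One way around the read-only attribute is to keep the extra precision next to the datetime rather than inside it. The following is a sketch under that assumption; `parse_fractional_warc_date` is a hypothetical helper that returns a second-precision datetime together with the fraction expressed as integer nanoseconds.

```python
import datetime
import re

# Full date and time, then 1 to 9 fractional digits, then the UTC designator.
_FRACTIONAL = re.compile(
    r"(?P<base>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(?P<frac>\d{1,9})Z"
)


def parse_fractional_warc_date(warc_date):
    """Hypothetical helper: return (datetime, nanoseconds) for a WARC-Date."""
    match = _FRACTIONAL.fullmatch(warc_date)
    if match is None:
        raise ValueError(f"not a fractional-second WARC-Date: {warc_date!r}")
    base = datetime.datetime.strptime(match.group("base"), "%Y-%m-%dT%H:%M:%S")
    # Right-pad to nine digits so that '.5' means 500000000 ns, not 5 ns.
    nanoseconds = int(match.group("frac").ljust(9, "0"))
    return base, nanoseconds


dt, ns = parse_fractional_warc_date("2014-02-10T00:00:01.000000002Z")
print(dt.isoformat(), ns)  # 2014-02-10T00:00:01 2
```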
b76135a73a17afd1bb6f1ebd9206ce7265ad90ff adds support for generating more precise, solely digit-based date strings. These now appear in the generated CDXJ, for example:
% ipwb index samples/warcs/variableSizedDates.warc
!context ["http://tools.ietf.org/html/rfc7089"]
!meta {"generator": "InterPlanetary Wayback v.0.2020.06.18.1933", "created_at": "2020-06-19T14:37:49.991232"}
us,memento)/ 20140101000000 {"locator": "urn:ipfs/QmNQX5gEjbEPModBHXb6w4EWveLkZ57uEC9Kzh8bho7QmL/QmX4gE6SdJK8v67XikqQFJrac4xaqB5kwsgona2nH9hZwm", "status_code": "200", "mime_type": "text/html", "original_uri": "http://memento.us/"}
us,memento)/ 20140210000001 {"locator": "urn:ipfs/QmNQX5gEjbEPModBHXb6w4EWveLkZ57uEC9Kzh8bho7QmL/QmXQB6e2aB7VRaA4CK5H33sTfVC6GxNd1JtSgCaWVuUbfj", "status_code": "200", "mime_type": "text/html", "original_uri": "http://memento.us/"}
us,memento)/ 20140210000001 {"locator": "urn:ipfs/QmNQX5gEjbEPModBHXb6w4EWveLkZ57uEC9Kzh8bho7QmL/QmYWRfaHFcN7ygLUiiKEF6ELApMbdhv7K3zRtrz5rog83U", "status_code": "200", "mime_type": "text/html", "original_uri": "http://memento.us/"}
us,memento)/ 20140210000001000000002 {"locator": "urn:ipfs/QmNQX5gEjbEPModBHXb6w4EWveLkZ57uEC9Kzh8bho7QmL/Qmb8q1BFPws4ZNhL9MczY9tb4mWEPdV41LNuXD6oMkvzcw", "status_code": "200", "mime_type": "text/html", "original_uri": "http://memento.us/"}
More adjustments may need to be made to ensure replay can handle the potentially longer date strings (see last line above). This issue is complete, but I would like to investigate the other end: using the CDXJ files with long date strings in replay.
As suspected, when replaying the CDXJ above and accessing any memento, the digits14ToRFC1123() method is called and the datetime.datetime.strptime(digits14, '%Y%m%d%H%M%S') call within it throws a ValueError.
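A possible way for replay to tolerate the longer strings is sketched below. It is not the existing digits14ToRFC1123() code: `digits_to_rfc1123` is a hypothetical stand-in that assumes the first 14 digits are always YYYYMMDDhhmmss and that dropping any sub-second digits is acceptable, since an RFC 1123 date only carries whole seconds.

```python
import datetime


def digits_to_rfc1123(digits):
    """Hypothetical helper: format a digit-only datetime of 14+ digits.

    Any digits beyond the first 14 (fractions of a second) are discarded.
    """
    if not digits.isdigit() or len(digits) < 14:
        raise ValueError(f"expected at least 14 digits, got {digits!r}")
    parsed = datetime.datetime.strptime(digits[:14], "%Y%m%d%H%M%S")
    # %a and %b depend on the locale; the default C locale gives English names.
    return parsed.strftime("%a, %d %b %Y %H:%M:%S GMT")


print(digits_to_rfc1123("20140210000001000000002"))
print(digits_to_rfc1123("20140101000000"))
```

Whether truncation is the right behavior, as opposed to carrying the extra precision through to the response, is still an open question here.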
- [ ] Check how replay handles CDXJ entries that are > (and <?) 14 digits, as generated from the indexer in 60de785.