data.gov
WAF timestamp extract optimization
User Story
In order to compare WAF records efficiently, datagov wants to incorporate each record's timestamp into the comparison.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
- [ ] GIVEN harvest.py
  AND a WAF source url
  WHEN the WAF traversal occurs
  THEN the timestamp of the record should be included in the Record instance
  AND used in the comparison function
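A minimal sketch of what this criterion could look like in code. The `Record` dataclass here is a hypothetical stand-in, not the actual class in harvest.py, and `needs_download`, `compare`, and the `db_timestamps` mapping are illustrative names only. It assumes timestamps on both sides are comparable `datetime` values.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, Optional


@dataclass
class Record:
    """Hypothetical stand-in for the harvester's Record class."""
    url: str
    # Last-modified time scraped from the WAF listing, if one was found.
    timestamp: Optional[datetime] = None


def needs_download(waf_record: Record, db_timestamp: Optional[datetime]) -> bool:
    """Return True when the WAF copy should be fetched."""
    # If either side has no timestamp, download so nothing is missed.
    if waf_record.timestamp is None or db_timestamp is None:
        return True
    # Only fetch files that changed after the copy we already hold.
    return waf_record.timestamp > db_timestamp


def compare(waf_records: Iterable[Record],
            db_timestamps: Dict[str, datetime]) -> Dict[str, bool]:
    """Label each WAF record url with whether it needs to be downloaded."""
    return {
        record.url: needs_download(record, db_timestamps.get(record.url))
        for record in waf_records
    }
```

Records with no known timestamp, or with no matching DB entry, are labeled for download. That keeps the comparison safe when a listing cannot be parsed, at the cost of some extra network calls.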
Background
- we are responsible for harvesting records from a waf
- to make our comparison more efficient, we want to consider the timestamp of the files. The idea is to reduce the work required for a given WAF source by downloading only what we need, which means fewer network calls and fewer opportunities for something to go wrong.
- this ticket assumes a date/time stamp is included when a harvest record is read from the DB
Security Considerations (required)
[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]
Sketch
- update traverse_waf to get the timestamp of the files
- update compare to use the waf record timestamp instead of the hash
- update order of functions in extract and compare
- traverse_waf
  - get all record urls
- compare
  - if the source is a waf then we want to compare the timestamp of the records not the hash
  - label the file url to indicate if it needs to be downloaded
- download_waf
  - skip if file url doesn't need to be downloaded
- It could be a challenge to scrape timestamps out of a WAF listing, since different web servers (or versions) show timestamps in different ways. Here is how ckanext-spatial does it.
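To illustrate the scraping problem, here is a rough sketch that pulls timestamps out of a directory listing. It is not the ckanext-spatial approach. It assumes an Apache/nginx-style listing where the date follows the link, and it only knows two date layouts. `extract_timestamps`, `DATE_FORMATS`, and `LINK` are hypothetical names. Listings that put the date before the link (as some servers do) would need a different pattern, and `%b` parsing depends on an English locale.

```python
import re
from datetime import datetime

# Two date layouts assumed for Apache/nginx-style listings (not exhaustive).
DATE_FORMATS = (
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}"), "%Y-%m-%d %H:%M"),
    (re.compile(r"\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}"), "%d-%b-%Y %H:%M"),
)
# Relative links to .xml files; skips sort links ("?C=N") and parent dirs ("/").
LINK = re.compile(r'<a href="([^"?/][^"]*\.xml)"', re.IGNORECASE)


def _find_date(text):
    """Try each known layout in turn and return the first one that parses."""
    for pattern, fmt in DATE_FORMATS:
        found = pattern.search(text)
        if found:
            return datetime.strptime(found.group(0), fmt)
    return None


def extract_timestamps(listing_html):
    """Map each .xml href to a timestamp found after it, or None."""
    links = list(LINK.finditer(listing_html))
    result = {}
    for i, match in enumerate(links):
        # Only search the text between this link and the next one.
        end = links[i + 1].start() if i + 1 < len(links) else len(listing_html)
        result[match.group(1)] = _find_date(listing_html[match.end():end])
    return result
```

A file whose timestamp cannot be parsed maps to `None`, so the caller can fall back to downloading it rather than silently skipping it.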