WAF timestamp extract optimization

Open rshewitt opened this issue 10 months ago • 1 comment

User Story

In order to efficiently compare waf records, datagov wants to incorporate the timestamp of the record(s).

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • [ ] GIVEN harvest.py
    AND a waf source url
    WHEN the waf traversal occurs
    THEN the timestamp of the record should be included in the Record instance
    AND used in the comparison function

Background

  • we are responsible for harvesting records from a waf
  • to make our comparison more efficient, we want to consider the timestamp of the files. the idea is to reduce the work required for a given waf source by downloading only what we need, which means fewer network calls and fewer opportunities for something to go wrong.
  • this ticket assumes a timestamp is included when a harvest record is read from the DB

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

  • update traverse_waf to get the timestamp of the files
  • update compare to use the waf record timestamp instead of the hash
  • update order of functions in extract and compare
    • traverse_waf
      • get all record urls
    • compare
      • if the source is a waf then we want to compare the timestamp of the records not the hash
      • label the file url to indicate if it needs to be downloaded
    • download_waf
      • skip if file url doesn't need to be downloaded
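
The compare and download steps in the sketch above could look roughly like the following. This is a minimal illustration, not the harvest.py implementation: `WafRecord`, its fields, the `db_timestamps` mapping, and the `fetch` callable are all hypothetical names introduced here, and it assumes the WAF timestamp and the stored DB timestamp are directly comparable (same timezone handling).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List, Optional


@dataclass
class WafRecord:
    """Hypothetical stand-in for the Record instance named in the ticket."""
    url: str
    modified: Optional[datetime]  # timestamp read from the waf listing, if any
    needs_download: bool = False


def compare(waf_records: List[WafRecord],
            db_timestamps: Dict[str, datetime]) -> List[WafRecord]:
    """Label each waf record by comparing timestamps instead of hashes.

    db_timestamps maps a record url to the timestamp stored with the
    previously harvested record (assumed to be available on DB read).
    """
    for record in waf_records:
        previous = db_timestamps.get(record.url)
        # Download when the record is new, when the waf listing gave no
        # timestamp, or when the waf copy is newer than the harvested copy.
        record.needs_download = (
            previous is None
            or record.modified is None
            or record.modified > previous
        )
    return waf_records


def download_waf(waf_records: List[WafRecord],
                 fetch: Callable[[str], str]) -> Dict[str, str]:
    """Fetch only the records that compare() labelled for download."""
    return {r.url: fetch(r.url) for r in waf_records if r.needs_download}
```

Falling back to a download when a timestamp is missing keeps the behaviour safe for servers whose listings cannot be parsed, at the cost of the extra network call the ticket is trying to avoid.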

rshewitt avatar Apr 23 '24 17:04 rshewitt

It could be a challenge to scrape timestamps out of a WAF listing, since different web servers (or versions) show timestamps in different ways. Here is how ckanext-spatial does it.
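
One way to cope with varying listing formats is to try several timestamp layouts in turn. The sketch below is only an illustration and is not taken from ckanext-spatial: the function name `extract_timestamp` is hypothetical, and the three layouts are assumptions about formats commonly seen in directory listings, not an exhaustive or verified list.

```python
import re
from datetime import datetime
from typing import Optional

# (pattern, strptime format) pairs. These layouts are assumed examples of
# what directory listings may contain; real servers may differ.
_PATTERNS = [
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}"), "%Y-%m-%d %H:%M"),
    (re.compile(r"\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}"), "%d-%b-%Y %H:%M"),
    (re.compile(r"\d{1,2}/\d{1,2}/\d{4}\s+\d{1,2}:\d{2} [AP]M"),
     "%m/%d/%Y %I:%M %p"),
]


def extract_timestamp(listing_line: str) -> Optional[datetime]:
    """Return the first timestamp found in one line of a WAF listing.

    Returns None when no known layout matches, so the caller can fall
    back to another strategy (for example, always downloading the file).
    Note: %b and %p depend on the process locale; English is assumed.
    """
    for pattern, fmt in _PATTERNS:
        match = pattern.search(listing_line)
        if match is None:
            continue
        # Collapse runs of whitespace so the text matches the format string.
        text = " ".join(match.group(0).split())
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            # Looked like a timestamp but was not a valid date; keep trying.
            continue
    return None
```

The returned values are naive datetimes, since listings of this kind typically do not state a timezone; any comparison against stored timestamps would need to account for that.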

FuhuXia avatar May 06 '24 17:05 FuhuXia