hdx-ckan
Dataset dates for remote files get stale
The resource shown below is being updated daily, but our date for it will stay the same, which is confusing.
We could:
- figure out a way to monitor remote resources to see if they have changed, though that may require different tests for different remote sources
- have a way for contributors to mark a dataset as "frequently updated" in which case we don't show a date. Although easier, this would be somewhat confusing to users and could also get stale if data updates stop.
@cjhendrix please give more details about which resource this issue is about. Do we have this information in the activity stream displayed for the dataset that owns the resource?
This applies to remotely hosted OSM extracts which are being updated every 30 minutes, such as:
https://data.hdx.rwlabs.org/dataset/nepal-openstreetmap-extracts-roads
https://data.hdx.rwlabs.org/dataset/nepal-openstreetmap-extracts-buildings
https://data.hdx.rwlabs.org/dataset/nepal-openstreetmap-extracts-places
I currently go in and manually update the date each day. I don't know the best way to deal with this. Downloading a 200MB+ file to hash and compare with the existing hash seems easy to code but inefficient.
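For reference, a minimal sketch of that hash-and-compare idea, assuming a hypothetical resource URL and a previously stored hash; streaming the download at least keeps the 200MB+ file out of memory, even if it doesn't avoid the bandwidth cost:

```python
# Sketch only: hash a remote resource and compare with the hash recorded last time.
import hashlib
import requests

RESOURCE_URL = "https://example.org/nepal-roads.zip"  # hypothetical remote resource
previous_hash = "<hash recorded on the last check>"   # placeholder

def remote_sha256(url):
    """Stream the remote file and hash it without keeping it all in memory."""
    digest = hashlib.sha256()
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MB chunks
            digest.update(chunk)
    return digest.hexdigest()

if remote_sha256(RESOURCE_URL) != previous_hash:
    print("Resource changed; the dataset date should be bumped.")
```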
With OGC services (such as GeoNode), you could sync a dataset with its capabilities doc or a CSW endpoint at an interval and specify which fields to update. However, for datasets without corresponding capabilities docs, it's not as straightforward.
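As a rough illustration of that interval sync (the endpoint, record id, and element names below are all assumptions; real CSW responses vary by server), you can often pull a record's last-modified date without touching the data itself:

```python
# Sketch: poll a CSW endpoint for a record's last-modified date.
import xml.etree.ElementTree as ET

import requests

CSW_URL = "https://example-geonode.org/catalogue/csw"  # hypothetical CSW endpoint
RECORD_ID = "example-dataset-uuid"                      # hypothetical record identifier

params = {
    "service": "CSW",
    "version": "2.0.2",
    "request": "GetRecordById",
    "id": RECORD_ID,
    "elementsetname": "full",
}
resp = requests.get(CSW_URL, params=params, timeout=30)
resp.raise_for_status()

root = ET.fromstring(resp.content)
# Dublin Core records usually carry the modification date in dct:modified.
modified = root.find(".//{http://purl.org/dc/terms/}modified")
print("Remote record last modified:", modified.text if modified is not None else "unknown")
```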
There's no structured field for "frequency" in the UI, so that might be a first step. I currently add the frequency to the description.
ST asked for this to be done "soon" in our discussion on the FAQ text. Putting it into sprint candidates. Tagging @amcguire62.
We can handle freshness with a separate monitoring system that works like this (rough sketch below):
This should definitely run on a separate VM. Over the week (probably in one batch nightly), the monitoring system would:
- download each dataset's metadata record once and compare the dataset's update frequency with the date it was last checked;
- if a check is due, download any resources that have external URLs and compare each one (a simple Unix cmp(1)) with the last version we downloaded;
- if a resource has changed, update the dataset date.
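Here's what that nightly batch could look like. The helper name `update_dataset_date` and the metadata fields `last_checked` and `update_frequency_days` are made up for the sketch; they stand in for whatever the CKAN API actually gives us:

```python
# Hypothetical sketch of the nightly freshness check (helper names are made up).
import filecmp
from datetime import datetime, timedelta
from pathlib import Path

import requests

CACHE_DIR = Path("/var/lib/hdx-freshness")  # last-downloaded copies live here
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def check_dataset(dataset):
    """dataset: metadata dict already fetched from the CKAN API (e.g. package_show)."""
    frequency = timedelta(days=dataset.get("update_frequency_days", 7))
    last_checked = datetime.fromisoformat(dataset["last_checked"])
    if datetime.utcnow() - last_checked < frequency:
        return  # not due yet

    changed = False
    for resource in dataset["resources"]:
        if not resource["url"].startswith("http"):
            continue  # only externally hosted resources need polling
        cached = CACHE_DIR / f"{resource['id']}.bin"
        fresh = cached.with_suffix(".new")
        with requests.get(resource["url"], stream=True, timeout=300) as resp:
            resp.raise_for_status()
            with open(fresh, "wb") as out:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    out.write(chunk)
        # Same idea as cmp(1): byte-for-byte comparison with the previous copy.
        if not cached.exists() or not filecmp.cmp(fresh, cached, shallow=False):
            changed = True
        fresh.replace(cached)

    if changed:
        update_dataset_date(dataset["id"])  # hypothetical: e.g. package_patch via the CKAN API
```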
@brew mentions the CKAN deadoralive extension as another possible model to follow.
Took a quick look at the deadoralive source and, despite what it says, it looks pretty synchronous. Could be an issue at some point.
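To illustrate the concern (and one possible fix if it ever bites us), the per-URL checks could be fanned out over a thread pool rather than run one at a time. This is just an illustration with placeholder URLs, not a patch against deadoralive:

```python
# Illustration only: check many resource URLs concurrently instead of one by one.
from concurrent.futures import ThreadPoolExecutor

import requests

def is_alive(url):
    try:
        resp = requests.head(url, allow_redirects=True, timeout=30)
        return url, resp.status_code < 400
    except requests.RequestException:
        return url, False

urls = ["https://example.org/a.zip", "https://example.org/b.zip"]  # placeholders
with ThreadPoolExecutor(max_workers=10) as pool:
    for url, ok in pool.map(is_alive, urls):
        print(("OK  " if ok else "DEAD"), url)
```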
Thanks for checking that, @reubano — I've seen similar things in Drupal, etc, so I can imagine how it's set up.
In any case, as a basic architectural principle, we should never turn down a chance for horizontal scaling when the cost is this low. There's no reason to run these housekeeping batch tasks inside the main CKAN process, since they're not part of the interactive user experience, and there are strong reasons for running batch jobs externally:
- Performance: not competing with the Web layer for CPU/memory/disk space (beyond what's needed to respond to API calls).
- Stability: a bug in the housekeeping/batch code won't cause a crash or security hole in the HDX web site.
- Maintainability: releases of the batch housekeeping code don't have to be tightly coupled to releases of the Web site.