hdx-ckan
Dataset dates for remote files get stale
The resource shown below is being updated daily, but our date for it will stay the same, which is confusing.
We could:
- figure out a way to monitor remote resources to see if they have changed, though that may require different tests for different remote sources
- have a way for contributors to mark a dataset as "frequently updated" in which case we don't show a date. Although easier, this would be somewhat confusing to users and could also get stale if data updates stop.
@cjhendrix please give more details about which resource this issue is about. Do we have this information in the activity stream displayed for the dataset that owns the resource?
This applies to remotely hosted OSM extracts which are being updated every 30 minutes, such as:
https://data.hdx.rwlabs.org/dataset/nepal-openstreetmap-extracts-roads
https://data.hdx.rwlabs.org/dataset/nepal-openstreetmap-extracts-buildings
https://data.hdx.rwlabs.org/dataset/nepal-openstreetmap-extracts-places
I currently go in and manually update the date each day. I don't know the best way to deal with this. Downloading a 200MB+ file to hash and compare with the existing hash seems easy to code but inefficient.
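For reference, a minimal sketch of that hash-and-compare idea, assuming a hypothetical resource URL and a previously stored hash; streaming the download at least keeps the 200MB+ file out of memory, even if it doesn't avoid the bandwidth cost:

```python
# Sketch only: hash a remote resource and compare with the hash recorded last time.
import hashlib
import requests

RESOURCE_URL = "https://example.org/nepal-roads.zip"  # hypothetical remote resource
previous_hash = "<hash recorded on the last check>"   # placeholder

def remote_sha256(url):
    """Stream the remote file and hash it without keeping it all in memory."""
    digest = hashlib.sha256()
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MB chunks
            digest.update(chunk)
    return digest.hexdigest()

if remote_sha256(RESOURCE_URL) != previous_hash:
    print("Resource changed; the dataset date should be bumped.")
```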
With OGC services (such as GeoNode), you could sync a dataset with its capabilities doc or a CSW endpoint at an interval and specify which fields to update. However, for datasets without corresponding capabilities docs, it's not as straightforward.
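As a rough illustration of that interval sync (the endpoint, record id, and element names below are all assumptions; real CSW responses vary by server), you can often pull a record's last-modified date without touching the data itself:

```python
# Sketch: poll a CSW endpoint for a record's last-modified date.
import xml.etree.ElementTree as ET

import requests

CSW_URL = "https://example-geonode.org/catalogue/csw"  # hypothetical CSW endpoint
RECORD_ID = "example-dataset-uuid"                      # hypothetical record identifier

params = {
    "service": "CSW",
    "version": "2.0.2",
    "request": "GetRecordById",
    "id": RECORD_ID,
    "elementsetname": "full",
}
resp = requests.get(CSW_URL, params=params, timeout=30)
resp.raise_for_status()

root = ET.fromstring(resp.content)
# Dublin Core records usually carry the modification date in dct:modified.
modified = root.find(".//{http://purl.org/dc/terms/}modified")
print("Remote record last modified:", modified.text if modified is not None else "unknown")
```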
There's no structured field for "frequency" in the UI, so that might be a first step. I currently add the frequency to the description.
ST asked for this to be done "soon" in our discussion on the FAQ text. Putting it into sprint candidates. Tagging @amcguire62.
We can handle freshness with a separate monitoring system that works like this (rough sketch below):
This should definitely run on a separate VM. Over the week (probably in one batch nightly), the monitoring system would:
- download each dataset's metadata record once and compare the dataset's update frequency with the date it was last checked;
- if a check is due, download any resources that have external URLs and compare each one (a simple Unix cmp(1)) with the last version we downloaded;
- if a resource has changed, update the dataset date.
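Here's what that nightly batch could look like. The helper name `update_dataset_date` and the metadata fields `last_checked` and `update_frequency_days` are made up for the sketch; they stand in for whatever the CKAN API actually gives us:

```python
# Hypothetical sketch of the nightly freshness check (helper names are made up).
import filecmp
from datetime import datetime, timedelta
from pathlib import Path

import requests

CACHE_DIR = Path("/var/lib/hdx-freshness")  # last-downloaded copies live here
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def check_dataset(dataset):
    """dataset: metadata dict already fetched from the CKAN API (e.g. package_show)."""
    frequency = timedelta(days=dataset.get("update_frequency_days", 7))
    last_checked = datetime.fromisoformat(dataset["last_checked"])
    if datetime.utcnow() - last_checked < frequency:
        return  # not due yet

    changed = False
    for resource in dataset["resources"]:
        if not resource["url"].startswith("http"):
            continue  # only externally hosted resources need polling
        cached = CACHE_DIR / f"{resource['id']}.bin"
        fresh = cached.with_suffix(".new")
        with requests.get(resource["url"], stream=True, timeout=300) as resp:
            resp.raise_for_status()
            with open(fresh, "wb") as out:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    out.write(chunk)
        # Same idea as cmp(1): byte-for-byte comparison with the previous copy.
        if not cached.exists() or not filecmp.cmp(fresh, cached, shallow=False):
            changed = True
        fresh.replace(cached)

    if changed:
        update_dataset_date(dataset["id"])  # hypothetical: e.g. package_patch via the CKAN API
```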
@brew mentions the CKAN deadoralive extension as another possible model to follow.
Took a quick look at the deadoralive source and, despite what it says, it looks pretty synchronous. Could be an issue at some point.
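To illustrate the concern (and one possible fix if it ever bites us), the per-URL checks could be fanned out over a thread pool rather than run one at a time. This is just an illustration with placeholder URLs, not a patch against deadoralive:

```python
# Illustration only: check many resource URLs concurrently instead of one by one.
from concurrent.futures import ThreadPoolExecutor

import requests

def is_alive(url):
    try:
        resp = requests.head(url, allow_redirects=True, timeout=30)
        return url, resp.status_code < 400
    except requests.RequestException:
        return url, False

urls = ["https://example.org/a.zip", "https://example.org/b.zip"]  # placeholders
with ThreadPoolExecutor(max_workers=10) as pool:
    for url, ok in pool.map(is_alive, urls):
        print(("OK  " if ok else "DEAD"), url)
```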
Thanks for checking that, @reubano — I've seen similar things in Drupal, etc, so I can imagine how it's set up.
In any case, as a basic architectural principle, we should never turn down a chance for horizontal scaling when the cost is this low. There's no reason to run these housekeeping batch tasks inside the main CKAN process, since they're not part of the interactive user experience, and there are strong reasons for running batch jobs externally:
- Performance: not competing with the Web layer for CPU/memory/disk space (beyond what's needed to respond to API calls).
- Stability: a bug in the housekeeping/batch code won't cause a crash or security hole in the HDX web site.
- Maintainability: releases of the batch housekeeping code don't have to be tightly coupled to releases of the Web site.