
Metadata version control with git

Open isedwards opened this issue 11 years ago • 8 comments

Try using a Git repository as a pycsw backend, with CSW providing the search interface and Git serving as an alternative to CSW-T.

Raised here: http://osgeo-org.1560.x6.nabble.com/General-CSW-questions-tp5067534p5067661.html

isedwards avatar Jul 23 '13 07:07 isedwards

I prefer git for most things, but perhaps fossil-scm is also worthy of attention on this ticket (with its single file repository based on sqlite database and immutable history)?

GeoNetwork's experience (using svn) is documented here: "Not all records in GeoNetwork are tracked as the compute and systems admin cost of this tracking for every record, particularly in large catalogs, is too high." http://geonetwork-opensource.org/manuals/trunk/eng/users/managing_metadata/versioning/index.html

isedwards avatar Jul 23 '13 07:07 isedwards

Git seems like a good first step to implement an scm backend design pattern, which we can then apply to fossil-scm, svn, etc.

Some options/thinking out loud:

  • manage metadata in Git, and simply have a process/script to update the underlying pycsw repository from Git periodically, or as a post-commit hook. People could interact with Git by other means?
  • enhance CSW-T to additionally transact with the scm (thereby having a managed copy of the metadata in Git as well as the CSW repository), when a user does insert/update/delete
  • Implement CSW extensions kind of like GeoServer does (http://geoserver.org/display/GEOS/Versioning+WFS+-+Extensions), with GetLog, GetDiff, and enhancing GetRecordById to fetch by a given version (recordVersion)
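To make the first option a bit more concrete, here is a minimal sketch of the "update from Git" step. It only turns the output of `git diff --name-status` into a list of catalogue actions; the function name `plan_sync`, the `.xml` suffix convention, and the status-to-action mapping are all assumptions for illustration, and how each action is then applied to the pycsw repository is deliberately left open.

```python
from typing import List, Tuple

# Hypothetical mapping from git status letters to catalogue actions.
ACTIONS = {"A": "insert", "M": "update", "D": "delete"}


def plan_sync(name_status: str, suffix: str = ".xml") -> List[Tuple[str, str]]:
    """Turn `git diff --name-status` output into (action, path) pairs.

    Only paths ending in `suffix` are treated as metadata records.
    """
    plan = []
    for line in name_status.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")  # git separates status and paths with tabs
        code, paths = fields[0][:1], fields[1:]
        if code == "R" and len(paths) == 2:
            # A rename is treated as delete of the old path plus insert of the new one.
            old, new = paths
            if old.endswith(suffix):
                plan.append(("delete", old))
            if new.endswith(suffix):
                plan.append(("insert", new))
        elif code == "C" and len(paths) == 2:
            # A copy only adds the new path.
            if paths[1].endswith(suffix):
                plan.append(("insert", paths[1]))
        elif code in ACTIONS and paths and paths[0].endswith(suffix):
            plan.append((ACTIONS[code], paths[0]))
        # Any other status (e.g. type changes) is skipped in this sketch.
    return plan
```

In a post-commit hook the input could come from something like `git diff --name-status HEAD~1 HEAD` (which would need special handling for the very first commit); a periodic job would instead diff against the last commit it synchronized.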

Auth: I haven't given much thought yet to access control against specific elements; however, it would be best to leverage an existing auth mechanism rather than creating one inline.

Migrations: the way the pycsw repository works, it is kind of agnostic to the structure of metadata records per se, but we should look into DB migrations regardless, for times where the underlying model itself changes.

tomkralidis avatar Jul 27 '13 12:07 tomkralidis

The first bullet seems very tractable, and would make for a great demonstration of the idea.

The second point would be required in the end, although honestly the major benefit of a git backend would be that you could manage the metadata content without CSW-T.

Third point is less intriguing to me -- again, less interested in CSW-based access to versioning. CSW's primary focus should be on search and discovery, and we can let real-life version control systems do the version control.

It would also be worth exploring Git as a more efficient mechanism for harvesting than CSW's protocol.

What would be stellar would be a git repo as a replacement for, not in addition to, the spatial database, but then you would certainly need some other mechanism for indexing... Maybe something like CouchDB is another backend to consider?

rclark avatar Aug 21 '13 16:08 rclark

Mercurial would also be a good choice as a back-end, since it is written in Python and is very similar to Git.

Regarding CouchDB, there is an open issue #120 :)

kalxas avatar Aug 21 '13 16:08 kalxas

@rclark good points here. I think a Git repo as the backend is a good next step.

Backends in pycsw are extensible. So something like pycsw/plugins/repository/git/git.py would be required, with the same setup/signatures as https://github.com/geopython/pycsw/blob/master/pycsw/plugins/repository/geonode/geonode_.py or https://github.com/geopython/pycsw/blob/master/pycsw/plugins/repository/odc/odc.py, adding insert, update, delete functions which would be the CSW-T functions to interact with Git.

I think this would be very easy to do for Git transactions, with a few config switches to detect it's a git backend, as well as u/p credentials.
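As a rough sketch of what the Git side of those transaction functions might look like: the class below stores one XML file per record and makes one commit per insert/update/delete. The class name `GitRecordStore`, the one-file-per-record layout, and the simplified signatures are assumptions for illustration only; they are not pycsw's actual repository plugin interface, and credentials are not handled here.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def run_git(args: List[str], cwd: Path) -> None:
    """Default runner: shell out to the git CLI and fail loudly on errors."""
    subprocess.run(["git", *args], cwd=cwd, check=True)


class GitRecordStore:
    """Hypothetical store: one XML file per record, one commit per transaction."""

    def __init__(self, workdir: str,
                 git: Callable[[List[str], Path], None] = run_git):
        self.workdir = Path(workdir)
        self.git = git  # injectable so the store can be exercised without git

    def _path(self, identifier: str) -> Path:
        # Assumes identifiers are safe to use as file names.
        return self.workdir / f"{identifier}.xml"

    def _save(self, identifier: str, xml: str, verb: str) -> None:
        path = self._path(identifier)
        path.write_text(xml, encoding="utf-8")
        self.git(["add", path.name], self.workdir)
        self.git(["commit", "-m", f"{verb} {identifier}"], self.workdir)

    def insert(self, identifier: str, xml: str) -> None:
        if self._path(identifier).exists():
            raise ValueError(f"record already exists: {identifier}")
        self._save(identifier, xml, "insert")

    def update(self, identifier: str, xml: str) -> None:
        if not self._path(identifier).exists():
            raise KeyError(identifier)
        self._save(identifier, xml, "update")

    def delete(self, identifier: str) -> None:
        path = self._path(identifier)
        if not path.exists():
            raise KeyError(identifier)
        # `git rm` removes the file from both the index and the working tree.
        self.git(["rm", "--quiet", path.name], self.workdir)
        self.git(["commit", "-m", f"delete {identifier}"], self.workdir)
```

A real backend would also have to keep whatever index is used for searching in step with these commits, which is the open question below.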

The question then becomes how do we index and make the repository searchable.

Some options / further thinking out loud:

  • one could use, say, the GitHub API to search a repository, but this would only loosely work for freetext-style searching, so you would have to post-process the API response for finer-grained searching like CSW can do (i.e. dc:title = 'foo'). The same goes for SFSQL spatial predicates
  • use a parallel indexing system like CouchDB. This would also require SFSQL spatial predicate support. Does anyone know if GeoCouch supports this?
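To illustrate the post-processing mentioned in the first option: assuming a freetext search has already returned a set of candidate records as Dublin Core XML strings, an exact-match filter for dc:title = 'foo' could be sketched as below. The function name `filter_by_title` and the shape of the input are assumptions; only the standard library is used.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List

# Dublin Core elements namespace, in ElementTree's {uri}tag notation.
DC = "{http://purl.org/dc/elements/1.1/}"


def filter_by_title(records: Iterable[str], title: str) -> List[str]:
    """Keep only the records that have a dc:title exactly equal to `title`."""
    matches = []
    for xml in records:
        root = ET.fromstring(xml)
        titles = [el.text for el in root.iter(f"{DC}title")]
        if title in titles:
            matches.append(xml)
    return matches
```

A freetext hit on "foo" would also return records titled "foo bar" or records that only mention foo in the description; this second pass is what narrows them down to the property-level match that CSW filters express.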

tomkralidis avatar Aug 22 '13 13:08 tomkralidis

GeoNetwork and ESRI Geoportal both use Lucene for indexing, if I'm not mistaken. I think CouchDB has validity as its own backend for pycsw, but maybe not so much for this purpose.

Even more thinking out loud

  • Wouldn't want to rely on GitHub API unless you were explicitly making it a "GitHub" and not just a "Git" backend.
  • Lucene (or something along those lines) can abstract the search/indexing away from your backend implementation, and that's really intriguing, but at the same time one of the great things about pycsw is how light-weight it is in comparison to the other Java-based CSW servers. For a file-based backend though, you more or less have to rely on some other piece of the stack to index/search I guess?

rclark avatar Aug 22 '13 16:08 rclark

@rclark thanks for the info. Agreed, lightweight is a rule of pycsw.

Has anyone tried Whoosh (http://whoosh.ca)? From what I can see, it is a pure Python index/search library, and I think it would be a great fit. The only thing is that it doesn't do spatial. What would be really cool is for Whoosh to support Shapely (optional spatial support, even if that part is not pure Python).
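Until something like that exists, one way to picture the spatial gap being filled is a second pass over the text-index hits. The sketch below assumes each hit carries a record identifier and a bounding box; the names `BBox`, `intersects`, and `spatial_filter` are hypothetical, and boxes that cross the antimeridian are not handled.

```python
from typing import Iterable, List, NamedTuple, Tuple


class BBox(NamedTuple):
    minx: float
    miny: float
    maxx: float
    maxy: float


def intersects(a: BBox, b: BBox) -> bool:
    """True if the two boxes overlap or touch."""
    return (a.minx <= b.maxx and b.minx <= a.maxx and
            a.miny <= b.maxy and b.miny <= a.maxy)


def spatial_filter(hits: Iterable[Tuple[str, BBox]], query: BBox) -> List[str]:
    """Keep the identifiers of hits whose bounding box intersects the query box."""
    return [record_id for record_id, box in hits if intersects(box, query)]
```

This only covers a bounding-box intersects test; the other spatial predicates would need real geometry support, which is where something like Shapely would come in.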

tomkralidis avatar Aug 22 '13 18:08 tomkralidis

Update: External git workflow is being used in ESA's Open Science Catalogue https://opensciencedata.esa.int/ with pycsw as the Catalogue backend.

Records are stored/manipulated on GitHub, and there is a hook that triggers pycsw harvesting from GitHub Pages to synchronize the records in the db.

kalxas avatar Oct 09 '22 08:10 kalxas