Description

Refreshing harvested records with the refresh_harvested_records command does not work properly: the record corresponding to the CSW service is deleted from the repository during the refresh.

Environment

operating system: ubuntu trusty
Python version: 2.7
pycsw version: master
source/distribution
- [x] git clone
- [ ] DebianGIS/UbuntuGIS
- [ ] PyPI
- [ ] zip/tar.gz
- [ ] other (please specify):
web server
- [x] Apache/mod_wsgi
- [ ] CGI
- [ ] other (please specify):

Steps to Reproduce

I have deployed two separate pycsw instances to test the harvesting process. Each instance uses its own repository, which is a PostgreSQL database with PostGIS enabled.

On the first instance I created 3 metadata records with identifiers "a", "b" and "c". I then started harvesting using OWSLib:

from owslib.csw import CatalogueServiceWeb
csw = CatalogueServiceWeb('http://second-pycsw-instance-url/')
csw.harvest('http://first-pycsw-instance-url/', resourcetype='http://www.opengis.net/cat/csw/2.0.2')

After that, 4 records appeared in the second repository: the first corresponds to the service itself, and the remaining three correspond to records "a", "b" and "c". However, when I refresh the harvested records using the corresponding pycsw-admin command, the service record disappears from the repository, and when I run the command again I get the message "No harvested records".

Additional Information

Sep 15 '16 08:09 igor-chernikov

After some debugging I found a possible source of the problem. Look at this line:

https://github.com/geopython/pycsw/blob/master/pycsw/ogc/csw/csw2.py#L1263

This code assumes that the record describing the service will be first in the returned list. However, PostgreSQL does not guarantee that records are returned in the order in which they were inserted. It is therefore possible that the variable service_identifier will be set to the identifier of one of the records harvested from the service, not to that of the service itself. Further: https://github.com/geopython/pycsw/blob/master/pycsw/ogc/csw/csw2.py#L1282

The parse_record function, when parsing a CSW service, returns a list whose first item always describes the service itself. If the wrong service_identifier was set, then this code: https://github.com/geopython/pycsw/blob/master/pycsw/ogc/csw/csw2.py#L1320 replaces the newly generated identifier of the service with an identifier that does not belong to the service. As a result, the correct service identifier is never added to the list named ir, which is used to compare the records in the repository with the parsed records, and the service record is deleted.
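The chain of events described above can be sketched in a few lines. This is a hypothetical simplification, not the code in csw2.py: the refresh function, the identifiers "svc" and "generated-id", and the set-difference step are assumptions made only to illustrate how a wrong service_identifier ends with the service record being treated as stale.

```python
def refresh(repo_ids, service_identifier):
    """Hypothetical simplification of the refresh flow described above."""
    # parse_record-like output: the first item is the service itself, and it
    # arrives with a freshly generated identifier.
    parsed_ids = ['generated-id', 'a', 'b', 'c']
    # The generated identifier of the service is replaced by the stored one.
    parsed_ids[0] = service_identifier
    ir = parsed_ids
    # Anything in the repository that is missing from ir is treated as stale.
    return sorted(set(repo_ids) - set(ir))


repo_ids = ['svc', 'a', 'b', 'c']
print(refresh(repo_ids, 'svc'))  # service identifier read from the right row
print(refresh(repo_ids, 'a'))    # service identifier read from the wrong row
```

With the correct identifier nothing is flagged for deletion; with the identifier of record "a", the service record "svc" is the one that no longer appears in ir.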

After changing the code this way

identifier_attr = self.parent.context.md_core_model['mappings']['pycsw:Identifier']
type_attr = self.parent.context.md_core_model['mappings']['pycsw:Type']
service_result = filter(lambda res: getattr(res, type_attr) == 'service', results)[0]
service_identifier = getattr(service_result, identifier_attr)
service_results = results

the record corresponding to the service is no longer removed.
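Note that filter(...)[0] relies on Python 2.7, where filter returns a list; in Python 3 it returns an iterator and cannot be indexed. A possible variant that runs on both is sketched below. The helper name find_service_identifier, the attribute names "identifier" and "type", and the SimpleNamespace rows are hypothetical stand-ins; in pycsw the attribute names would come from the md_core_model mappings shown above.

```python
from types import SimpleNamespace


def find_service_identifier(results, identifier_attr, type_attr):
    """Return the identifier of the service record, wherever it sits in results."""
    service = next(
        (res for res in results if getattr(res, type_attr) == 'service'),
        None,
    )
    if service is None:
        # Make the "no service record" case explicit instead of an IndexError.
        raise LookupError('no service record among harvested results')
    return getattr(service, identifier_attr)


# Hypothetical rows, deliberately not in insertion order.
results = [
    SimpleNamespace(identifier='a', type='dataset'),
    SimpleNamespace(identifier='svc', type='service'),
    SimpleNamespace(identifier='b', type='dataset'),
]
print(find_service_identifier(results, 'identifier', 'type'))
```

Raising an explicit error when no service record is present is a design choice of this sketch, not something the original patch does.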

After that, however, I noticed another strange behaviour. If new records have been added to the harvested service, re-harvesting that service does not create the new records in the repository. I have also found a possible source: https://github.com/geopython/pycsw/blob/master/pycsw/ogc/csw/csw2.py#L1341

During a second harvest of the remote service, this code prevents new records from ever being inserted, because the service identifier is always found and so len(results) > 0 is always True. I commented out this line and harvesting now works fine. Tested with CSW, WMS, WFS and WMTS services.

I also found it strange that the URL of the service is saved to the field named source, while refresh_harvested_records relies on the mdsource field. I replaced all references to the source field with references to mdsource, but I'm not sure this is correct. @tomkralidis, can you please explain the purpose of each of these fields, take a look at my changes here: https://github.com/igor-chernikov/pycsw/commit/fdfd4fe3b0471ebc95e5c7d541337444d7d84010 and check whether I missed something?

Sep 16 '16 09:09 igor-chernikov

@igor-chernikov I cannot reproduce the issue at all. Workflow against an empty pycsw instance:

from owslib.csw import CatalogueServiceWeb
csw = CatalogueServiceWeb('http://localhost:8000/')
csw.harvest('http://demo.pycsw.org/cite/csw', resourcetype='http://www.opengis.net/cat/csw/2.0.2')

Then against the http://localhost:8000 pycsw instance:

pycsw-admin.py -f default.cfg -c refresh_harvested_records

Does this work for you or am I missing something?

Oct 04 '16 11:10 tomkralidis

@tomkralidis, your scenario works fine if I use SQLite as the database backend, but with PostgreSQL + PostGIS the issue still exists. Furthermore, to make the refresh_harvested_records command work with PostgreSQL I had to change this line: https://github.com/geopython/pycsw/blob/master/pycsw/core/admin.py#L409 to

count, records = repos.query(constraint={'where': "mdsource != 'local'", 'values': []})

because postgres doesn't allow string literals to be enclosed in double quotes.
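In standard SQL, single quotes delimit string literals and double quotes delimit identifiers, so PostgreSQL reads "local" as a column name. The sketch below shows the single-quoted literal next to a bound parameter, which avoids the quoting question altogether. It uses Python's built-in sqlite3 module only so that it runs without a server; the records table and its rows are hypothetical, and it does not go through pycsw's repos.query.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE records (identifier TEXT, mdsource TEXT)')
conn.executemany(
    'INSERT INTO records VALUES (?, ?)',
    [
        ('svc', 'http://first-pycsw-instance-url/'),
        ('a', 'http://first-pycsw-instance-url/'),
        ('local-1', 'local'),
    ],
)

# Single-quoted string literal: portable across SQLite and PostgreSQL.
literal = conn.execute(
    "SELECT identifier FROM records WHERE mdsource != 'local' ORDER BY identifier"
).fetchall()

# Bound parameter: the driver supplies the value, so no quoting is needed.
bound = conn.execute(
    'SELECT identifier FROM records WHERE mdsource != ? ORDER BY identifier',
    ('local',),
).fetchall()

print(literal)
print(bound)
conn.close()
```

Both queries return only the harvested rows, leaving the locally created record out.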

Oct 06 '16 05:10 igor-chernikov