pycsw
CSW service pagination results are random when sort/order by is not provided by the client
The paginated results from the CSW endpoint are returned in random order when listing records, which explains why the same record appears in multiple page requests. With sortby added, the records are listed in a consistent order.
Can this be addressed on the server side, rather than relying on the client adding sortby to make it right?
Here is how to demonstrate the issue with a command-line request.
The following command gets the 14000th record.
Without sortby, a different record is returned on each request.
curl -X POST -d @noaa.xml "https://data.noaa.gov/csw?request=GetCapabilities&service=CSW" --header "Content-Type:text/xml"
With sortby, the same record is returned every time.
curl -X POST -d @noaa.sorted.xml "https://data.noaa.gov/csw?request=GetCapabilities&service=CSW" --header "Content-Type:text/xml"
Save this one as file noaa.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<csw:GetRecords xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:csw="http://www.opengis.net/cat/csw/2.0.2" outputSchema="http://www.isotc211.org/2005/gmd" outputFormat="application/xml" version="2.0.2" service="CSW" resultType="results" startPosition="14000" maxRecords="1" xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-discovery.xsd">
<csw:Query typeNames="csw:Record">
<csw:ElementSetName>brief</csw:ElementSetName>
</csw:Query>
</csw:GetRecords>
Save this one as file noaa.sorted.xml
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<csw:GetRecords xmlns:ogc="http://www.opengis.net/ogc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:csw="http://www.opengis.net/cat/csw/2.0.2" outputSchema="http://www.isotc211.org/2005/gmd" outputFormat="application/xml" version="2.0.2" service="CSW" resultType="results" startPosition="14000" maxRecords="1" xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-discovery.xsd">
<csw:Query typeNames="csw:Record">
<csw:ElementSetName>brief</csw:ElementSetName>
<ogc:SortBy>
<ogc:SortProperty>
<ogc:PropertyName>apiso:Title</ogc:PropertyName>
<ogc:SortOrder>DESC</ogc:SortOrder>
</ogc:SortProperty>
</ogc:SortBy>
</csw:Query>
</csw:GetRecords>
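The only difference between the two files is the ogc:SortBy block. As a rough sketch (not part of pycsw or OWSLib), the sorted request can be derived from the unsorted one with the Python standard library. The helper name add_sortby is made up for illustration; the property and order defaults mirror noaa.sorted.xml.

```python
import xml.etree.ElementTree as ET

CSW_NS = "http://www.opengis.net/cat/csw/2.0.2"
OGC_NS = "http://www.opengis.net/ogc"
# Keep the familiar prefixes when the document is serialized again.
ET.register_namespace("csw", CSW_NS)
ET.register_namespace("ogc", OGC_NS)


def add_sortby(getrecords_xml, property_name="apiso:Title", order="DESC"):
    """Return the GetRecords request with an ogc:SortBy appended to csw:Query.

    Hypothetical helper: it leaves a request that already has a SortBy alone.
    """
    root = ET.fromstring(getrecords_xml)
    query = root.find(f"{{{CSW_NS}}}Query")
    if query is None:
        raise ValueError("request has no csw:Query element")
    if query.find(f"{{{OGC_NS}}}SortBy") is None:
        # SortBy is appended as the last child of csw:Query.
        sortby = ET.SubElement(query, f"{{{OGC_NS}}}SortBy")
        prop = ET.SubElement(sortby, f"{{{OGC_NS}}}SortProperty")
        ET.SubElement(prop, f"{{{OGC_NS}}}PropertyName").text = property_name
        ET.SubElement(prop, f"{{{OGC_NS}}}SortOrder").text = order
    return ET.tostring(root, encoding="unicode")
```

Passing the contents of noaa.xml through add_sortby should give a request equivalent to noaa.sorted.xml, apart from prefix and whitespace differences.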
This primarily affects the CKAN spatial harvester, which does not use a sortby parameter when constructing the OWSLib GetRecords request.
We have temporarily added a default sortby on 'apiso:Modified' to the GetRecords request on the CSW client side as a workaround (https://github.com/GSA/OWSLib/commit/73ecdf74e6ee31d64d1b66ab786719bc0609aefb). It seems only Modified works as expected, not Title or Identifier. This workaround will be in place until this issue is addressed on the server side.
@FuhuXia aside: owslib.csw.getrecords2 accepts a sortby object just the same, so you could also push this to the caller, which IMHO would be cleaner.
@FuhuXia what issues did you run into when setting apiso:Title or apiso:Identifier as the sortable? I can't reproduce the issue. Any example(s)?
I agree with the comment about making the code change in the caller instead of getrecords2. The reason I did it in getrecords2 is that owslib.fes is already available there, while in the caller I would have to import it first. I was trying to keep the temp fix to as few changed lines as possible; it is supposed to be short-lived anyway.
Sorting by apiso:Title and apiso:Identifier works with command-line requests. When I tried it within getrecords2, for a few tries it seemed to have no effect: all records came back in the same order with or without sortby, while it worked on the first try when I used apiso:Modified. I did not spend much time finding out why, since as long as I had one sortable working, the temp fix was done. :)
@FuhuXia we've discovered this problem in our NGDS CKAN build; I guess it wasn't fixed back in Jan when working on the project... It cost us several days to figure out what was going on. Is there a reliable fix yet?
Hi all
It seems to me that the described behaviour is not a bug; it is actually the expected behaviour according to the standard. The CSW standard does not mandate a specific order in which records should be returned when responding to a GetRecords request (please check section 10.8 of the document 07-006r1).
In addition to this, section 6.2.1 (Query Language support) of the standard states that the query language was designed in a similar fashion to SQL, which implies (in my opinion) that data can be stored in a database in order to ease implementation.
pycsw does indeed store records in a database (currently postgresql/postgis, mysql or sqlite). Databases (and pycsw too) usually care about returning results as fast as possible. To make this possible, the database uses a number of different strategies for processing queries. However, none of these strategies enforces a specific sorting order on the result set. This is just a side effect of how a database works and is to be expected. When a user wants the records to be queried and sorted, the SQL ORDER BY clause must be provided (and this carries a performance penalty, meaning that the query is slower to process). Please check the following wikipedia link for more info:
https://en.wikipedia.org/wiki/Order_by
For some more in-depth discussion about this please also check out this link and some of the links in the page: http://stackoverflow.com/questions/10064532/the-order-of-a-sql-select-statement-without-order-by-clause
and also: http://tkyte.blogspot.pt/2005/08/order-in-court.html
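To make the ORDER BY point concrete, here is a small sketch using an in-memory SQLite table. The table and column names are invented for the example and are not pycsw's actual schema. Since titles can repeat, the identifier is added as a tie-breaker so that every page boundary is fully determined.

```python
import sqlite3

# Illustrative table only; not the pycsw repository schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (identifier TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(f"id-{i:02d}", f"Title {i % 4}") for i in range(10)],
)


def fetch_page(conn, offset, limit):
    """Return one page of identifiers in a fully specified order."""
    # title is not unique, so identifier breaks ties; without the
    # ORDER BY the database may return rows in any order it likes.
    rows = conn.execute(
        "SELECT identifier FROM records "
        "ORDER BY title, identifier LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return [row[0] for row in rows]


pages = [fetch_page(conn, offset, 3) for offset in range(0, 10, 3)]
seen = [identifier for page in pages for identifier in page]
```

With the ordering fixed, the pages cover each of the ten records exactly once. Dropping the ORDER BY leaves the page contents up to the query planner.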
The CSW standard also features the SortBy parameter for the GetRecords operation (see section 10.8.4.12) whose purpose is exactly the same as the SQL ORDER BY clause. As @kvuppala states, when the SortBy parameter is included in the request, the records returned by pycsw are ordered as expected. So there does not seem to be a bug in pycsw.
Now, with that said, we may decide that pycsw should always return records in the same order. This would probably involve using a hidden ORDER BY clause when querying the database even if the user did not supply one. Personally I am not in favour of this, as it slows down query results for everyone. I guess the responsibility should be on the client for making a correct request and not on the server.
Since most people seem to be using owslib as a client and both projects share the same family ;) maybe an optional parameter could be added to the getrecords operation in owslib in order to always ask for sorted results. What do you guys think @tomkralidis @kalxas @FuhuXia @smrazgs?
@ricardogsilva thanks for the in depth analysis/explanation. We followed this thinking when implementing in pycsw in the early days. Like any database client, if you want consistent ordering in your search results, then setting ogc:SortBy gives you exactly that.
OWSLib already supports setting ogc:SortBy in owslib.csw.CatalogueServiceWeb.getrecords2. In my experience the issue arises when folks / clients are not aware of the implications of not setting ogc:SortBy.
For the CKAN workflow:
CKAN -> OWSLib (owslib.csw.CatalogueServiceWeb.getrecords2) -> CSW service
I think the proper fix here is to update CKAN's use of OWSLib to set ogc:SortBy.
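For illustration, a caller-side paging loop that always passes a sort order might look like the sketch below. It is not the actual ckanext-spatial code. The function names are made up, and the choice of apiso:Modified simply follows the workaround mentioned earlier in this thread. The loop relies on the results dictionary (matches, returned, nextrecord) and the records mapping that OWSLib fills in after getrecords2.

```python
def make_default_sortby():
    """Build a sort object for getrecords2 (requires OWSLib to be installed)."""
    from owslib.fes import SortBy, SortProperty

    # apiso:Modified is an assumption taken from the workaround above.
    return SortBy([SortProperty("apiso:Modified", "ASC")])


def iter_records(csw, sortby, page_size=100):
    """Yield (identifier, record) pairs page by page with a fixed sort order.

    csw is expected to behave like owslib.csw.CatalogueServiceWeb.
    """
    start = 1  # CSW startPosition is 1-based
    while True:
        csw.getrecords2(
            sortby=sortby,
            esn="brief",
            startposition=start,
            maxrecords=page_size,
        )
        yield from csw.records.items()
        next_record = csw.results["nextrecord"]
        # Stop on an empty page or when the server reports no further records.
        if csw.results["returned"] == 0 or next_record == 0:
            break
        if next_record > csw.results["matches"]:
            break
        start = next_record
```

A harvester would call iter_records(CatalogueServiceWeb(url), make_default_sortby()) with its own endpoint URL, so every page request carries the same ogc:SortBy.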
In any case it is worth adding some explanation on pagination and sorting results in the pycsw documentation as well as the website FAQ.
For good measure I will also bring this up in OGC for awareness (perhaps there are additional viewpoints/experiences that may help this issue overall -- maybe more explicit wording in the CSW spec can also help).
Having said this, if there is significant interest/demand we could add an optional configuration setting called repository.sortby_field or something, which specifies a column to sort by on the server side for GetRecords requests. Queries that do not specify ogc:SortBy while this setting is turned on will have default sorting done on the server side. Queries explicitly specifying ogc:SortBy will work as they do currently (if repository.sortby is set, it is overridden in this context). repository.sortby is not set by default and there would be documentation which elaborates on the performance implications if used/set. As @ricardogsilva mentions this has negative performance implications (even for those without pagination workflows/requirements).
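A minimal sketch of the precedence rule described above, assuming the setting ends up being called repository.sortby_field. That name is only the suggestion from this comment, not an existing pycsw option, and resolve_sort is a made-up helper.

```python
import configparser

# Hypothetical configuration; sortby_field is not an existing pycsw option.
SAMPLE_CONFIG = """
[repository]
sortby_field = title
"""


def resolve_sort(request_sortby, default_sortby_field=None):
    """Pick the ordering for a GetRecords query.

    request_sortby: (property, order) tuple parsed from ogc:SortBy, or None.
    default_sortby_field: value of the hypothetical setting, or None.
    """
    if request_sortby is not None:
        return request_sortby  # an explicit client sort always wins
    if default_sortby_field:
        return (default_sortby_field, "ASC")
    return None  # no sort at all: fastest, but the order is unspecified


config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)
default_field = config.get("repository", "sortby_field", fallback=None)
```

With the setting absent, resolve_sort(None, None) returns None, so the current unsorted behaviour and its performance stay the default.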
Update: the (draft) CSW 3.0 specification allows for a DefaultSortingAlgorithm URL reference in Capabilities XML:
A reference to a description of the default sort which is the sort that the server applies if no SortBy clause (see 7.3.4.11) is specified in a request. Absence of this constraint implies that the default sort is alphabetical by Title (see 6.6.3) in ascending order.
This is echoed again in Requirement 109:
If no sort is specified and if no default sort is specified in the capabilities document then it is assumed that the server will sort responses alphabetically by Title in ascending order.
Note: CKAN PR fix issued in https://github.com/ckan/ckanext-spatial/pull/136
In that case, it seems we'll need the default sorting for CSW3.0 after all. The fix proposed by @tomkralidis (adding an extra configuration parameter) seems great.
FYI this is now fixed for the CKAN use case in ckanext-spatial https://github.com/ckan/ckanext-spatial/pull/136
Moving this to 2.0.0 given the CSW3-based enhancement.
@tomkralidis Hi Tom, I think that the existence of the startPosition parameter of the GetRecords request should imply a consistent pagination behaviour, at least in the scenario where records are not deleted in a CSW catalogue but only modified or added. Ordering by title or fileIdentifier or any other element actually breaks the pagination very easily when a record is deleted but also when modified or added. This would require that the sort order be based on an internal identifier (like a classical relational table sequence, for example). I think that a future CSW specification should include some record management features.
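The effect described here can be reproduced with a toy catalogue. The sketch below is plain Python and does not model pycsw internals. The seq field stands in for an internal sequence number, and a record is added between the first and second page requests.

```python
def page(records, key, start, size):
    """Return the identifiers of one page under the given sort key."""
    ordered = sorted(records, key=key)
    return [record["id"] for record in ordered[start:start + size]]


def harvest(records, key, size, insert_after_first_page):
    """Page through the catalogue while one record is added mid-harvest."""
    records = list(records)
    seen = []
    start = 0
    while True:
        ids = page(records, key, start, size)
        if not ids:
            break
        seen.extend(ids)
        if start == 0:
            # Simulate a record being added between two page requests.
            records.append(insert_after_first_page)
        start += size
    return seen


catalogue = [
    {"seq": i, "id": f"rec-{i}", "title": title}
    for i, title in enumerate(["B", "C", "D", "E"], start=1)
]
new_record = {"seq": 5, "id": "rec-5", "title": "A"}

by_title = harvest(catalogue, lambda r: r["title"], 2, new_record)
by_seq = harvest(catalogue, lambda r: r["seq"], 2, new_record)
```

Sorted by title, the new record lands at the front and shifts every later page, so rec-2 is returned twice and rec-5 is never seen. Sorted by the sequence number, the new record is appended at the end and each record is returned exactly once. Deletions would still shift offsets in both cases.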
Hi @tomkralidis , @ricardogsilva
I can't reproduce this problem on http://demo.pycsw.org/cite/csw?service=CSW&version=2.0.2&request=GetCapabilities anymore.
version: pycsw 2.5.dev0
request:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<csw:GetRecords xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:csw="http://www.opengis.net/cat/csw/2.0.2" outputSchema="http://www.opengis.net/cat/csw/2.0.2" outputFormat="application/xml" version="2.0.2" service="CSW" resultType="results" startPosition="11" maxRecords="1" xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-discovery.xsd">
<csw:Query typeNames="csw:Record">
<csw:ElementSetName>brief</csw:ElementSetName>
</csw:Query>
</csw:GetRecords>
Does this mean that the problem no longer has to be solved on the client side, because the default pycsw implementation (without additional configuration) supports sorted paginated records?
Thanks in advance!
@DeordD it is still incumbent on the client to explicitly state sort order.