ckanext-spatial
ckan-pycsw load fails with 3.5M datasets
The ckan-pycsw load job isn't built to handle a large number of datasets. It pulls all the CKAN datasets into memory, then all the existing pycsw records, then does set operations to figure out which datasets are new, changed, or deleted. We started seeing the job run out of memory on the machine when working with 3.5 million datasets in CKAN. Additionally, as the number of datasets grows, the job still expects to be the sole worker, running as a cron job once per day. It would be nice if this work could be split up over time and across machines.
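For illustration, here is a rough sketch of the diffing strategy described above. The helper names (`fetch_all_ckan_datasets`, `fetch_all_pycsw_records`) and the record fields are hypothetical stand-ins, not the actual ckan-pycsw internals; the point is that both sides are fully materialized in memory before any comparison happens.

```python
def diff_datasets(fetch_all_ckan_datasets, fetch_all_pycsw_records):
    # Both collections are loaded fully into memory up front,
    # which is what breaks down at 3.5M datasets.
    ckan = {d["id"]: d for d in fetch_all_ckan_datasets()}
    pycsw = {r["identifier"]: r for r in fetch_all_pycsw_records()}

    # Set operations over the id sets classify every dataset.
    new = ckan.keys() - pycsw.keys()
    deleted = pycsw.keys() - ckan.keys()
    changed = {
        i for i in ckan.keys() & pycsw.keys()
        if ckan[i]["metadata_modified"] > pycsw[i]["modified"]
    }
    return new, changed, deleted
```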
As a hack, I did some work to fetch datasets in batches of 1000 and process each batch before fetching the next (see the sketch below). But ultimately, I think you would want the pycsw update to happen in "real time" as part of harvesting: if a dataset is updated, it should be updated in pycsw; if it is deleted, its record should be removed; if it doesn't yet exist in pycsw, it should be added.
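For reference, here is a minimal sketch of the batching hack, built on CKAN's standard `package_search` action, which supports `rows`/`start` pagination. `CKAN_URL` and the per-dataset pycsw step are assumptions added to keep the example self-contained.

```python
import requests

CKAN_URL = "http://localhost:5000"  # assumed CKAN instance
BATCH_SIZE = 1000

def iter_datasets():
    """Yield every dataset, fetching them in pages of BATCH_SIZE."""
    start = 0
    while True:
        resp = requests.get(
            f"{CKAN_URL}/api/3/action/package_search",
            params={"rows": BATCH_SIZE, "start": start},
        )
        resp.raise_for_status()
        results = resp.json()["result"]["results"]
        if not results:
            break
        yield from results
        start += BATCH_SIZE

for dataset in iter_datasets():
    # Process one dataset at a time instead of holding all 3.5M
    # in memory, e.g. upsert the corresponding pycsw record here.
    pass
```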
This work has already been done as part of the PublicaMundi EU project: https://github.com/PublicaMundi and https://github.com/PublicaMundi/ckanext-publicamundi. We are going to port this to the latest CKAN in the next 6 months.
@kalxas awesome, thank you! Let me know if I can help with this effort. Are you planning on adding this to ckanext-spatial?
No, this is specific to the ckanext-publicamundi work and depends on a custom metadata schema plugin we implemented in order to fully support ISO 19115 in CKAN. Our plan is to release this work in several extensions within 2019.