ckanext-spatial
ckan-pycsw load fails with 3.5M datasets
The ckan-pycsw load job isn't built to handle a large number of datasets. It pulls all the CKAN datasets into memory, then all the existing pycsw records, then does set operations to figure out which datasets are new, changed, or deleted. We started seeing the job run out of memory on the machine when working with 3.5 million datasets in CKAN. Additionally, as the number of datasets grows, the job still expects to be the sole worker, running as a cron job once per day. It would be nice if this work could be split up over time and across machines.
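For illustration, here is a rough sketch of the diffing strategy described above. The helper names (`fetch_all_ckan_datasets`, `fetch_all_pycsw_records`) and the record fields are hypothetical stand-ins, not the actual ckan-pycsw internals; the point is that both sides are fully materialized in memory before any comparison happens.

```python
def diff_datasets(fetch_all_ckan_datasets, fetch_all_pycsw_records):
    # Both collections are loaded fully into memory up front,
    # which is what breaks down at 3.5M datasets.
    ckan = {d["id"]: d for d in fetch_all_ckan_datasets()}
    pycsw = {r["identifier"]: r for r in fetch_all_pycsw_records()}

    # Set operations over the id sets classify every dataset.
    new = ckan.keys() - pycsw.keys()
    deleted = pycsw.keys() - ckan.keys()
    changed = {
        i for i in ckan.keys() & pycsw.keys()
        if ckan[i]["metadata_modified"] > pycsw[i]["modified"]
    }
    return new, changed, deleted
```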
As a hack, I did some work to fetch datasets in batches of 1000 and process each batch before fetching the next (see the sketch below). But ultimately, I think you would want the pycsw update to happen in "real time" as part of harvesting: if a dataset is updated, it should be updated in pycsw; if it is deleted, its record should be removed; if it doesn't yet exist in pycsw, it should be added.
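For reference, here is a minimal sketch of the batching hack, built on CKAN's standard `package_search` action, which supports `rows`/`start` pagination. `CKAN_URL` and the per-dataset pycsw step are assumptions added to keep the example self-contained.

```python
import requests

CKAN_URL = "http://localhost:5000"  # assumed CKAN instance
BATCH_SIZE = 1000

def iter_datasets():
    """Yield every dataset, fetching them in pages of BATCH_SIZE."""
    start = 0
    while True:
        resp = requests.get(
            f"{CKAN_URL}/api/3/action/package_search",
            params={"rows": BATCH_SIZE, "start": start},
        )
        resp.raise_for_status()
        results = resp.json()["result"]["results"]
        if not results:
            break
        yield from results
        start += BATCH_SIZE

for dataset in iter_datasets():
    # Process one dataset at a time instead of holding all 3.5M
    # in memory, e.g. upsert the corresponding pycsw record here.
    pass
```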
This work has already been done as part of the PublicaMundi EU project: https://github.com/PublicaMundi and https://github.com/PublicaMundi/ckanext-publicamundi. We are going to port this to the latest CKAN in the next 6 months.
@kalxas awesome, thank you! Let me know if I can help with this effort. Are you planning on adding this to ckanext-spatial?
No, this is specific to the ckanext-publicamundi work and depends on a custom metadata schema plugin we implemented in order to fully support ISO 19115 in CKAN. Our plan is to release this work in several extensions within 2019.