
"Download these search results as CSV/JSON" button

Open davidread opened this issue 6 years ago • 7 comments

Situation: use a CKAN website to do a search for datasets. You want to download the search results as a CSV so you can work through them methodically and do further analysis.

We got asked for this many times in DGU, often by departments wanting a spreadsheet to keep track of their own datasets. I'm wondering if this is a useful feature in CKAN.

Yes, we have the package_search API. But it is not easy to convert the search filters in the URL to the form needed by package_search, and it always returns JSON when plenty of people want CSV. A simple serialization should suffice for this purpose - not all the resources and every field - for that they could use JSON. We could make it an option in package_search to return 'simple_csv', to hint that it is not complete.
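To illustrate the conversion problem, here is a minimal sketch of turning a search page URL into package_search parameters. It assumes the search page treats `q`, `sort` and `page` specially and every other query parameter as a facet filter that belongs in `fq`; the function name and the set of reserved parameters are my own, not part of CKAN, and `ext_*` parameters are simply skipped.

```python
from urllib.parse import urlsplit, parse_qsl

# Parameters assumed to be used by the search page itself rather than
# as facet filters (an assumption, not taken from CKAN's source).
RESERVED_PARAMS = {"q", "sort", "page"}


def search_url_to_package_search_params(url):
    """Translate a dataset search page URL into a package_search data_dict."""
    params = {"q": "", "fq": ""}
    filters = []
    for key, value in parse_qsl(urlsplit(url).query):
        if key in ("q", "sort"):
            # Free-text query and sort order map across one-to-one.
            params[key] = value
        elif key in RESERVED_PARAMS or key.startswith(("_", "ext_")):
            # Paging and internal/extension parameters are ignored here.
            continue
        else:
            # Anything else is treated as a facet filter, e.g. tags=health.
            escaped = value.replace('"', '\\"')
            filters.append('{}:"{}"'.format(key, escaped))
    params["fq"] = " ".join(filters)
    return params


example = search_url_to_package_search_params(
    "https://example.org/dataset?q=water&tags=health&res_format=CSV&page=2"
)
print(example)
```

The resulting dict could then be passed to package_search, with `rows` and `start` added for paging.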

A simple way to implement it is outlined here: https://stackoverflow.com/a/52081874/1512326

davidread avatar Aug 29 '18 16:08 davidread

Is this basically ckanapi-exporter integrated in CKAN? Or how is it different?

metaodi avatar Aug 29 '18 17:08 metaodi

Needs to be a background job, especially if it's a UI button. Queries on sites like NDHS could result in CSVs with 20 million+ rows.

TkTech avatar Aug 30 '18 00:08 TkTech

Thanks @metaodi and @TkTech.

Good to know that ckanapi-exporter is an alternative for use by someone with command-line skills. Or for the web implementation we might steal a bit of its package-to-CSV code.

@TkTech Good point that this would probably need a background job for sites with scale.

My current clients don't have a need for this, but pleased to be collecting wisdom on this, for when a site does want this in the future.

davidread avatar Aug 30 '18 09:08 davidread

@TkTech I agree with the proposal that I need to create a background job for large-scale sites. As for small to medium-scale sites, my current solution is enough to handle the workload and traffic. If I have some spare time, I will start work on the modification.

@metaodi I think David has already answered your question. Again, I'm dealing with clients who have less of a technology background than we do. They are not familiar with the CLI. So my guideline for the solution is to design a complete UI for them to use.

LingboTang avatar Aug 30 '18 16:08 LingboTang

Creating a controller that pages through the package_search results and streams out a CSV would be fairly straightforward. It would look a lot like the existing datastore "dump" controller https://github.com/ckan/ckan/blob/master/ckanext/datastore/controller.py#L40
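A rough sketch of the paging-and-streaming part, independent of any web framework. It assumes a `search` callable that behaves like package_search (accepts `rows` and `start`, returns a dict with `count` and `results`); the function names and the short field list are placeholders I picked for illustration, not an agreed 'simple_csv' format.

```python
import csv
import io

# Illustrative subset of dataset fields for a "simple" CSV (an assumption).
SIMPLE_FIELDS = ["name", "title", "notes", "metadata_modified"]


def iter_search_results(search, params, page_size=100):
    """Yield every dataset matching params, fetching one page at a time."""
    start = 0
    while True:
        page = search(dict(params, rows=page_size, start=start))
        results = page["results"]
        if not results:
            break
        for dataset in results:
            yield dataset
        start += len(results)
        if start >= page["count"]:
            break


def stream_simple_csv(search, params, fields=SIMPLE_FIELDS, page_size=100):
    """Yield CSV text one row at a time so the response never sits in memory."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)

    def render(row):
        # Write a single row, hand back its text, then reset the buffer.
        writer.writerow(row)
        chunk = buffer.getvalue()
        buffer.seek(0)
        buffer.truncate(0)
        return chunk

    yield render(fields)  # header row
    for dataset in iter_search_results(search, params, page_size):
        yield render([dataset.get(field, "") for field in fields])
```

Inside CKAN, `search` could be something like `lambda p: toolkit.get_action("package_search")({}, p)`, with the generator returned as the body of a streaming response. This only avoids holding the whole CSV in memory; it does not by itself address the request-time concern raised above for very large result sets.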

wardi avatar Aug 30 '18 16:08 wardi

@LingboTang I was not aware work on this had already started! It's definitely a nice feature, and it's great that it works for your use case, but it might belong in an extension rather than core CKAN for now. I will play devil's advocate on pretty much any PR for new features being added to core that would have performance or scaling implications. The general goal (see the projects tab on GitHub) is that all features of core CKAN should work happily with 20 million datasets.

TkTech avatar Aug 30 '18 17:08 TkTech

@TkTech To be fair to @LingboTang, I think their work is in their own extension.

Even better would be to make it a separate extension that anyone could use.

My personal preference would be to see this in core, with the scaling sorted, at some point in the future.

davidread avatar Aug 30 '18 19:08 davidread