
Datastore: Add interface for exporting custom formats

Open torfsen opened this issue 8 years ago • 6 comments

Currently the datastore can export data in multiple formats (#3390), but some interesting ones are still missing (e.g. Excel, LibreOffice). During a discussion on the ckan-dev mailing list, @wardi suggested allowing additional export formats to be implemented via an interface.

Here's how such an interface could look:

  • datastore_get_export_formats() would return a dict that maps format identifiers (e.g. xlsx) to dicts, each containing a UI label for the format (e.g. {'label': 'Microsoft Excel XLSX'}).
  • datastore_export(format, response, fields, options) would write the datastore fields to the response in the given format. options is a dict of additional options (e.g. the bom option from ckan/ckan#3390). A rough sketch of both hooks follows below.
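
A minimal sketch of what such a plugin interface might look like, under the assumptions above; the interface name IDatastoreExport and the exact signatures are illustrative, not existing CKAN code:

```python
# Hypothetical plugin interface for the two hooks proposed above.
# IDatastoreExport and both method names come from this proposal,
# not from CKAN itself.
import ckan.plugins.interfaces as interfaces


class IDatastoreExport(interfaces.Interface):

    def datastore_get_export_formats(self):
        """Return a dict mapping format ids to metadata dicts, e.g.
        {'xlsx': {'label': 'Microsoft Excel XLSX'}}."""
        return {}

    def datastore_export(self, format, response, fields, options):
        """Write the datastore fields to the response in the given
        format. options carries extra settings such as 'bom'."""
        pass
```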

torfsen avatar Mar 24 '17 15:03 torfsen

:+1:

For streaming, datastore_export(format, response, fields, options) would need to be a context manager that returns an object with a writerows() method, right?

wardi avatar Mar 24 '17 15:03 wardi

@wardi Yes, at least that's what the current implementation looks like (the method is writerow(list_of_values)).

torfsen avatar Mar 24 '17 15:03 torfsen

Might be a great time to switch this code to writerows() to skip the per-row overhead
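
For concreteness, here is a rough sketch (not CKAN code) of the context-manager shape being discussed, with a writerows() method so rows can be written in batches; a plain csv writer stands in for whatever per-format writer a plugin would provide:

```python
# Sketch only: a context manager that yields a writer with writerows(),
# so callers can stream batches of rows instead of one row at a time.
# csv.writer stands in for a plugin's per-format writer.
import contextlib
import csv


@contextlib.contextmanager
def datastore_export(format, response, fields, options):
    writer = csv.writer(response)
    writer.writerow([f['id'] for f in fields])  # header row
    yield writer                                # caller calls writer.writerows(batch)
    # any per-format footer / cleanup would go here


# usage:
# with datastore_export('csv', response, fields, {}) as w:
#     for batch in result_batches:
#         w.writerows(batch)
```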

wardi avatar Mar 24 '17 15:03 wardi

When we do this, we need to think about large tables in the datastore.

The current implementation is not performant: we get timeouts because the code blocks until the serialization completes.

Also, this operation is expensive, so we should implement some caching mechanism: once a datastore table has been exported in a given format, the exported file should persist until that table is modified.

In one deployment we leveraged nginx to do this (https://github.com/ckan/ideas-and-roadmap/issues/188), and we added some code to the datapusher to purge the nginx cache when a datastore table is updated.
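
As an illustration of that purge step (not the actual datapusher code), something like the following could invalidate a cached export, assuming nginx fronts the export endpoint with proxy_cache and a purge module such as ngx_cache_purge, and assuming this URL layout:

```python
# Hypothetical cache-purge helper; the URL layout and the availability of
# a PURGE method (e.g. via ngx_cache_purge) are assumptions.
import requests


def purge_export_cache(site_url, resource_id, fmt):
    url = '{0}/datastore/dump/{1}?format={2}'.format(site_url, resource_id, fmt)
    requests.request('PURGE', url, timeout=10)
```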

Finally, the conversion should be done asynchronously.
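
One way to take the conversion off the request cycle, sketched only: it assumes a background job queue such as the one later added to CKAN as toolkit.enqueue_job (or any RQ-style worker), and export_resource_to is a made-up task name:

```python
# Sketch: run the expensive conversion in a background worker.
# toolkit.enqueue_job exists in newer CKAN releases; export_resource_to
# is a hypothetical task function.
import ckan.plugins.toolkit as toolkit


def export_resource_to(resource_id, fmt):
    # do the slow serialization here and write the result somewhere the
    # web server (or the nginx cache) can serve it from
    pass


def schedule_export(resource_id, fmt):
    toolkit.enqueue_job(export_resource_to, [resource_id, fmt])
```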

On a related topic, maybe we can even repurpose this to let users optionally apply a filter while exporting, so it can be used not just for saving to alternate formats, but for saving subsets of the data as well.

Maybe one of the supported options in datastore_export could be a SQL WHERE clause.
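
For example, the options dict might look something like this (the 'where' key is purely illustrative, not an agreed-upon name):

```python
# Hypothetical options dict for datastore_export; 'where' is a made-up key.
options = {
    'bom': True,                                  # from ckan/ckan#3390
    'where': "country = 'NZ' AND year >= 2015",   # optional row filter
}
```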

jqnatividad avatar Mar 28 '17 23:03 jqnatividad

In one deployment with a datastore table of almost 1M rows, "export as" was timing out even with generous timeout limits.

We ended up implementing the "export as" using external tools.

For exporting to JSON, we ended up using https://github.com/Keyang/node-csvtojson.

For reference, the 1M-row table took 20 minutes to serialize to JSON using the current implementation (https://github.com/ckan/ckan/pull/3390); node-csvtojson took only 1.5 minutes. We're calling it via its CLI, spawned as a subprocess.

The thinking was: since the CSV used to feed the datapusher is already in the filesystem, just use that instead of getting the data from the datastore.

After the datapusher finishes, we spawn node-csvtojson, after which we warm up the nginx cache with a simple curl.
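
Roughly, that post-processing step looks like the following sketch, with made-up paths and URL; it assumes the csvtojson CLI and curl are installed:

```python
# Sketch of the external post-processing described above: convert the CSV
# already on disk with csvtojson, then warm the nginx cache with a GET.
# Paths and the cached export URL are placeholders.
import subprocess


def post_process(csv_path, json_path, cached_export_url):
    # csvtojson writes JSON to stdout; redirect it into the target file
    with open(json_path, 'wb') as out:
        subprocess.check_call(['csvtojson', csv_path], stdout=out)
    # a plain GET through nginx populates the proxy cache
    subprocess.check_call(['curl', '-s', '-o', '/dev/null', cached_export_url])
```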

Perhaps this approach can be generalized so that the alternate formats are created asynchronously by the datapusher while it inserts data into the datastore.

That is not to say the "export as" interface is no longer required; even with the datapusher "post-processor exporter" approach it remains useful, especially if it supports SQL filters.

jqnatividad avatar Mar 28 '17 23:03 jqnatividad

@jqnatividad clearly there's lots of room for improvement.

For one, we should be asking postgres to dump to a CSV stream and feed that data to the writers, instead of using datastore_search with all its wasted conversion, repeated permission checking and total-record-count calculations. Saving subsets of the data would be supported too. I'm sure we can get it down to a small fraction of the time it takes now.
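
A rough sketch of that COPY-based approach, assuming direct psycopg2 access to the datastore database; the dsn and the optional WHERE clause are placeholders:

```python
# Sketch: let postgres produce the CSV stream directly via COPY, instead of
# going through datastore_search. Connection details are placeholders, and
# in real code resource_id and where would need validation/escaping.
import psycopg2


def stream_table_as_csv(dsn, resource_id, out_file, where=''):
    sql = 'COPY (SELECT * FROM "{0}" {1}) TO STDOUT WITH CSV HEADER'.format(
        resource_id, where)
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # copy_expert streams rows from postgres straight into out_file
            cur.copy_expert(sql, out_file)
```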

There will always be datasets too large for streaming the converted data, though. Should we mark these as off-limits for stream conversion? Maybe a configuration setting with a maximum row count to be served from these endpoints?
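
If we went that way, the check could be as simple as the sketch below; the config key name is a made-up example, not an existing CKAN setting:

```python
# Sketch of a row-count guard; ckan.datastore.export_row_limit is a
# hypothetical config option.
import ckan.plugins.toolkit as toolkit


def export_allowed(total_rows):
    limit = int(toolkit.config.get('ckan.datastore.export_row_limit', 0))
    return limit == 0 or total_rows <= limit
```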

wardi avatar Mar 29 '17 00:03 wardi