Update Datastore Datapusher documentation to mention Express Loader
The current Datastore documentation makes no mention of Express Loader (https://github.com/davidread/ckanext-xloader), which is a marked improvement over Datapusher.
https://ckan.org/2017/11/20/express-loader-for-ckan-datastore/
For legacy purposes, I guess Datapusher needs to be maintained. But for new installations, shouldn't Express Loader be the recommended mechanism?
cc @davidread
@jqnatividad This seems sensible, given that I'd not heard of express-loader until just now.
Of course, I'm apparently not as informed as I should be, but I spend most of my time in the docs.
I'm supportive of this. Since xloader was launched it's been used in a dozen or so CKANs at OpenGov. And we've had some good validation of it being used and improved (PR1, PR2) by @metaodi, who switched the Open Data Zurich site from datapusher to xloader.
I have two concerns about xloader being the recommended choice:
- datapusher autodetects column types, whereas xloader assumes they are all text. So either an administrator manually changes them using the Data Dictionary, or they stay as text. Users may miss that when doing SQL queries on the datastore. I guess they can still cast a column to int (say) in the query though (see the sketch after this list). I think it is a reasonable trade-off for the speed, robustness, simplicity and debuggability of xloader.
- Loading large CSVs (100MB+) into Postgres causes Postgres to consume a lot of CPU (https://github.com/davidread/ckanext-xloader/issues/41), and resource pages can become unresponsive while a load is running. While there is a solution, adding a second Postgres server, it could do with a bit more battle testing. But maybe it's not going to get that without more people knowing about it.
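For illustration, here's a minimal sketch of casting a text column inside a `datastore_search_sql` query. The CKAN URL, resource ID, and column names are all made up:

```python
# Hypothetical example: a datastore resource loaded by xloader, so every
# column is text. The cast to int happens inside the SQL itself.
import requests

CKAN_URL = "https://demo.ckan.org"  # any CKAN site with the datastore enabled
RESOURCE_ID = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"  # made-up resource ID

sql = (
    f'SELECT "region", "amount"::int AS amount '
    f'FROM "{RESOURCE_ID}" '
    f'WHERE "amount"::int > 100'
)

resp = requests.get(
    f"{CKAN_URL}/api/3/action/datastore_search_sql",
    params={"sql": sql},
)
resp.raise_for_status()
for record in resp.json()["result"]["records"]:
    print(record["region"], record["amount"])
```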
It would be good to get comments from @amercader @wardi and the rest of the tech team on this.
If we make xloader the official choice, we should also move it from github.com/davidread/ckanext-xloader to github.com/ckan/ckanext-xloader (I am supporting it).
As all our sites have the Datastore installed, before Express Loader, the single largest source of ongoing support issues was the Datapusher.
One day it works without a hitch (albeit very slowly); the next day it fails because of some data quality issue in the last row of a large CSV, aborting the whole job. That's even more frustrating for large CSVs, since the problem may only be hit hours into the datapusher job!
@davidread, your first concern is a feature, not a bug. Uploading everything as text and casting the columns after the data is loaded is a much better workflow, with more graceful error handling (thanks to xloader's more verbose logging messages too), in my book.
As for your second concern, we have some insights that we'll add to https://github.com/davidread/ckanext-xloader/issues/41 which should help.
I've had similar experiences, and not just with CKAN. Guessing column types always causes more problems than it solves, especially when it starts doing things like truncating leading 0s because it thinks that a UPC is a number.
IMO the "better" solution is to continue expanding the data dictionary feature. We should only load the data after the data dictionary has been defined. Load the first couple of rows using something like papaparse to get column names and type hints. We could also allow selecting previously defined data dictionaries (for example, we have one site that uploads many datasets of energy usage samples, one per province; the data dictionary would be identical for every one).
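To make that concrete, here's a rough server-side sketch of the "peek at the first rows" step. The real widget would use papaparse in the browser; the type rules here are deliberately naive and purely illustrative:

```python
import csv
import io

def guess_type(value):
    """Very naive type hint for a single cell."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"

def peek_csv(data, sample_rows=5):
    """Read the header plus a few rows to pre-populate the dictionary form."""
    reader = csv.reader(io.StringIO(data.decode("utf-8")))
    header = next(reader)
    hints = {}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for name, cell in zip(header, row):
            guess = guess_type(cell)
            if name not in hints:
                hints[name] = guess
            elif hints[name] != guess:
                # Conflicting samples: fall back to plain text.
                hints[name] = "text"
    return hints

print(peek_csv(b"province,kwh\nOntario,1200\nQuebec,950\n"))
# -> {'province': 'text', 'kwh': 'integer'}
```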
The pros as I see it:
- Not using variable length text fields (better data layout on disk without vacuums)
- Avoids forcing the use of `CAST`/`::` everywhere, which is more user-friendly when writing queries
- Properly typed, efficient indexes (instead of text)
- Simple user-friendly form to define the schema
> For legacy purposes, I guess Datapusher needs to be maintained. But for new installations, shouldn't Express Loader be the recommended mechanism?
@jqnatividad switching away from datapusher should be trivial, xloader is mostly a drop-in replacement. I would vote for immediate deprecation and removal on the next release unless there's a strong reason to keep it. We would backport security fixes only. Replace the documentation with xloader along with a footnote to see earlier versions of the documentation for help with legacy datapusher installs.
@TkTech, xloader's genesis was this discussion 3 years ago :)
https://github.com/ckan/ideas-and-roadmap/issues/150
And I'm all for expanding the data dictionary too, to help not just with xloading, but also as a way to get descriptive statistics metadata as well.
https://github.com/ckan/ideas-and-roadmap/issues/196
But instead of using csvkit to compile the descriptive statistics, leverage the `pg_statistic` statistics compiled by Postgres after an `ANALYZE`.
Regarding the statistics, I'm all for it, but rather than using csvkit and an asynchronous job it should just be part of the loading process. Once the data is in Postgres we can easily get all of that information without having to use something like csvkit.
- No additional read of the data via csvkit
- No additional dependency
- Since it's SQL on the post-loaded data it works regardless of the original import type (csv, excel, JSONL, etc...)
- A lot of these statistics are already known by the Postgres planner and available in the statistics table (most common values, min, max, distinct count, etc.)
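As a sketch of what that could look like (the connection string and resource ID are placeholders; `pg_stats` is the standard Postgres view over `pg_statistic`):

```python
import psycopg2

RESOURCE_ID = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"  # datastore table name

conn = psycopg2.connect("dbname=datastore_default user=ckan")  # placeholder DSN
with conn, conn.cursor() as cur:
    # Refresh planner statistics for this table (ANALYZE is allowed inside
    # a transaction, unlike VACUUM).
    cur.execute('ANALYZE "{}"'.format(RESOURCE_ID))
    cur.execute(
        """
        SELECT attname, null_frac, n_distinct, most_common_vals
        FROM pg_stats
        WHERE tablename = %s
        """,
        (RESOURCE_ID,),
    )
    for column, null_frac, n_distinct, common in cur.fetchall():
        print(column, null_frac, n_distinct, common)
```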
@TkTech LOL! Our comments re using Postgres statistics crossed paths in the internets :)
I like the idea of simply including the data dictionary workflow as an obvious next step in the resource addition process. (Create dataset -> add resource -> define dictionary)
Rather than loading just a few rows, or maybe in addition to that, we could try and surface data validation exceptions on the client before upload. For example, a user uploads a 100MB+ csv, and we gradually validate the columns to surface any data that wouldn't work given the user's selected datatype.
> Rather than loading just a few rows, or maybe in addition to that, we could try and surface data validation exceptions on the client before upload. For example, a user uploads a 100MB+ csv, and we gradually validate the columns to surface any data that wouldn't work given the user's selected datatype.
I'm fine with this as long as it's 100% client side and running transparently in the background, but I'd give it a pretty low priority given restricted time. Reading the first few rows is just to populate the initial column names and type guesses on the form.
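A rough illustration of that gradual validation, written server-side in Python for brevity (the schema, file, and parser table are hypothetical; a real implementation would stream chunks client-side):

```python
import csv

# Map a form-selected type name to a Python parser; str never fails.
PARSERS = {"integer": int, "number": float, "text": str}

def validate(path, schema, max_errors=10):
    """Return up to max_errors (line, column, value) tuples that fail."""
    errors = []
    with open(path, newline="") as f:
        # Line 1 is the header, so data rows start at line 2.
        for lineno, row in enumerate(csv.DictReader(f), start=2):
            for column, type_name in schema.items():
                try:
                    PARSERS[type_name](row[column])
                except (ValueError, TypeError):
                    errors.append((lineno, column, row[column]))
                    if len(errors) >= max_errors:
                        return errors
    return errors

# e.g. validate("usage.csv", {"province": "text", "kwh": "integer"})
```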
Thanks all for this very useful discussion. There's lots to unpack here so I'll summarize my views separately.
- Mention Express Loader in docs: absolutely. We should reword the DataPusher section to "Automatically getting data in the DataStore" and describe EL there (with or without mentioning DataPusher too, see next point).
- Dropping DataPusher and replacing it with EL: I'm happy with this if the rest of the team is happy. I suffered DataPusher issues like everyone, so if EL provides a more robust experience then we should embrace it. But one thing I'd say for DataPusher is this: many, many people use it (basically because there was no other option), and I'd venture to say that for all its flaws it works in a lot of cases (I don't have data to sustain this other than my own experience). My point is that EL should match this. As I understand it, the main caveats of EL right now are:
  - No type guessing. I don't think it's a problem, as I agree with others that we should move towards explicitly defined data schemas (or data dictionaries). More on this later. The only problem I can think of is whether having all columns defined as text by default affects the visualizations. Without having actually tried it, I'd imagine that e.g. charts wouldn't work correctly, and I'm not sure about maps (lat/lon columns). We should review the default behaviour of the built-in visualizations when EL is activated (this must have been assessed in the OpenGov instances using it).
  - No support for private datasets. I got that from the EL README, not sure if it's still relevant. That could be a blocker for adoption to me. @davidread, is this something that can be overcome?
  - Performance issues. Looking at the issue I understand that they can be worked around.

Again, these are not listed to argue against EL adoption, just as points that need to be clarified.
- Moving EL to the CKAN org: yes, that makes perfect sense. But I'm going to be completely honest: if EL is going to be the default way of getting data into the DataStore, why not make it part of the `datastore` plugin (with `ckan.datastore.automatic_upload` defaulting to `True`)? It's clear that 99% of people using the DataStore (wild guess here, but >80% of CKANs?) want data to be automatically uploaded, so shouldn't this functionality be baked into the datastore itself? I know we have the argument of easier iterations and releases when living in a separate extension, but we also have the counter-argument of less attention from maintainers, as is the case with DataPusher now. What do people (especially @davidread) think?
- Defining the Data Schema before uploading (can you tell that I prefer the term Schema instead of Dictionary? ;)): this is absolutely the way to go, and one we have been working on heavily in ckanext-validation. We have a working prototype widget that infers the fields client-side and presents an interface to publishers so they can tweak the field types directly in the form (see screenshot below). This stores the information as a Table Schema object in the resource (a sketch below), which could then be used by EL (or DataPusher) to create the relevant field types. It only needs some UX and UI polishing, which could be fitted into future wider work around the dataset form. IMO, given all the work done, this is the best path forward for extending the current data dictionary support and having the best of both worlds (proper field types and EL). Would love to see if there is interest in getting this into core (and resources!)
The current prototype can be tested live at https://demo-validation.l3.ckan.io/, PM me for a user / password.
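For reference, here's roughly what such a Table Schema object looks like, and how a loader might map it to datastore fields. The type mapping, field names, URL, and API key are assumptions on my part, based on Frictionless Table Schema and CKAN's `datastore_create` action:

```python
import requests

# A Frictionless Table Schema describing three illustrative fields.
table_schema = {
    "fields": [
        {"name": "date", "type": "date"},
        {"name": "price", "type": "number"},
        {"name": "currency", "type": "string"},
    ]
}

# Assumed, simplified mapping from Table Schema types to Postgres types.
TYPE_MAP = {"string": "text", "number": "numeric", "integer": "int", "date": "date"}

fields = [
    {"id": f["name"], "type": TYPE_MAP[f["type"]]}
    for f in table_schema["fields"]
]

# datastore_create accepts an explicit field list, so a loader could create
# properly typed columns before inserting the CSV data.
requests.post(
    "https://ckan.example.org/api/3/action/datastore_create",  # placeholder URL
    json={"resource_id": "my-resource-id", "fields": fields, "force": True},
    headers={"Authorization": "my-api-key"},  # placeholder API key
)
```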
Hi @amercader. On Data Schema: that'd be excellent! Though switching over might be non-trivial, given all the work that was done to store Dictionary values in Postgres comments.
I've been tracking the Frictionless Data project and we've been planning on installing ckanext-datapackager as a standard extension especially since we've announced support for Fiscal Data Package and Data Packages in general. Can ckanext-validation and ckanext-datapackager coexist?
+1M to data schemas before uploading data
open.canada.ca has been doing something like this internally with ckanext-recombinant for a couple years now, and we have departments lined up to use it because the users love it and the data quality is awesome.
Recombinant benefits:
- controlled lists (single and multiple select) are supported
- schemas are applied across organizations so we can easily merge and publish data from all orgs
- schemas are managed by a developer who can help choose the best primary keys, remove redundant information, link to existing controlled lists and migrate old data as required
- custom validation is possible (e.g. field B required only when field A is a specific value -- this is a very common requirement)
- data is validated first in a provided Excel template so users can immediately see their errors before uploading
- validation is enforced by the datastore database using triggers (very fast!) and whole update is rolled back on errors (very safe!)
- data is updated on an incremental basis and old data is kept in datastore tables
- triggers are used to record which user updated every row, and when every row was created and updated
Recombinant drawbacks:
- schemas are applied across organizations, not ad-hoc for individual datasets
- schemas need to be written by a developer (bottleneck!)
- custom validation needs to be written by a developer (in Excel and pl/pgsql functions)
- migrations are required on some schema changes, written by a developer
- updating data requires copy-and-paste from the datatable preview, or keeping old templates around, which can cause conflicts and re-uploading of old data
Data schemas might benefit from some of the lessons we've learned from recombinant, e.g.
- Thought about providing templates based on the schema already populated with validation rules? Users would love it (immediate error reporting), and the code is mostly there in recombinant.
- What about incremental updates instead of reloading all the data all the time? It's been good for recombinant and enabled regular reporting on all the data types we collect. Too difficult?
- Separate the data schema from the dataset so that the people allowed to upload data aren't the same ones designing the schema? This is a common, and pretty darn useful requirement.
- Custom validation? Extend JSON Schema to cover enough of the missing cases?
- Enforced validation? We can prevent bad data with triggers instead of reporting on it afterwards. That's good for some users and some types of data.
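To illustrate the trigger idea (this is generic Postgres, not recombinant's actual code; the table, column, and connection string are made up):

```python
import psycopg2

# Made-up table/column: reject negative or missing kwh values at the
# database level, so a bad row aborts (and rolls back) the whole load.
DDL = """
CREATE OR REPLACE FUNCTION check_kwh() RETURNS trigger AS $$
BEGIN
    IF NEW.kwh IS NULL OR NEW.kwh < 0 THEN
        RAISE EXCEPTION 'kwh must be a non-negative number, got %', NEW.kwh;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER validate_kwh
    BEFORE INSERT OR UPDATE ON energy_usage
    FOR EACH ROW EXECUTE PROCEDURE check_kwh();
"""

conn = psycopg2.connect("dbname=datastore_default user=ckan")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(DDL)
```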
This got lost in my long-winded comment: ckanext-validation and the approach @amercader describes look awesome and super useful without any changes!
Wait, is this an issue about xloader?
Yes, I think we should recommend xloader over datapusher :-)
Covering off the caveats that @amercader raises:
i. No type guessing potentially affecting visualizations.
There doesn't seem to be a problem with it interpreting the text as numbers:
[screenshot]
However there is an issue with dates (e.g. '2018-01-25') which end up just as a year, distorting the gold prices example:
[screenshot]
So this seems to be a concern for datasets that are time series, like this one.
ii. Private dataset support.
I've now tested this and it works fine with private datasets. 👍
iii. Performance issues.
> Looking at the issue I understand that they can be worked around.

Indeed. I'll look into this more now, though.
I've finished the work on CKAN performance that was the key concern on this ticket.
Yes, there is the concern about Recline not graphing dates until the Data Dictionary is edited to say they are dates. But I reckon we can live with that for the improved loading. (And btw, I'm fine with renaming Data Dictionary to Data Schema - I think the latter is more commonly used.)
I'll do a PR to change the recommendation in the docs from DataPusher to Express Loader.
I'll also look at moving Express Loader into the Datastore extension (i.e. in the core ckan repo), as @amercader suggests. I'll ensure there is an option to disable it, so that datapusher or some other loader can still be used.
I have updated the DataStore extension documentation to mention Express Loader in PR https://github.com/ckan/ckan/pull/6972. Do we need to completely obsolete the DataPusher?
https://github.com/ckan/ckan/issues/8044