opendatasurvey
Add data quality indicators to the UI
As a {Product Owner}, I want to integrate quality assessments in the pages of a given dataset, as applicable, so I can start to highlight quality as an important dimension of analysis for the future.
- Example: For a tabular dataset, run GoodTables over the dataset and generate a report that shows the quality of the dataset, even if this quality is not yet part of the ranking mechanism.
- Example: for any dataset, ping the URLs at scheduled intervals (monthly?) to check the data is still available.
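A minimal sketch of the availability check from the second example, assuming a plain HTTP HEAD request is enough to tell whether a source is still reachable. The function name `check_url`, the injectable `opener` parameter and the shape of the returned dict are illustrative assumptions, not part of opendatasurvey; scheduling (e.g. a monthly cron job) is left out.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone


def check_url(url, opener=urllib.request.urlopen, timeout=10):
    """Send a HEAD request and report whether the URL answered successfully.

    `opener` is injectable (hypothetical design choice) so the check can be
    exercised without network access.
    """
    checked_at = datetime.now(timezone.utc).isoformat()
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (404, 500, ...).
        status = exc.code
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and similar.
        return {"url": url, "available": False, "status": None,
                "checked_at": checked_at, "error": str(exc)}
    return {"url": url, "available": 200 <= status < 400, "status": status,
            "checked_at": checked_at, "error": None}
```

The `checked_at` timestamp of the most recent successful check could feed something like a "Last seen" label. Some servers reject HEAD requests, so a real implementation might need to fall back to GET.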
We should have a call with @pwalsh and @brew about it
Depending on how you guys define quality in this context and how you decide to display it, you might find Data Quality CLI useful, so I'll say a few words about it here.
Data Quality CLI uses GoodTables to assert the quality of a data package and gives back a quality score (a.k.a. percent because it ranges from 0 to 100) based on structure errors, schema errors and timeliness. As you will see in the README, it's very configurable and it already has integration for CKAN instances.
The point of Data Quality CLI was to generate the data for Data Quality Dashboard, which displays the results of the quality analysis (the scores). You can see here the quality dashboard for Northern Ireland CKAN. The dashboard has an `/embed` route so you can include only the relevant bits in your page.
The main issue with the Data Quality Duo is that they haven't been brought up to date with the latest GoodTables API, so they would really benefit from contributions if you choose to use them.
If you have further questions about this, feel free to ask me. Good luck! ✌️
@morchickit We might also want @smth to look at this. The question is how to associate each source location with a data quality 'badge' (or whatever is used to represent data quality).
@georgiana-b - great idea. My only problem is that not all the links to the datasets lead directly to the files. How should we deal with that?
@morchickit Can you give me an example? I'm trying to understand what "dataset" means in this context.
Given that this was added to the mockups last year (http://okfnlabs.org/index-mockup/entry/), what, if anything, is required of me here?
We thought a tooltip would be a good way to explain what GoodTables does/means. We were wondering whether a link to the GoodTables website, or a short 1-2 sentence description, would help explain what things like "GoodTables: Valid" or "Last seen: DATE" mean.
At the moment I have the impression this is not self-explanatory. What do you think, @smth?
I would agree they are not self-explanatory (though nothing here really is). I think these badges should be clickable, and link to some sort of (external) GoodTables page.
@georgiana-b - See this: http://professionnels.ign.fr/geofla. This page was linked to the index and describes the dataset, but the data is actually accessed from this URL: http://professionnels.ign.fr/geofla#tab-3
Or this page from Israel, which links to other pages that in turn link to the dataset: https://foi.gov.il/he/search/site/?f%5B0%5D=im_field_mmdsubjects%3A367
So, after seeing the dataset examples and the mockup for this I have the following observations:
You have to discuss the conditions necessary to get that "Valid" badge. Since a dataset is made of several data files, it's probable that some of its files will be valid and some will not. How valid a dataset is depends on how valid each of its constituent files is. For example, in the Data Quality Dashboard for UK spend, we consider a file valid only if it is 100% correct, so there are 0 valid files even though many come close, and the average correctness score is 46%.
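To make the difference between "number of valid files" and "average score" concrete, here is a small sketch. The function `summarise_dataset`, its `valid_threshold` parameter and the scores used below are made up for illustration; they are not the Data Quality Dashboard's actual code or data.

```python
def summarise_dataset(file_scores, valid_threshold=100):
    """Aggregate per-file quality scores (0-100) into dataset-level figures.

    A file counts as valid only if its score reaches `valid_threshold`,
    which mirrors the strict "100% correct" rule described above.
    """
    if not file_scores:
        return {"files": 0, "valid_files": 0, "average_score": None}
    valid_files = sum(1 for score in file_scores if score >= valid_threshold)
    average_score = sum(file_scores) / len(file_scores)
    return {
        "files": len(file_scores),
        "valid_files": valid_files,
        "average_score": average_score,
    }


# Hypothetical scores: two files come close, none is fully valid.
summary = summarise_dataset([98, 95, 40])
print(summary["valid_files"], round(summary["average_score"], 1))
```

With a strict threshold the badge would say "invalid" for this dataset even though most files are nearly clean, so the badge condition (all files valid, a minimum average, or a per-file badge) is the thing to agree on first.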
Whether you use Data Quality CLI or GoodTables directly, you have to transform those links into a standardized version of a dataset i.e. a DataPackage.
If you just send http://professionnels.ign.fr/geofla#tab-3 or http://professionnels.ign.fr/geofla to GoodTables, it will interpret it as an HTML page and thus an invalid file.
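One way to avoid sending landing pages to GoodTables is to look at the response's `Content-Type` header first. This is only a sketch under assumptions: the function name `classify_source`, the category labels and the list of media types treated as tabular are illustrative choices, and servers do not always report content types accurately.

```python
# Media types treated as tabular data in this sketch (an assumed, partial list).
TABULAR_TYPES = {
    "text/csv",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def classify_source(content_type):
    """Classify a Content-Type header value as tabular data, a landing page, or unknown."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in TABULAR_TYPES:
        return "tabular"
    if media_type in HTML_TYPES:
        return "landing_page"
    return "unknown"
```

Sources classified as `landing_page` would still need someone to find the real file URLs by hand.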
To get this automatic quality analysis, somebody will have to make a datapackage for each dataset, possibly using DQ-CLI's `init` & `generate` commands. GoodTables can assess datapackages if you use the `datapackage` preset.
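For reference, a datapackage is essentially a `datapackage.json` descriptor listing the actual data files as resources. The sketch below builds a minimal one by hand; it does not use DQ-CLI, and the helper name `build_datapackage`, the package name and the file URL are hypothetical placeholders rather than real GEOFLA file locations.

```python
import json


def build_datapackage(name, file_urls):
    """Build a minimal Data Package descriptor pointing at direct file URLs."""
    resources = [
        {"name": f"{name}-file-{index}", "path": url}
        for index, url in enumerate(file_urls, start=1)
    ]
    return {"name": name, "resources": resources}


# Placeholder URL: the real direct links have to be collected from the landing page.
descriptor = build_datapackage("geofla", ["http://example.com/geofla/departements.csv"])
print(json.dumps(descriptor, indent=2))
```

The important part is that each resource `path` points directly at a data file, not at the HTML page that describes it.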