opendatasurvey

Add data quality indicators to the UI

Open pwalsh opened this issue 7 years ago • 10 comments

As a {Product Owner}, I want to integrate quality assessments in the pages of a given dataset, as applicable, so I can start to highlight quality as an important dimension of analysis for the future.

  • Example: For a tabular dataset, run GoodTables over the dataset and generate a report that shows the quality of the dataset, even if this quality is not yet part of the ranking mechanism.
  • Example: for any dataset, ping the URLs at scheduled intervals (monthly?) to check the data is still available.
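A minimal sketch of the availability check from the second example, using only the Python standard library. The function name `check_url`, the injectable `opener` parameter, and the shape of the returned record are assumptions made for illustration; the scheduling itself (for example a monthly cron job) is left out.

```python
from datetime import datetime, timezone
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def check_url(url, opener=urlopen, timeout=10):
    """Send a HEAD request and report whether the URL still responds.

    `opener` is injectable so the check can be exercised without network
    access; by default it is urllib's urlopen.
    """
    checked_at = datetime.now(timezone.utc).isoformat()
    request = Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            status = response.status
    except HTTPError as error:
        # The server answered, but with an error status (404, 500, ...).
        status = error.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return {"url": url, "available": False, "status": None,
                "last_checked": checked_at}
    return {"url": url, "available": 200 <= status < 400, "status": status,
            "last_checked": checked_at}
```

A scheduled job could call `check_url` for every dataset URL and store the record, which would be enough to back a "Last seen: DATE" style indicator. Some servers reject HEAD requests, so falling back to GET may be needed in practice.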

pwalsh avatar Jan 04 '17 13:01 pwalsh

We should have a call with @pwalsh and @brew about it

morchickit avatar Jan 06 '17 16:01 morchickit

Depending on how you define quality in this context and how you decide to display it, you might find Data Quality CLI useful, so I'll drop a few words about it here. Data Quality CLI uses GoodTables to assess the quality of a data package and returns a quality score (effectively a percentage, since it ranges from 0 to 100) based on structure errors, schema errors and timeliness. As you will see in the README, it's very configurable and it already integrates with CKAN instances. The point of Data Quality CLI was to generate the data for Data Quality Dashboard, which displays the results of the quality analysis (the scores). You can see here the quality dashboard for Northern Ireland CKAN. The dashboard has an /embed route, so you can include only the relevant bits in your page. The main issue with the Data Quality duo is that they haven't been brought up to date with the latest GoodTables API, so they would really benefit from contributions if you choose to use them. If you have further questions about this, feel free to ask me. Good luck! ✌️

georgiana-b avatar Apr 06 '17 13:04 georgiana-b

@morchickit We might also want @smth to look at this: how to associate each source location with a data quality 'badge' (or whatever is used to represent data quality).

brew avatar Apr 07 '17 11:04 brew

@georgiana-b - great idea. My only problem is that not all the links to the datasets lead directly to the files. How should we deal with that?

morchickit avatar Apr 10 '17 08:04 morchickit

@morchickit Can you give me an example? I'm trying to understand what "dataset" means in this context.

georgiana-b avatar Apr 18 '17 15:04 georgiana-b

Given that this was added to the mockups last year (http://okfnlabs.org/index-mockup/entry/), what, if anything, is required of me here?

smth avatar Apr 19 '17 07:04 smth

We thought a tooltip would be a good way to explain what GoodTables does/means. We were wondering whether a link to the GoodTables website or a short 1-2 sentence description would help explain what things like "GoodTables: Valid" or "Last seen: DATE" mean.

At the moment I have the impression this is not self-explanatory. What do you think, @smth?

dannylammerhirt avatar Apr 19 '17 09:04 dannylammerhirt

I would agree they are not self-explanatory (though nothing here really is). I think these badges should be clickable, and link to some sort of (external) GoodTables page.

smth avatar Apr 19 '17 09:04 smth

@georgiana-b - See http://professionnels.ign.fr/geofla - this page was linked to the index and describes the dataset, but the data is actually accessed from this URL: http://professionnels.ign.fr/geofla#tab-3

Or this page from Israel, which links to other pages that in turn link to the dataset - https://foi.gov.il/he/search/site/?f%5B0%5D=im_field_mmdsubjects%3A367

morchickit avatar Apr 19 '17 09:04 morchickit

So, after seeing the dataset examples and the mockup for this I have the following observations:

You have to discuss the conditions necessary to get that "Valid" badge. Since a dataset is made of several data files, it's likely that some of its files will be valid and some will not. How valid a dataset is depends on how valid each of its constituent files is. For example, in Data Quality Dashboard for UK spend, because we consider a valid file to be 100% correct, there are 0 valid files even though many come close, and the average correctness score is 46%.
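To make the trade-off concrete, here is a rough sketch of rolling per-file scores up to a dataset-level summary. The function name `summarize_dataset`, the `valid_threshold` parameter and the sample scores are all hypothetical; the scores are made-up numbers, not results from any dashboard.

```python
def summarize_dataset(file_scores, valid_threshold=100):
    """Roll per-file quality scores (0 to 100) up to the dataset level.

    A file counts as valid only if its score reaches `valid_threshold`;
    the default of 100 mirrors the strict "100% correct" rule.
    """
    if not file_scores:
        raise ValueError("a dataset needs at least one scored file")
    valid_files = sum(1 for score in file_scores if score >= valid_threshold)
    return {
        "files": len(file_scores),
        "valid_files": valid_files,
        "all_valid": valid_files == len(file_scores),
        "average_score": round(sum(file_scores) / len(file_scores), 1),
    }


# Hypothetical scores for a dataset made of four files.
strict = summarize_dataset([99, 95, 60, 30])
lenient = summarize_dataset([99, 95, 60, 30], valid_threshold=90)
```

With the strict rule no file in this sample is valid, while a threshold of 90 would count two of the four. Whichever rule drives the "Valid" badge, showing the average score next to it would keep near misses visible.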

Whether you use Data Quality CLI or GoodTables directly, you have to transform those links into a standardized version of a dataset, i.e. a DataPackage. If you just send http://professionnels.ign.fr/geofla#tab-3 or http://professionnels.ign.fr/geofla to GoodTables, it will interpret it as an HTML page and thus an invalid file. To get this automatic quality analysis, somebody will have to make a datapackage for each dataset, possibly using DQ-CLI's init & generate commands. GoodTables can assess datapackages if you use the datapackage preset.
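As a rough illustration of that transformation, the sketch below builds a minimal Data Package descriptor by hand from a list of links and sets aside anything that does not look like a direct file link. This is not what DQ-CLI's init & generate commands do internally; `build_datapackage`, `TABULAR_EXTENSIONS` and the example.org URL are assumptions for illustration, and judging by file extension is a deliberately crude heuristic.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: only these extensions are treated as direct tabular files.
TABULAR_EXTENSIONS = {".csv", ".xls", ".xlsx"}


def build_datapackage(name, file_urls):
    """Return (descriptor, skipped_urls) for a list of candidate links.

    Links whose path has no tabular extension (landing pages, search
    results) are skipped, because they would be read as HTML.
    """
    resources, skipped = [], []
    for url in file_urls:
        path = PurePosixPath(urlparse(url).path)
        suffix = path.suffix.lower()
        if suffix not in TABULAR_EXTENSIONS:
            skipped.append(url)
            continue
        resources.append({
            "name": path.stem.lower(),
            "path": url,
            "format": suffix.lstrip("."),
        })
    return {"name": name, "resources": resources}, skipped


descriptor, skipped = build_datapackage("geofla", [
    "http://professionnels.ign.fr/geofla#tab-3",  # landing page from the thread
    "http://example.org/files/Communes.csv",      # hypothetical direct link
])
datapackage_json = json.dumps(descriptor, indent=2)
```

The JSON string could be saved as datapackage.json and passed to GoodTables with the datapackage preset, while the skipped links are the ones somebody would still have to resolve to actual files by hand.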

georgiana-b avatar Apr 19 '17 16:04 georgiana-b