Assess the quality of open data in an open data portal

Stephen-Gates opened this issue on Mar 12 '16 • 14 comments

Create a tool to assess the quality of open data in an open data portal: a challenge by ODI Queensland.

Build on prior R work:

Leverage existing validation tools:

Apply standards, best practices or quality measures:

Assess an open data portal or two:

Use any or none of these suggestions to provide insights about the quality of open data and how it is published.

Help open data publishers improve so the data they publish can be used to deliver ongoing value.

Thinking about taking the challenge? Got questions? Reply below and we'll do our best to answer.

Stephen-Gates, Mar 12 '16 01:03

Perhaps a web tool where you drop your file and it tells you what is needed to comply with standards?

RMHogervorst, Mar 14 '16 11:03

@RMHogervorst that is essentially what GoodTables does:

  • http://goodtables.okfnlabs.org

It is also available as a CLI or a Python library:

  • https://github.com/okfn/goodtables

And, we are currently finishing off our Data Quality Dashboards, which could be used (they pretty much meet the challenge already :)):

  • https://github.com/okfn/data-quality-dashboard
  • https://github.com/okfn/data-quality-cli

Example data for quality assessment:

  • https://github.com/okfn/data-quality-uk-25k-spend

We are currently working on the feature/refactor branch of all these data-quality-* codebases, and will welcome contributions and questions in around a week.

pwalsh, Mar 14 '16 11:03

Oh great! That is very useful.

RMHogervorst, Mar 14 '16 11:03

Thanks @pwalsh, great to see you here. I'll check out the data quality dashboard. Hi @RMHogervorst, my motivation is to return some stats to a portal owner, and the data publishers that use it, to illustrate quality. My experience is that some data is poorly published and not reliably refreshed. I'd like to quantify that, show the publishers and encourage some corrective actions.

Stephen-Gates, Mar 14 '16 12:03

Thanks for this suggestion @Stephen-Gates. Despite work already done in this area, I think there is still some scope for R tools. Ideas that come to mind:

  • A tool which could be a combination of testdat and @tierneyn's visdat that tests data for compliance with open data standards and visualises where departures occur within the data frame.
  • A wrapper for write.csv() that tests data and writes it to ODI standards (see the sketch below).
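
A minimal sketch of what that wrapper could look like; write_checked_csv and the specific checks are hypothetical placeholders, not an implementation of any actual ODI standard:

```r
# Hypothetical sketch: run a few basic checks on a data frame before
# writing it as CSV. The checks are illustrative stand-ins only.
write_checked_csv <- function(df, file, ...) {
  stopifnot(is.data.frame(df))
  # Column names should be unique and syntactically valid.
  fixed <- make.names(names(df), unique = TRUE)
  if (!identical(names(df), fixed)) {
    warning("Non-unique or invalid column names: ",
            paste(names(df)[names(df) != fixed], collapse = ", "))
  }
  # Flag columns that are entirely missing.
  empty <- vapply(df, function(x) all(is.na(x)), logical(1))
  if (any(empty)) {
    warning("Entirely empty columns: ",
            paste(names(df)[empty], collapse = ", "))
  }
  # No row names and UTF-8 are sensible defaults for portal-bound CSVs.
  write.csv(df, file, row.names = FALSE, fileEncoding = "UTF-8", ...)
}
```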

MilesMcBain, Mar 15 '16 01:03

Speaking as someone who has both a) worked at a data portal and b) published my own data, I agree with your aims to ensure quality. However, I hope that this will be a conscientiously constructive and collegial process, rather than one that could quite easily (i.e. without meaning to) become a bit embarrassing for people whose publishing systems/publications are deemed 'poor quality'. We don't want to disincentivise and shame those who are essentially publishing data altruistically at a time when there is no real incentive to do so.

It will also be important to define what is considered 'good quality', e.g. some non-tidy data are well suited to their purpose, as pointed out by Jeff Leek: http://simplystatistics.org/2016/02/17/non-tidy-data/

ivanhanigan, Mar 15 '16 02:03

@MilesMcBain I looked at visdat and was totally excited to see it was inspired by CSV Fingerprints. I think your suggestion would be a wicked combo.

Stephen-Gates, Mar 15 '16 02:03

@ivanhanigan Totally agree. This is not a name-and-shame exercise. I have spoken with some portal owners and data publishers and they're keen to understand how to improve and demonstrate that they are improving over time. So perhaps a tool that graphs progress over time would be useful?

Re: what is good quality data?

My simple approach is "is it published as promised". E.g.

  • if you said you'd release it monthly, it should be
  • if you said find it here, it should be there
  • if you said it's a CSV, it should be
  • if you said column 2 is a date, it should be.

I'm sure there are more scientific definitions of data quality... feel free to use those also.
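
A rough sketch of automating those "published as promised" checks in R, assuming httr is available; the ISO date format, treating column 2 as the date column, and the example URL are all assumptions for illustration:

```r
# Rough sketch: check one published resource against its "promises".
library(httr)

check_resource <- function(url, date_column = 2, date_format = "%Y-%m-%d") {
  resp <- GET(url)
  reachable <- status_code(resp) == 200   # "if you said find it here, it should be there"
  ct <- headers(resp)[["content-type"]]
  is_csv <- !is.null(ct) && grepl("csv", ct, ignore.case = TRUE)  # "it should be a CSV"
  dates_ok <- NA
  if (reachable && is_csv) {
    df <- read.csv(text = content(resp, as = "text", encoding = "UTF-8"))
    # "if you said column 2 is a date, it should be" -- NA means it failed to parse
    dates_ok <- !any(is.na(as.Date(df[[date_column]], format = date_format)))
  }
  list(reachable = reachable, is_csv = is_csv, dates_ok = dates_ok)
}

# check_resource("https://example.org/some-resource.csv")  # hypothetical URL
```

The "released monthly" promise can't be checked from the file alone; that needs catalogue metadata such as a last-modified date, which CKAN portals expose through their API (see below).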

Stephen-Gates, Mar 15 '16 02:03

So these are actually two different use cases: checking the metadata and checking the data itself.

RMHogervorst, Mar 15 '16 08:03

@RMHogervorst that's correct but the challenge is totally flexible. Focus on what helps you, the community, and data publishers - or something else entirely ;-)

Stephen-Gates, Mar 15 '16 08:03

@Stephen-Gates I suggest these use cases have many dimensions. I'd like to specify the aim more before exploring the possibilities. In particular, I note the different 'quality benchmarks' applicable to data portals run for government departments vs portals run for scientists. The former might be replete with administrators who rank data curation high in their work priorities, while the latter may be cobbled together by scientists eschewing the compulsion to compete and instead opting for open science, or alternatively reacting to funders'/journals' requirements to publish supporting information and data with papers. The expectations you might have for quality metadata/data in the former might well be a lot higher than for the latter (and this would be justifiable, given the lack of resourcing funders/universities give scientists to engage in data publishing activities).

Another dimension that is not clear in this thread is the spectrum between open data and mediated data. Often mediated data is easily available, with portals simply requiring user registration so they can collect download statistics and analyse usage by demographic group, or to meet data depositors' requests to be made aware of proposed re-use so that they can keep in contact and provide collegial support for downstream users of their data. These data are not technically open, but in practice they are essentially open. I suspect quality may differ between purely open and mediated-but-easy-to-get-at data portals, and this might be worth thinking about too.

My 2cents.

ivanhanigan, Mar 15 '16 21:03

@ivanhanigan Great points. I think governments are equally resource-constrained when it comes to publishing open data, and the variation in quality will be equally diverse. I understand that many research data portals are not technically open and may not present an API to the catalogue. So if anyone is considering the challenge, I'd suggest using a government CKAN portal that presents an open API. You could explore data.gov.au or data.qld.gov.au (see http://docs.ckan.org/en/latest/api/index.html), as in the sketch below.
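
For anyone starting out, a minimal sketch of pulling catalogue metadata from a CKAN portal in R. It assumes jsonlite and the standard CKAN action API endpoints documented at the link above; the format tally at the end is just one idea for a first quality summary:

```r
# Minimal sketch: query a CKAN portal's catalogue via the action API.
library(jsonlite)

portal <- "https://data.gov.au"  # or "https://data.qld.gov.au"
resp <- fromJSON(paste0(portal, "/api/3/action/package_search?rows=100"))
stopifnot(isTRUE(resp$success))

datasets <- resp$result$results
# Each dataset carries a list of resources; tally their declared formats
# as a first, crude look at what the portal claims to publish.
formats <- unlist(lapply(datasets$resources, function(r) r$format))
sort(table(toupper(formats)), decreasing = TRUE)
```

From there, each resource's URL and declared format could be fed into checks like the ones sketched earlier in the thread.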

Stephen-Gates, Mar 15 '16 22:03

More food for thought:

Stephen-Gates, Mar 16 '16 13:03

To me, thinking about data science goes together with assessing quality. I think this collection of data science links and this list of public datasets are relevant to this topic.

cofiem, Apr 18 '16 11:04