Assess the quality of open data in an open data portal

Stephen-Gates opened this issue on Mar 12 '16 • 14 comments

Create a tool to assess the quality of open data in an open data portal: a challenge by ODI Queensland.

Build on prior R work:

Leverage existing validation tools:

Apply standards, best practices or quality measures:

Assess an open data portal or two:

Use any or none of these suggestions to provide insights about the quality of open data and how it is published.

Help open data publishers improve so the data they publish can be used to deliver ongoing value.

Thinking about taking the challenge? Got questions? Reply below and we'll do our best to answer.

Stephen-Gates, Mar 12 '16 01:03

Perhaps a web tool where you drop your file and it tells you what is needed to comply with standards?

RMHogervorst, Mar 14 '16 11:03

@RMHogervorst that is essentially what GoodTables does:

  • http://goodtables.okfnlabs.org

It is also available as a CLI or a Python library:

  • https://github.com/okfn/goodtables

And, we are currently finishing off our Data Quality Dashboards, which could be used (they pretty much meet the challenge already :)):

  • https://github.com/okfn/data-quality-dashboard
  • https://github.com/okfn/data-quality-cli

Example data for quality assessment:

  • https://github.com/okfn/data-quality-uk-25k-spend

We are currently working on the feature/refactor branch of all these data-quality-* codebases, and will welcome contributions and questions in around a week.

pwalsh, Mar 14 '16 11:03

Oh great! That is very useful.

RMHogervorst, Mar 14 '16 11:03

Thanks @pwalsh, great to see you here. I'll check out the data quality dashboard. Hi @RMHogervorst, my motivation is to return some stats to a portal owner, and the data publishers that use it, to illustrate quality. My experience is that some data is poorly published and not reliably refreshed. I'd like to quantify that, show the publishers and encourage some corrective actions.

Stephen-Gates, Mar 14 '16 12:03

Thanks for this suggestion @Stephen-Gates. Despite work already done in this area, I think there is still some scope for R tools. Ideas that come to mind:

  • A tool which could be a combination of testdat and @tierneyn's visdat that tests data for compliance with open data standards and visualises where departures occur within the data frame.
  • A wrapper for write.csv() that tests data and writes it to ODI standards (see the sketch below).
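
A minimal sketch of what that wrapper could look like; write_checked_csv and the specific checks are hypothetical placeholders, not an implementation of any actual ODI standard:

```r
# Hypothetical sketch: run a few basic checks on a data frame before
# writing it as CSV. The checks are illustrative stand-ins only.
write_checked_csv <- function(df, file, ...) {
  stopifnot(is.data.frame(df))
  # Column names should be unique and syntactically valid.
  fixed <- make.names(names(df), unique = TRUE)
  if (!identical(names(df), fixed)) {
    warning("Non-unique or invalid column names: ",
            paste(names(df)[names(df) != fixed], collapse = ", "))
  }
  # Flag columns that are entirely missing.
  empty <- vapply(df, function(x) all(is.na(x)), logical(1))
  if (any(empty)) {
    warning("Entirely empty columns: ",
            paste(names(df)[empty], collapse = ", "))
  }
  # No row names and UTF-8 are sensible defaults for portal-bound CSVs.
  write.csv(df, file, row.names = FALSE, fileEncoding = "UTF-8", ...)
}
```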

MilesMcBain, Mar 15 '16 01:03

Speaking as someone who has both a) worked at a data portal and b) published my own data, I agree with your aims to ensure quality. However, I hope that this will be a conscientiously constructive and collegial process, rather than one that could quite easily (i.e. without meaning to) become a bit embarrassing for people whose publishing systems/publications are deemed 'poor quality'. We don't want to disincentivise and shame those who are essentially publishing data altruistically at a time when there is no real incentive to do so.

It will also be important to define what is considered 'good quality', e.g. some non-tidy data are well suited to their purpose, as pointed out by Jeff Leek: http://simplystatistics.org/2016/02/17/non-tidy-data/

ivanhanigan, Mar 15 '16 02:03

@MilesMcBain I looked at visdat and was totally excited to see it was inspired by CSV Fingerprints. I think your suggestion would be a wicked combo.

Stephen-Gates, Mar 15 '16 02:03

@ivanhanigan Totally agree. This is not a name-and-shame exercise. I have spoken with some portal owners and data publishers and they're keen to understand how to improve and demonstrate that they are improving over time. So perhaps a tool that graphs progress over time would be useful?

Re: what is good quality data?

My simple approach is "is it published as promised". E.g.

  • if you said you'd release it monthly, it should be
  • if you said find it here, it should be there
  • if you said it's a CSV, it should be
  • if you said column 2 is a date, it should be.

I'm sure there are more scientific definitions of data quality... feel free to use those also.
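
A rough sketch of automating those "published as promised" checks in R, assuming httr is available; the ISO date format, treating column 2 as the date column, and the example URL are all assumptions for illustration:

```r
# Rough sketch: check one published resource against its "promises".
library(httr)

check_resource <- function(url, date_column = 2, date_format = "%Y-%m-%d") {
  resp <- GET(url)
  reachable <- status_code(resp) == 200   # "if you said find it here, it should be there"
  ct <- headers(resp)[["content-type"]]
  is_csv <- !is.null(ct) && grepl("csv", ct, ignore.case = TRUE)  # "it should be a CSV"
  dates_ok <- NA
  if (reachable && is_csv) {
    df <- read.csv(text = content(resp, as = "text", encoding = "UTF-8"))
    # "if you said column 2 is a date, it should be" -- NA means it failed to parse
    dates_ok <- !any(is.na(as.Date(df[[date_column]], format = date_format)))
  }
  list(reachable = reachable, is_csv = is_csv, dates_ok = dates_ok)
}

# check_resource("https://example.org/some-resource.csv")  # hypothetical URL
```

The "released monthly" promise can't be checked from the file alone; that needs catalogue metadata such as a last-modified date, which CKAN portals expose through their API (see below).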

Stephen-Gates, Mar 15 '16 02:03

So these are actually two different use cases: checking the metadata and checking the data itself.

RMHogervorst, Mar 15 '16 08:03

@RMHogervorst that's correct but the challenge is totally flexible. Focus on what helps you, the community, and data publishers - or something else entirely ;-)

Stephen-Gates, Mar 15 '16 08:03

@Stephen-Gates I suggest these use cases have many dimensions. I'd like to specify the aim more before exploring the possibilities. In particular, I note the different 'quality benchmarks' applicable to data portals run for government departments vs portals run for scientists. The former might be replete with administrators who rank data curation high in their work priorities, while the latter may be cobbled together by scientists eschewing the compulsion to compete and instead opting for open science, or alternatively reacting to funders'/journals' requirements to publish supporting information and data with papers. The expectations you might have for quality metadata/data in the former might well be a lot higher than for the latter (and this would be justifiable, given the lack of resourcing funders/universities give scientists to engage in data publishing activities).

Another dimension that is not clear in this thread is the spectrum between open data and mediated data. Often mediated data is easily available, with portals simply requiring user registration so they can collect download statistics and analyse usage by demographic group, or to meet data depositors' requests to be made aware of proposed re-use so that they can keep in contact and provide collegial support for downstream users of their data. These data are not technically open, but in practice they are essentially open. I suspect quality may differ between purely open and mediated-but-easy-to-get-at data portals, and this might be worth thinking about too.

My 2cents.

ivanhanigan, Mar 15 '16 21:03

@ivanhanigan Great points. I think governments are equally resource-constrained when it comes to publishing open data, and the variation in quality will be equally diverse. I understand that many research data portals are not technically open and may not present an API to the catalogue. So if anyone is considering the challenge, I'd suggest using a government CKAN portal that presents an open API. You could explore data.gov.au or data.qld.gov.au (see http://docs.ckan.org/en/latest/api/index.html), as in the sketch below.
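
For anyone starting out, a minimal sketch of pulling catalogue metadata from a CKAN portal in R. It assumes jsonlite and the standard CKAN action API endpoints documented at the link above; the format tally at the end is just one idea for a first quality summary:

```r
# Minimal sketch: query a CKAN portal's catalogue via the action API.
library(jsonlite)

portal <- "https://data.gov.au"  # or "https://data.qld.gov.au"
resp <- fromJSON(paste0(portal, "/api/3/action/package_search?rows=100"))
stopifnot(isTRUE(resp$success))

datasets <- resp$result$results
# Each dataset carries a list of resources; tally their declared formats
# as a first, crude look at what the portal claims to publish.
formats <- unlist(lapply(datasets$resources, function(r) r$format))
sort(table(toupper(formats)), decreasing = TRUE)
```

From there, each resource's URL and declared format could be fed into checks like the ones sketched earlier in the thread.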

Stephen-Gates, Mar 15 '16 22:03

More food for thought:

Stephen-Gates, Mar 16 '16 13:03

To me, thinking about data science goes together with assessing quality. I think this collection of data science links and this list of public datasets are relevant to this topic.

cofiem, Apr 18 '16 11:04