the_od_bods

Create licensing cheat sheet

Open KarenJewell opened this issue 3 years ago • 5 comments

Is your feature request related to a problem? Please describe. Suggested at SODU2022. Open data licensing is a minefield: licences often appear similar but differ in slight nuances. We would like to present a simple quick-reference guide to the different licensing options and what each might mean for someone wanting to use a dataset under that licence.

Describe the solution you'd like A table on a static page in Resources or About to provide a quick reference to the types of Open Data licences, what you can do with them, and what they typically apply to.

Describe alternatives you've considered There may already be a resource out in the interwebs that would be good to reference; I have not looked for one yet.

Additional context None.

KarenJewell avatar Nov 06 '22 19:11 KarenJewell

I was going to look at creating this.

However, if we look at the page Analytics -> Data Licensing we can see that the first and third categories are in fact the same and should be combined. That would take care of 1015 of 1584 datasets, or 64% of licences.

The second category is 'No licence', i.e. the publisher doesn't state a licence or we can't tell which licence is being claimed. That comprises 448 of 1584, or a further 28% of dataset licences.

The next category is 'Not open', which shouldn't be here. That's 45 of 1584, or almost 3%.

That just leaves a total of around 5% for 'other' licence types.
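As a quick check of the arithmetic above, here is a small sketch. It assumes every count is out of the same total of 1584 datasets and that the categories are exhaustive, so the "other" count is derived here rather than taken from the Analytics page.

```python
# Counts quoted in this comment; the category labels are shorthand.
total = 1584
counts = {
    "OGL v3 (first and third categories combined)": 1015,
    "No licence": 448,
    "Not open": 45,
}

# Percentage share of each category, rounded to the nearest whole percent.
shares = {name: round(100 * n / total) for name, n in counts.items()}

# Whatever is left over is assumed to be the 'other' licence types.
other_count = total - sum(counts.values())
other_share = round(100 * other_count / total)

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"Other licence types: {other_count} datasets, about {other_share}%")
```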

I had a search to see if there is an issue open to better process the licences (e.g. combining the two versions of OGL3) but can't see one. I'd suggest that we need that first, then look at this one.

Is there a way to pick a licence from a list such as the one above and see which datasets use it? This would be handy for the "not open" and the "no licence" ones in particular.
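One way to answer this, as a minimal sketch: if the dataset listing is available as a CSV, the rows can be grouped by licence string. The column names `Title` and `License` and the helper `datasets_by_licence` are assumptions for illustration, not confirmed against the real export.

```python
import csv
import io
from collections import defaultdict


def datasets_by_licence(csv_file, licence_col="License", title_col="Title"):
    """Group dataset titles by their licence string.

    The column names are hypothetical; adjust them to match the real export.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Treat a missing or blank licence as "No licence".
        licence = (row.get(licence_col) or "").strip() or "No licence"
        groups[licence].append(row.get(title_col, ""))
    return dict(groups)


# Tiny made-up sample standing in for the real listing file.
sample = io.StringIO(
    "Title,License\n"
    "Bins,Open Government Licence v3.0\n"
    "Parks,Other (Not Open)\n"
    "Schools,\n"
)
groups = datasets_by_licence(sample)
print(groups["Other (Not Open)"])
print(groups["No licence"])
```

With a real file, `open("listing.csv", newline="")` (file name hypothetical) would replace the `io.StringIO` sample.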

watty62 avatar Dec 08 '22 17:12 watty62

I would suggest having a look at the listing .csv/json output from the resources page - but licensing is a funny one because we process it about 6 times in the pipeline (which is mad)

Alternatively, the markdown files in jkan/datasets are more accurate.

KarenJewell avatar Dec 08 '22 18:12 KarenJewell

I had a look at a few files.

These are as much notes to myself as anything ...

It looks like all (?) of the licence fixing is being done in merge_data.py, in the tidy_licence function starting at line 636.

At lines 651 - 656 it looks like each version of OGL 3 should be converted to a single value, "Open Government Licence v3.0", by a dictionary in which the keys are the text as it can appear in the original and the values are the standardised text.

     "https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/": "Open Government Licence v3.0",
     "Open Government Licence 3.0 (United Kingdom)": "Open Government Licence v3.0",
     "UK Open Government Licence (OGL)": "Open Government Licence v3.0",
     "Open Government": "Open Government Licence v3.0",
     "uk-ogl": "Open Government Licence v3.0",
     "OGL3": "Open Government Licence v3.0",
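The lookup these entries feed can be sketched roughly as below. This is not the actual tidy_licence code from merge_data.py: the names `KNOWN_LICENCES` and `tidy_licence_sketch`, the "No licence" return for blank input, and the "Custom licence:" fallback prefix are all assumptions made for illustration.

```python
# Raw licence strings mapped to one standard name (entries copied from above).
KNOWN_LICENCES = {
    "https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/": "Open Government Licence v3.0",
    "Open Government Licence 3.0 (United Kingdom)": "Open Government Licence v3.0",
    "UK Open Government Licence (OGL)": "Open Government Licence v3.0",
    "Open Government": "Open Government Licence v3.0",
    "uk-ogl": "Open Government Licence v3.0",
    "OGL3": "Open Government Licence v3.0",
}


def tidy_licence_sketch(raw):
    """Return the standard name for a raw licence string, or flag it as custom."""
    key = raw.strip()
    if not key:
        # Assumed handling of blank values.
        return "No licence"
    # Exact-match lookup: a string that differs from every key by even one
    # character falls through to the custom branch.
    return KNOWN_LICENCES.get(key, f"Custom licence: {key}")


print(tidy_licence_sketch("OGL3"))
print(tidy_licence_sketch("http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/"))
```

Because the match is exact, a URL that differs only in its scheme from a listed key is not recognised by this sketch.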

For the moment I can't work out how the third-most-popular licence value (397 uses), the one with the full National Archives address, is getting through, nor where it originates.

I'll look at it again with fresh eyes.

watty62 avatar Dec 08 '22 20:12 watty62

A bunch of these come down to http vs https in the URLs we match against for known licences.

As a quick fix for that I've added the http version wherever we only had https. We could really do with cleaning up how we handle licences: turning URLs into names in merge_data.py and then back into URLs in export2jkan.py is a bit convoluted.
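An alternative to listing both schemes, sketched here under assumptions, is to normalise the scheme before the lookup so that one https key covers both forms. This is not the quick fix described above, and `normalise_licence_url`, `lookup_licence` and the one-entry `KNOWN` dictionary are hypothetical names for illustration.

```python
def normalise_licence_url(value):
    """Rewrite a leading http:// to https:// so both forms share one key."""
    value = value.strip()
    if value.lower().startswith("http://"):
        return "https://" + value[len("http://"):]
    return value


# One https key is enough once the input has been normalised.
KNOWN = {
    "https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/": "Open Government Licence v3.0",
}


def lookup_licence(value):
    """Return the standard licence name, or None if the value is unknown."""
    return KNOWN.get(normalise_licence_url(value))


print(lookup_licence("http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/"))
print(lookup_licence("other-closed"))
```

Non-URL strings such as "uk-ogl" pass through unchanged, so they can still be matched by their own keys.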

I was going to trigger a sync to check I hadn't broken anything but can't find the button to trigger the pipeline. Not sure if I'm just not looking in the right place or if I don't have permissions for that...

The ones with "Other (Not Open)" are I think all coming from Improvement Service - the count matches what they have with that string for licence. Possibly want to add some condition to exclude those when pulling from there?

ormiret avatar Dec 09 '22 16:12 ormiret

I've added a script to report what's coming through with "Custom licence:" from not matching anything in known_licences.

That's currently giving me:

Custom licence: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/    398
Custom licence: Other (Not Open)                                                              45
Custom licence: http://rightsstatements.org/vocab/NKC/1.0/                                     3
Custom licence: other-closed                                                                   1
Custom licence: http://opendatacommons.org/licenses/odbl/1-0/                                 1
Custom licence: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/      1

The "other-closed" is the only one not already covered above, that is this
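A report like the one above could be produced along these lines. This is a sketch, not the script that was added to the repo: `custom_licence_report` is a hypothetical helper, and the sample list is made up to stand in for the licence column of the merged data.

```python
from collections import Counter


def custom_licence_report(licences, prefix="Custom licence:"):
    """Count licence strings that start with the given prefix, most common first."""
    counts = Counter(lic for lic in licences if lic.startswith(prefix))
    return counts.most_common()


# Made-up sample values, not real pipeline output.
sample = [
    "Open Government Licence v3.0",
    "Custom licence: Other (Not Open)",
    "Custom licence: Other (Not Open)",
    "Custom licence: other-closed",
    "No licence",
]

report = custom_licence_report(sample)
for licence, count in report:
    # Left-align the licence text and right-align the count.
    print(f"{licence:<60}{count:>5}")
```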

ormiret avatar Dec 09 '22 17:12 ormiret