Innovation-Data-Processing-Scripts

A shared repository for data cleaning scripts used for innovation data.

I3 Shared Data Processing Scripts

This is a shared repository for data processing scripts, with a focus on innovation-related data. 'Processing' in this context can refer to a number of different operations, including (but not limited to):

  • normalisation
  • disambiguation and entity reconciliation
  • web scraping
  • parsing web-scraped data
  • transforming and merging datasets
  • standardising datasets
  • deduplication
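
As an illustration only, here is a minimal sketch of two of the operations listed above, normalisation and deduplication, applied to organisation names. It is not taken from any script in this repository: the function names `normalise_name` and `deduplicate` and the sample records are hypothetical, and real innovation datasets usually need more careful entity reconciliation than exact matching on a normalised key.

```python
import re
import unicodedata


def normalise_name(name: str) -> str:
    """Return a lowercase, accent-free, punctuation-free version of a name.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every run of non-alphanumeric characters with a single space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", unaccented.lower())
    return cleaned.strip()


def deduplicate(names: list[str]) -> list[str]:
    """Keep the first spelling seen for each normalised name, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for name in names:
        key = normalise_name(name)
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique


if __name__ == "__main__":
    # Made-up sample records, used only to show the behaviour.
    records = ["Société Générale", "societe generale", "ACME, Inc.", "Acme Inc"]
    print(deduplicate(records))
```

Exact matching on a normalised key is the simplest possible approach: it only catches differences in case, accents, and punctuation. Variants such as abbreviations or reordered words need fuzzy matching or a dedicated disambiguation step.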

Adding to the catalog

If you'd like to link to some data processing scripts, or upload some, please read our contribution guidelines and open a pull request using the pull request template. Links to external repositories are added below; uploaded scripts get their own folder.

Using code from this repository

Each folder here contains a repository of data processing scripts (or, more commonly, a link to one plus a description) contributed by a member of the community. Each repository listed here should be documented well enough to tell you how to run it and on what data. If you have problems with code files that are hosted directly in this repository, please open a GitHub issue, or a pull request if you have fixed the problem and would like to amend the documentation. If you're having trouble with an external repository that is linked by URL, raise an issue in that repository instead.

Patent data

Graph visualizations

Scholarly + scientific data

Benchmarks and other meta-datasets

  • Alaska: A data pipeline benchmark, with profiling data

Other (to review)

  • The AllenNLP Guide - general-purpose
  • linked-uspto-patent-data (RDF)
  • forward43 (social innovation)