Tools and Workflows for Repeatable Sharable Data Cleaning / ETL / Processing

Open rufuspollock opened this issue 11 years ago • 5 comments

  • drake - https://github.com/Factual/drake/
  • scraperwiki X
  • Data Explorer - http://explorer.okfnlabs.org/
  • shell scripts
    • https://github.com/rgrp/dataset-gla/blob/master/scripts/clean.sh
  • ETL toolchains such as ... ?

rufuspollock, Jul 19 '13 18:07

I'm building ScraperWikiX at https://github.com/rossjones/ScraperWikiX/

rossjones, Jul 21 '13 19:07

OpenRefine recipes/vignettes, especially for "standardised" data formats? E.g. http://schoolofdata.org/2013/07/26/using-openrefine-to-clean-multiple-documents-in-the-same-way/

psychemedia, Aug 06 '13 09:08

https://github.com/OpenRefine/OpenRefine

webysther, Sep 30 '15 21:09

Apache OODT? http://oodt.apache.org/ Check out DRAT (Distributed Release Audit Tool) as an example of OODT ETL in action: http://github.com/chrismattmann/drat.git

chrismattmann, Oct 01 '15 06:10

tuttle is also a tool for repeatable workflows that works well with team collaboration and continuous integration (for example, a Jenkins job that updates the data every hour).

lexman, Oct 07 '15 15:10
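
For readers new to the idea, here is a rough sketch of what "repeatable" means for a single cleaning step: the step is re-run only when its input is newer than its output, so a scheduled job can call it as often as it likes. This is not how drake, tuttle, or any other tool in this thread is implemented. The function names (`is_stale`, `clean_csv`, `run_step`) and the file names are hypothetical, chosen only for illustration.

```python
import csv
import os
import tempfile


def is_stale(source, target):
    """Return True when target is missing or older than source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)


def clean_csv(source, target):
    """Strip whitespace from every cell and drop rows with no content."""
    with open(source, newline="") as src, open(target, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            cells = [cell.strip() for cell in row]
            if any(cells):
                writer.writerow(cells)


def run_step(source, target):
    """Run the cleaning step only if needed; return True if it ran."""
    if not is_stale(source, target):
        return False
    clean_csv(source, target)
    return True


if __name__ == "__main__":
    # Hypothetical demo data written to a throwaway directory.
    with tempfile.TemporaryDirectory() as workdir:
        raw = os.path.join(workdir, "raw.csv")
        clean = os.path.join(workdir, "clean.csv")
        with open(raw, "w", newline="") as handle:
            handle.write(" name , value \n\n alice , 1 \n")
        print("first run executed:", run_step(raw, clean))
        print("second run executed:", run_step(raw, clean))
```

Because the second call finds an up-to-date output, it does nothing, which is what makes it safe to trigger from an hourly job. A real workflow tool would also track dependencies between many such steps.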