Tools and Workflows for Repeatable Sharable Data Cleaning / ETL / Processing
- drake - https://github.com/Factual/drake/
- ScraperWiki (see also ScraperWikiX below)
- Data Explorer - http://explorer.okfnlabs.org/
- shell scripts
- https://github.com/rgrp/dataset-gla/blob/master/scripts/clean.sh
- ETL toolchains such as ... ?
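A repeatable clean.sh in the spirit of the script linked above can be sketched as follows; the file names and the sed/cut steps are hypothetical placeholders, not taken from the linked repo:

```shell
#!/bin/sh
# Minimal sketch of a repeatable clean-up script. File names and the
# sed/cut transformations are illustrative placeholders.
set -e                                    # abort on the first failing step

mkdir -p cache data

# In a real script this would be a cached download, e.g.:
#   [ -f cache/raw.csv ] || curl -sS -o cache/raw.csv "$SOURCE_URL"
printf 'id, name ,value\n1, alice ,10\n2, bob ,20\n' > cache/raw.csv

# Normalise whitespace around delimiters, then keep only the columns we need
sed 's/ *, */,/g' cache/raw.csv | cut -d, -f1,3 > data/clean.csv

cat data/clean.csv
```

Because every step reads from cache/ and writes to data/, re-running the script from a clean checkout reproduces the same output, which is the property all the tools in this list are trying to guarantee.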
I'm building ScraperWikiX at https://github.com/rossjones/ScraperWikiX/
OpenRefine recipes/vignettes, especially for "standardised" data formats? e.g. http://schoolofdata.org/2013/07/26/using-openrefine-to-clean-multiple-documents-in-the-same-way/
https://github.com/OpenRefine/OpenRefine
Apache OODT? http://oodt.apache.org/ Check out DRAT (Distributed Release Audit Tool) as an example of OODT ETL in action: http://github.com/chrismattmann/drat.git
tuttle is also a tool for repeatable workflows that is friendly to team collaboration and continuous integration (e.g. a Jenkins job updating the data every hour)
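As a rough illustration of the "update data every hour" pattern, a CI job (Jenkins, cron, etc.) might wrap the ETL step like this; the file names and the ETL command itself are stand-ins, not tuttle's actual interface:

```shell
#!/bin/sh
# Sketch of an hourly CI refresh: run the ETL step, then publish the
# result only if it actually changed. data_new.csv / data.csv are stand-ins.
set -e

# Stand-in for the real extract/transform step
printf 'id,value\n1,10\n' > data_new.csv

if cmp -s data_new.csv data.csv 2>/dev/null; then
    rm data_new.csv                       # nothing changed; discard the rerun
    echo "unchanged"
else
    mv data_new.csv data.csv              # publish the refreshed data
    echo "updated"
fi
```

Writing to a temporary file and comparing before publishing keeps downstream consumers from seeing half-written data and makes the hourly job cheap when the source has not changed.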