RefineOnSpark

RefineOnSpark is a driver program for running OpenRefine jobs on a Spark cluster.

  1. Prerequisites on the cluster

  • An instance of OpenRefine is up and bound to the default address, localhost:3333.
  • Input files are served via HDFS. Local files are also accepted, but they must be located under the same path on all the worker nodes.
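
The two prerequisites above can be checked before a job is submitted. The following is a minimal sketch and not part of RefineOnSpark itself: the function names `refine_is_listening` and `input_is_usable` are hypothetical, and the check assumes that a successful TCP connection to localhost:3333 means OpenRefine is the process listening there.

```python
import os
import socket
from urllib.parse import urlparse

# Default OpenRefine address named in the prerequisites.
REFINE_HOST = "localhost"
REFINE_PORT = 3333


def refine_is_listening(host=REFINE_HOST, port=REFINE_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    This only shows that something accepts connections on the port;
    it does not verify that the listener is actually OpenRefine.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def input_is_usable(path):
    """Return True if the input looks usable from this node.

    hdfs:// URIs are assumed to be reachable cluster-wide and are not
    probed here. Anything else is treated as a local file that must
    exist on this node (and, per the prerequisites, on every worker).
    """
    parsed = urlparse(path)
    if parsed.scheme == "hdfs":
        return True
    local_path = parsed.path if parsed.scheme == "file" else path
    return os.path.isfile(local_path)
```

Because local inputs have to exist under the same path on every worker, a check like `input_is_usable("/data/in.csv")` would need to be run on each worker node, not only on the machine that submits the job.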
  2. Application taxonomy

TODO