hadoop-video-ocr icon indicating copy to clipboard operation
hadoop-video-ocr copied to clipboard

MapReduce job for text recognition from YouTube videos

Hadoop OCR

MapReduce job for text recognition from YouTube videos.

Installation

Tested on Ubuntu 14.04, but highly recommend to use Ubuntu 14.10.

  • If you are using Ubutnu 14.04 or lower, add Ubuntu 14.10 repository to sources.list. We need it to install latest opencv 2.4.9 libraries.
echo 'deb http://mirrors.kernel.org/ubuntu utopic main universe' >> /etc/apt/sources.list
sudo apt-get update
  • Install tesseract

    sudo apt-get install tesseract-ocr

  • Install libopencv-dev 2.4.9

    sudo apt-get install libopencv-dev

  • Install Hadoop 2.6.0, CDH5 recommended

  • Checkout reposotory and build jar with mvn clean install

  • Create initial HDFS structure

    sudo -u hdfs hadoop fs -mkdir /user/ubuntu
    sudo -u hdfs hadoop fs -chown ubutnu /user/ubutnu
    
  • Create input and set input file

    scripts/set_url.sh <youtube_url>
    
  • Run MapReduce job

    scripts/clean.sh && scripts/run.sh <absolute_path_to_jar> && scripts/result.sh