behemoth
switch to new Hadoop API
See https://github.com/butlermh/behemoth/commit/7411aa9cbd0fd1bddd61545a9a503daff5d8dcf8
It turns out that updating to the new API is a bad idea: DistributedCache does not work with the new API (see https://issues.apache.org/jira/browse/MAPREDUCE-898 and http://lucene.472066.n3.nabble.com/Distributed-Cache-with-New-API-td722187.html). This breaks the SOLR, UIMA and GATE modules.
Also, for WARC in the IO module, I am not quite sure how to deal with MultiFileSplit. It has been replaced by CombineFileSplit, but that still implements the old API's InputSplit interface, whereas the new API uses an abstract class that is also called InputSplit. So it is not clear how this needs to change either.
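To make the InputSplit mismatch concrete, here is a minimal sketch in plain Java. It does not use Hadoop at all: OldInputSplit, NewInputSplit, OldMultiFileSplit and OldSplitAdapter are hypothetical, simplified stand-ins for the old interface, the new abstract class, and a split written against the old interface. It only shows why such a split cannot be passed where the new type is expected, and that a wrapper is one possible way around it. This is an assumption for illustration, not what the commits linked in this thread do, and real splits would also need their checked exceptions and serialization handled.

```java
public class SplitAdapterSketch {

    // Hypothetical stand-in for the old-API InputSplit, which is an interface.
    interface OldInputSplit {
        long getLength();
        String[] getLocations();
    }

    // Hypothetical stand-in for the new-API InputSplit, which is an abstract class.
    static abstract class NewInputSplit {
        public abstract long getLength();
        public abstract String[] getLocations();
    }

    // Hypothetical multi-file split written against the old interface.
    static final class OldMultiFileSplit implements OldInputSplit {
        private final long[] lengths;

        OldMultiFileSplit(long... lengths) {
            this.lengths = lengths.clone();
        }

        @Override
        public long getLength() {
            // The split length is the sum of the lengths of its files.
            long total = 0;
            for (long length : lengths) {
                total += length;
            }
            return total;
        }

        @Override
        public String[] getLocations() {
            return new String[0];
        }
    }

    // Wraps an old-style split so it can be used where the new type is required.
    static final class OldSplitAdapter extends NewInputSplit {
        private final OldInputSplit delegate;

        OldSplitAdapter(OldInputSplit delegate) {
            this.delegate = delegate;
        }

        @Override
        public long getLength() {
            return delegate.getLength();
        }

        @Override
        public String[] getLocations() {
            return delegate.getLocations();
        }
    }

    // Code written for the new API only accepts the abstract class.
    static long totalLength(NewInputSplit split) {
        return split.getLength();
    }

    public static void main(String[] args) {
        OldInputSplit old = new OldMultiFileSplit(10, 20, 30);
        // totalLength(old) would not compile: OldInputSplit is not a NewInputSplit.
        NewInputSplit adapted = new OldSplitAdapter(old);
        System.out.println(totalLength(adapted));
    }
}
```

Whether wrapping is viable for the WARC input format depends on how the framework creates and serializes splits, which this sketch does not cover.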
See also http://autofei.wordpress.com/2011/04/07/distributedcache-incompleted-guide/
In the end, I did manage to find a way of doing this for everything except WARC; see https://github.com/butlermh/behemoth/commit/97150bd579ae74eefacae85422937698f2c72445