
switch to new Hadoop API

Open jnioche opened this issue 13 years ago • 3 comments

jnioche avatar Mar 30 '11 09:03 jnioche

See https://github.com/butlermh/behemoth/commit/7411aa9cbd0fd1bddd61545a9a503daff5d8dcf8

It turns out updating to the new API is a bad idea: DistributedCache does not work with the new API (see https://issues.apache.org/jira/browse/MAPREDUCE-898 and http://lucene.472066.n3.nabble.com/Distributed-Cache-with-New-API-td722187.html). This breaks the SOLR, UIMA and GATE modules.

Also, in the IO module, I'm not quite sure how to deal with MultiFileSplits for WARC. MultiFileSplit has been replaced by CombineFileSplit, but CombineFileSplit still implements the old-API interface InputSplit, whereas the new API uses an abstract class that is also called InputSplit. So it is not clear how this needs to change either.
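To make the mismatch concrete, here is a minimal sketch using stand-in types rather than the real Hadoop classes. `OldInputSplit`, `NewInputSplit`, `OldMultiFileSplit` and `OldSplitAdapter` are all hypothetical names made up for illustration; they only mirror the shape of `org.apache.hadoop.mapred.InputSplit` (an interface) and `org.apache.hadoop.mapreduce.InputSplit` (an abstract class). One possible way round the problem, assuming nothing else gets in the way, is a thin adapter that extends the new class and delegates to the old split:

```java
import java.util.Arrays;

public class SplitAdapterSketch {

    // Stand-in for the old-API org.apache.hadoop.mapred.InputSplit (an interface).
    // The real methods also declare IOException; omitted to keep the sketch short.
    public interface OldInputSplit {
        long getLength();
        String[] getLocations();
    }

    // Stand-in for the new-API org.apache.hadoop.mapreduce.InputSplit (an abstract class).
    public abstract static class NewInputSplit {
        public abstract long getLength();
        public abstract String[] getLocations();
    }

    // Hypothetical old-style split covering several files, loosely modelled on MultiFileSplit.
    public static class OldMultiFileSplit implements OldInputSplit {
        private final long[] lengths;
        private final String[] hosts;

        public OldMultiFileSplit(long[] lengths, String[] hosts) {
            this.lengths = lengths.clone();
            this.hosts = hosts.clone();
        }

        @Override
        public long getLength() {
            long total = 0;
            for (long length : lengths) {
                total += length;
            }
            return total;
        }

        @Override
        public String[] getLocations() {
            return hosts.clone();
        }
    }

    // Adapter: extends the new abstract class and delegates every call to an old-style split.
    public static class OldSplitAdapter extends NewInputSplit {
        private final OldInputSplit delegate;

        public OldSplitAdapter(OldInputSplit delegate) {
            this.delegate = delegate;
        }

        @Override
        public long getLength() {
            return delegate.getLength();
        }

        @Override
        public String[] getLocations() {
            return delegate.getLocations();
        }
    }

    public static void main(String[] args) {
        OldInputSplit old = new OldMultiFileSplit(
                new long[] {100L, 250L}, new String[] {"node1", "node2"});

        // NewInputSplit direct = old;  // does not compile: the two types are unrelated
        NewInputSplit adapted = new OldSplitAdapter(old);

        System.out.println(adapted.getLength());
        System.out.println(Arrays.toString(adapted.getLocations()));
    }
}
```

This only shows the type-level problem. A real split also has to be serializable by Hadoop, and the matching InputFormat and RecordReader would need porting too, none of which the sketch attempts.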

butlermh avatar May 31 '11 11:05 butlermh

See also http://autofei.wordpress.com/2011/04/07/distributedcache-incompleted-guide/

butlermh avatar May 31 '11 13:05 butlermh

In the end, I did manage to find a way of doing this, except for WARC - see https://github.com/butlermh/behemoth/commit/97150bd579ae74eefacae85422937698f2c72445

butlermh avatar Jun 06 '11 17:06 butlermh