wikihadoop
Stream-based InputFormat for processing the compressed XML dumps of Wikipedia with Hadoop
Hello, I am running the jar files on hadoop-1.0.4 against the latest uncompressed 42 GB dump. I am getting the following exception: [ravisg@topsail-sn bin]$ hadoop jar /var/hadoop-1.0.4/contrib/streaming/hadoop-streaming-1.0.4.jar...
Revisions around a page ending can be duplicated in the results when bzip2 input is used.
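Until this is fixed, one possible workaround is to drop the repeated records in a streaming reducer. The sketch below is not part of wikihadoop. It assumes the mapper emits one tab-separated line per revision whose first field uniquely identifies the revision (a hypothetical key layout), and it relies on Hadoop Streaming handing the reducer its input sorted by key, so duplicates arrive on adjacent lines.

```python
import sys
from typing import Iterable, Iterator


def drop_repeated_keys(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Yield only the first line for each run of lines sharing a key.

    Assumption: each line is "key<TAB>value" and the input is sorted by
    key, so duplicated revisions appear on consecutive lines.
    """
    previous_key = None
    for line in sorted_lines:
        # The key is everything before the first tab.
        key = line.split("\t", 1)[0]
        if key != previous_key:
            yield line
        previous_key = key


if __name__ == "__main__":
    # Used as a Hadoop Streaming reducer: read stdin, write stdout.
    sys.stdout.writelines(drop_repeated_keys(sys.stdin))
```

Comparing each key only with the previous one keeps memory use constant, which matters for dumps of this size. It would not catch duplicates in unsorted input.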
The reported progress percentage usually doesn't reach 100%, even when all the data has been processed. The cause of this problem is that the current implementation calculates the...
The splitting function should essentially work for more general use cases where you want to split an XML document without splitting a certain element. We could provide a separate class for splitting.
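To make the idea concrete, here is a minimal sketch of element-aligned splitting on an in-memory byte string. It is not the wikihadoop implementation: the function name `element_aligned_splits` and its parameters are hypothetical, it assumes the protected element is not nested inside another element of the same name, and a real input format would work on streams rather than a full buffer.

```python
from typing import List, Tuple


def element_aligned_splits(
    data: bytes, target_size: int, tag: bytes = b"page"
) -> List[Tuple[int, int]]:
    """Return (start, end) byte ranges that never cut through <tag>...</tag>.

    Each range is extended past its target size to the end of the next
    closing tag, so every element stays whole within one range.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    closing = b"</" + tag + b">"
    splits = []
    start = 0
    while start < len(data):
        tentative_end = min(start + target_size, len(data))
        # Step back by the tag length so a closing tag that ends at or
        # straddles the tentative boundary is still found.
        search_from = max(start, tentative_end - len(closing))
        index = data.find(closing, search_from)
        # No closing tag left: the remainder forms the final range.
        end = len(data) if index == -1 else index + len(closing)
        splits.append((start, end))
        start = end
    return splits
```

For example, calling `element_aligned_splits(xml_bytes, 64 * 1024 * 1024)` would give ranges of roughly 64 MB whose ends coincide with `</page>`. Passing a different `tag` is what a general-purpose splitting class would expose.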
We plan to provide a simple way to install both this Hadoop input format and [the diff making tool](http://svn.wikimedia.org/svnroot/mediawiki/trunk/tools/wsor/diffs/).