wikihadoop
Duplicated revision pairs when bzip2 input is used
Revisions around a page ending can be duplicated in the results when bzip2 input is used.
Hiya! I'm talking to Aaron Halfaker right now! We are thinking about using this again. Is this still an issue? He seems to remember you guys resolving this.
I believe it is, although there shouldn't be many duplicates. If you change "<=" in the last assertion in testSplitCompressed() to "==", the test no longer passes (ideally it should). According to the error I get there, the scale of duplicates looks like this: "expected: 93939, found: 93946".
The problem is in the way bzip2 files can be split: splits must be aligned to bzip2 blocks, which might end in the middle of a revision. To avoid losing any revisions, I had to implement it so that some revisions are covered twice.
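To illustrate the effect, here is a toy model rather than WikiHadoop's actual reader. It assumes each revision occupies a byte range in the decompressed stream and that a split emits every revision overlapping its range, so a revision straddling a block boundary is emitted by both neighbouring splits. The class name, method names, and offsets are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitOverlapSketch {
    // Returns the indices of all revisions whose byte range [start, end)
    // overlaps the split [splitStart, splitEnd). Hypothetical model only.
    static List<Integer> readSplit(long[][] revisions, long splitStart, long splitEnd) {
        List<Integer> emitted = new ArrayList<>();
        for (int i = 0; i < revisions.length; i++) {
            long start = revisions[i][0];
            long end = revisions[i][1];
            if (start < splitEnd && end > splitStart) {
                emitted.add(i);
            }
        }
        return emitted;
    }

    // Reads every split delimited by consecutive block boundaries and
    // concatenates the results, as independent map tasks would.
    static List<Integer> readAll(long[][] revisions, long[] boundaries) {
        List<Integer> emitted = new ArrayList<>();
        for (int b = 0; b + 1 < boundaries.length; b++) {
            emitted.addAll(readSplit(revisions, boundaries[b], boundaries[b + 1]));
        }
        return emitted;
    }
}
```

With revisions at [0, 40), [40, 120), and [120, 200) and block boundaries at 0, 100, and 200, the middle revision crosses the boundary at 100 and is emitted by both splits.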
It might make sense to solve this by adding another Hadoop job to the larger workflow to remove duplicates. (Looking back, I have a very vague memory of discussing a neater solution, but it wasn't implemented in the end.)
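As a rough sketch of what that deduplication step would do, here is the core logic in plain Java, without the Hadoop plumbing. It assumes each output record starts with a key that uniquely identifies the revision pair (for example a page ID and revision ID), followed by a tab; that record layout and all names below are assumptions, not WikiHadoop's actual output format.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RevisionDedupSketch {
    // Keeps the first record seen for each key and drops later repeats.
    // The key is the text before the first tab (assumed layout); a record
    // without a tab is treated as being its own key.
    static List<String> dedupe(List<String> records) {
        Set<String> seenKeys = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String record : records) {
            int tab = record.indexOf('\t');
            String key = tab < 0 ? record : record.substring(0, tab);
            if (seenKeys.add(key)) {
                unique.add(record);
            }
        }
        return unique;
    }
}
```

In an actual Hadoop job the same idea would presumably be expressed by emitting that key from the mapper and writing a single value per key in the reducer, so no in-memory set is needed.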