wikihadoop Duplicated revision pairs when bzip2 input is used

Open whym opened this issue 13 years ago • 2 comments

Revisions around a page ending can be duplicated in the results when bzip2 input is used.

Aug 16 '11 17:08 whym

Hiya! I'm talking to Aaron Halfaker right now! We are thinking about using this again. Is this still an issue? He seems to remember you guys resolving this.

Oct 09 '14 18:10 ottomata

I believe it is, although the duplicates shouldn't be too many. Change "<=" in the last assertion in testSplitCompressed() to "==", and it won't pass (while it ideally should). According to the error I get there, the scale of duplicates looks like this: "expected: 93939, found: 93946".

The problem is in the way bzip2 files can be split: splits must be aligned to bzip2 blocks, which might end in the middle of a revision. To avoid losing any revision, I had to implement it so that some revisions are covered twice.
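To illustrate the effect described above, here is a toy model (not wikihadoop's actual reader). It assumes each split emits every record that overlaps its byte range, so a record straddling a block boundary is emitted by both neighbouring splits. The record offsets, the boundary positions, and the function names are all made up for the example.

```python
from collections import Counter

def records_for_split(records, split_start, split_end):
    """Return ids of all records overlapping the byte range [split_start, split_end)."""
    return [rid for rid, start, end in records
            if start < split_end and end > split_start]

def read_all_splits(records, boundaries):
    """Read every split in turn and concatenate what each one emits."""
    emitted = []
    for split_start, split_end in zip(boundaries, boundaries[1:]):
        emitted.extend(records_for_split(records, split_start, split_end))
    return emitted

# Hypothetical data: (revision id, start offset, end offset).
records = [("r1", 0, 40), ("r2", 40, 90), ("r3", 90, 130), ("r4", 130, 200)]
# Hypothetical block boundaries; the one at offset 100 falls inside r3.
boundaries = [0, 100, 200]

emitted = read_all_splits(records, boundaries)
duplicates = [rid for rid, count in Counter(emitted).items() if count > 1]
print(emitted)
print(duplicates)
```

In this model nothing is lost, but `r3` appears once per split that touches it, which matches the kind of small surplus reported above.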

It might make sense to solve this by adding another Hadoop job to the larger workflow to remove duplicates. (Looking back, I have a very vague memory of discussing a neater solution, but in any case it wasn't implemented in the end.)
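A minimal sketch of what such a duplicate-removal step could look like, written as plain Python rather than as an actual Hadoop job. It assumes each output line starts with a key that identifies the revision pair (here a hypothetical `page_id:revision_id`), followed by a tab and the payload, and that the lines arrive sorted by that key. The key format and the function name are assumptions for illustration only.

```python
from itertools import groupby

def dedup_sorted(lines, key=lambda line: line.split("\t", 1)[0]):
    """Yield the first line of each run of equal keys.

    Assumes `lines` is already sorted (or at least grouped) by key,
    so duplicates are always adjacent.
    """
    for _, group in groupby(lines, key=key):
        yield next(group)

# Hypothetical records: "page_id:revision_id<TAB>payload".
sample = [
    "12:100\tdiff-a",
    "12:101\tdiff-b",
    "12:101\tdiff-b",  # duplicate produced at a split boundary
    "13:7\tdiff-c",
]
unique = list(dedup_sorted(sample))
print(unique)
```

Because only adjacent lines are compared, the step needs constant memory. It relies on the input being sorted by key, which is what a reducer normally receives after the shuffle phase, so the same function could in principle be fed from `sys.stdin` in a streaming reducer.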

Oct 10 '14 14:10 whym
