cascading
cascading copied to clipboard
Add Platform test to show Hadoop and Tez partitioning difference
Noticed this on one of our test jobs that we were using to compare the performance of MR and Tez. I've built a unit test to show a subset of the graph where Cascading on Hadoop is combining more nodes and thus lowering the quantity of data streamed between nodes / steps.
The job starts off with two vertices V0, V1 reading around 3,025,369,753 tuples (10 odd TB). They're then merged + grouped in vertex V2. This is then passed on to Vertex V3 which performs some aggregations (everys) and reduces the data to around 1 TB.
In case of Hadoop, V0, V1 are done on the job's mappers. V2 + V3 are combined and done on the reducers. We then end up writing out this 1TB or so of data and that's picked up by the downstream steps.
Wondering if we should have a rule to collapse these aggregations into the step doing the groupBy?
Leaving this open in the hope I have time to look into it, even though it's likely no longer a concern.