nutch
Fix for NUTCH-2455: more efficient usage of hostdb in generate
Three questions/modifications are left open:
- In several places in the Nutch code we use url.getHost(), in others we use url.getHost().toLowerCase(). Why?
- public static class ScoreHostKeyComparator extends WritableComparator should implement a raw comparator. If you know how to do it, you are welcome to do so.
- The whole Generator file is too big; it should be split into several files. Again, if you know how to do this in a good way, you are welcome to.
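On the first question, a minimal sketch of why the two forms can differ: as far as I know, java.net.URL.getHost() returns the host as it was written in the URL, without lowercasing it. The class and method names here (HostCaseSketch, rawHost, normalizedHost) are hypothetical and not taken from the Nutch code.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.util.Locale;

final class HostCaseSketch {
  /** Host exactly as URL.getHost() reports it (case is preserved). */
  static String rawHost(String url) {
    try {
      return URI.create(url).toURL().getHost();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(url, e);
    }
  }

  /** Host lowercased with a fixed locale, so it can be used as a stable map key. */
  static String normalizedHost(String url) {
    return rawHost(url).toLowerCase(Locale.ROOT);
  }
}
```

If host names are used as keys when joining against the hostdb, mixing the two forms could make "Example.COM" and "example.com" count as different hosts, so it may be worth picking one form consistently.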
Please review only the commit "fix for NUTCH-2455 more efficient usage of hostdb in generate" (c1ce018d93aac482e98634c581efa4188cdde053).
The "added id to output files" commit is not correct; I have reverted it.
I found a bug in the partitioner that prevented the correct hostdb data from reaching the correct reducer. It is fixed. Second, I have applied the Eclipse auto-formatting as suggested by @lewismc.
For some reason, I have a conflict with Generator from master. I assume it happened because of the auto-formatting, so instead of a proper comparison the diff shows the whole code of Generator as replaced.
What is the rule for fixing this case?
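To illustrate the kind of partitioning problem meant here, a minimal sketch in plain Java (no Hadoop classes). It assumes a composite key made of a score and a host; ScoreHostKey, partitionByWholeKey and partitionByHost are hypothetical names, and this is not the actual code of the patch. The point is only that the partition must depend on the host alone, otherwise the hostdb record and the URLs of the same host can land on different reducers.

```java
import java.util.Objects;

final class HostPartitionSketch {
  /** Hypothetical composite key: a score used for sorting plus the host name. */
  static final class ScoreHostKey {
    final float score;
    final String host;

    ScoreHostKey(float score, String host) {
      this.score = score;
      this.host = host;
    }

    @Override
    public int hashCode() {
      return Objects.hash(score, host);
    }
  }

  /** Problematic: the score influences the partition, so one host may be spread over reducers. */
  static int partitionByWholeKey(ScoreHostKey key, int numReducers) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
  }

  /** Only the host decides the partition, so all records of a host meet in one reducer. */
  static int partitionByHost(ScoreHostKey key, int numReducers) {
    return (key.host.hashCode() & Integer.MAX_VALUE) % numReducers;
  }
}
```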
Mmmm OK @okedoki we need to resolve this conflict. The issue here is that you have indented everything by 4 spaces, by the looks of it. This is incorrect, as the code formatting template specifies 2-space indents. Please update the pull request again if you could. Thanks
@lewismc Finally, I managed to resolve the merge conflict. Please review before Generator is modified again.
@okedoki thank you very much, this is a big patch and we need to test it out.
There was a silly bug: the hostdb data was not copied correctly in the reducer because of copy-by-reference. My bad, fixed with clone.
hostDomainCounts.put(key.second.toString(), new MutablePair<HostDatum, int[]>((HostDatum) hostDatum.clone(), new int []{1,0})); at line 484
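For anyone reading along, a minimal sketch of the copy-by-reference problem in plain Java. It assumes the framework hands the reducer the same mutable value object on every iteration, which is how Hadoop value iterators usually behave; Datum is a hypothetical stand-in for HostDatum, and ReusedValueSketch is not part of the patch.

```java
import java.util.HashMap;
import java.util.Map;

final class ReusedValueSketch {
  /** Hypothetical stand-in for a mutable, reusable value such as HostDatum. */
  static final class Datum implements Cloneable {
    int score;

    @Override
    public Datum clone() {
      try {
        return (Datum) super.clone();
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e); // cannot happen, Datum is Cloneable
      }
    }
  }

  /** Stores one datum per host, either by reference or as a clone. */
  static Map<String, Datum> collect(boolean copy) {
    Map<String, Datum> byHost = new HashMap<>();
    Datum reused = new Datum(); // one instance, overwritten for every record
    String[] hosts = {"a.example", "b.example"};
    int[] scores = {1, 2};
    for (int i = 0; i < hosts.length; i++) {
      reused.score = scores[i];
      // Without the clone, every map entry points at the same object.
      byHost.put(hosts[i], copy ? reused.clone() : reused);
    }
    return byHost;
  }
}
```

With collect(false) every entry ends up showing the values of the last record seen, while collect(true) keeps a separate snapshot per host, which is the effect of the hostDatum.clone() call above.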