
fix for NUTCH-2455 more efficient usage of hostdb in generate

Open okedoki opened this issue 7 years ago • 6 comments

Three questions/modifications are left open:

  1. In several places in the Nutch code we use url.getHost(), in others url.getHost().toLowerCase(). Why?
  2. public static class ScoreHostKeyComparator extends WritableComparator should implement a raw comparator. If you know how to do it, you are welcome to.
  3. The whole Generator file is too big; it should be split into several files. Again, if you know how to do this cleanly, you are welcome to.
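On point 2, a raw comparator orders keys by reading their serialized bytes directly, so the sort does not have to deserialize every key. The following is only a sketch in plain Java: the `compare` parameter list matches Hadoop's `RawComparator`, but the key layout (a 4-byte big-endian float score followed by the UTF-8 host bytes), the descending score order, and the class and method names are assumptions for illustration, not the layout used by ScoreHostKeyComparator in this patch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class RawScoreHostComparatorSketch {

  // Hypothetical key layout: 4-byte big-endian float score, then UTF-8 host bytes.
  static byte[] serialize(float score, String host) {
    byte[] hostBytes = host.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(4 + hostBytes.length)
        .putFloat(score)
        .put(hostBytes)
        .array();
  }

  // Same parameter list as Hadoop's RawComparator.compare:
  // (buffer, start offset, length) for each of the two serialized keys.
  static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // Read the scores straight from the buffers (ByteBuffer is big-endian by default).
    float score1 = ByteBuffer.wrap(b1, s1, 4).getFloat();
    float score2 = ByteBuffer.wrap(b2, s2, 4).getFloat();
    int byScore = Float.compare(score2, score1); // assumed order: higher score first
    if (byScore != 0) {
      return byScore;
    }
    // Tie-break on the host bytes, compared lexicographically as unsigned values.
    int n = Math.min(l1, l2);
    for (int i = 4; i < n; i++) {
      int a = b1[s1 + i] & 0xff;
      int b = b2[s2 + i] & 0xff;
      if (a != b) {
        return a - b;
      }
    }
    return l1 - l2;
  }
}
```

In a real job the same logic would live in an override of the byte-level `compare` in the WritableComparator subclass, with offsets adjusted to whatever the key's `write` method actually emits.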

okedoki avatar Dec 08 '17 16:12 okedoki

Please review only the fix for NUTCH-2455, more efficient usage of hostdb in generate (c1ce018d93aac482e98634c581efa4188cdde053).

The "added id to output files" commit is not correct; I have reverted it.

okedoki avatar Dec 13 '17 10:12 okedoki

I found a bug in the partitioner that prevented hostdb data from reaching the correct reducer. It is fixed. Second, I have applied the Eclipse auto-formatting as suggested by @lewismc.
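For context, the invariant such a partitioner has to keep can be sketched as follows: the partition number must depend on the host alone, so that a hostdb record and the URL records of the same host end up in the same reducer. This is a plain-Java illustration under stated assumptions, not the code from the patch: the `Key` class, its fields, and `partitionFor` are hypothetical names, and only the hash formula is taken from Hadoop's `HashPartitioner`.

```java
class HostPartitionSketch {

  // Hypothetical composite key: a host plus fields that must NOT affect partitioning.
  static final class Key {
    final String host;
    final float score;
    final boolean fromHostDb;

    Key(String host, float score, boolean fromHostDb) {
      this.host = host;
      this.score = score;
      this.fromHostDb = fromHostDb;
    }
  }

  // Same formula as Hadoop's HashPartitioner, applied to the host only.
  // Masking with Integer.MAX_VALUE keeps the result non-negative.
  static int partitionFor(Key key, int numReducers) {
    return (key.host.hashCode() & Integer.MAX_VALUE) % numReducers;
  }
}
```

If the score or the record type leaked into the hash, a hostdb entry could be routed to a different reducer than the URLs it describes, which matches the symptom described above.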

For some reason, I have a conflict with Generator from master. I assume it happened because of the auto-formatting: instead of a proper diff, it shows the whole Generator code as replaced.

What is the rule for fixing this?

okedoki avatar Dec 28 '17 10:12 okedoki

Mmmm OK @okedoki, we need to resolve this conflict. The issue is that, by the looks of it, you have indented everything by 4 spaces. This is incorrect: the code formatting template uses 2-space indents. Please update the pull request again if you could. Thanks

lewismc avatar Dec 30 '17 01:12 lewismc

@lewismc Finally, I managed to resolve the merge conflict. Please review before Generator is modified again.

okedoki avatar Jan 18 '18 12:01 okedoki

@okedoki thank you very much, this is a big patch and we need to test it out.

lewismc avatar Jan 18 '18 17:01 lewismc

There was a silly bug: hostdb data was not copied correctly in the reducer because of copy-by-reference. My bad, fixed with clone.

hostDomainCounts.put(key.second.toString(), new MutablePair<HostDatum, int[]>((HostDatum) hostDatum.clone(), new int []{1,0})); at line 484
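The effect of this kind of bug can be reproduced in a few lines of plain Java. The sketch below assumes a single mutable object that is overwritten on every iteration, which is how Hadoop reuses value instances while a reducer iterates; `Datum` and `collect` are hypothetical stand-ins, not HostDatum or Generator code.

```java
import java.util.ArrayList;
import java.util.List;

class ReusedValueSketch {

  // Hypothetical stand-in for a mutable value object that the framework reuses.
  static class Datum implements Cloneable {
    String host;

    @Override
    public Datum clone() {
      try {
        return (Datum) super.clone();
      } catch (CloneNotSupportedException e) {
        throw new AssertionError(e); // cannot happen: Datum is Cloneable
      }
    }
  }

  // Stores one entry per host, either by reference or as a copy,
  // then returns the host names found in the stored entries.
  static List<String> collect(String[] hosts, boolean copy) {
    Datum reused = new Datum(); // one instance, overwritten on each iteration
    List<Datum> kept = new ArrayList<>();
    for (String host : hosts) {
      reused.host = host;
      kept.add(copy ? reused.clone() : reused);
    }
    List<String> result = new ArrayList<>();
    for (Datum d : kept) {
      result.add(d.host);
    }
    return result;
  }
}
```

With `copy` set to false, every stored entry points at the same object and reports the last host seen; with `copy` set to true, each entry keeps its own value, which is what the clone call in the fix achieves.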

okedoki avatar Jan 25 '18 14:01 okedoki