dkpro-c4corpus

Questions on statistics

Open · tfmorris opened this issue 9 years ago · 5 comments

I've been trying to wrap my head around the overall process and understand the associated numbers. The questions below are the things I can't figure out:

  • Why are the CleanEval results different for the Java & Python implementations if it's the same algorithm?
  • The Phase 1 stats are inconsistent. The text says 22 hours, but the pasted log says 10.5 hrs.
  • The Phase 1 log says there were 34901 map tasks, which is suspiciously close to the number of files in the CC-MAIN-2016-07 crawl, not the 2015-48 crawl. Are these stats for a different crawl than the others?
  • Phase 1 mapper output records is 1.1 billion which is significantly lower than the 1.73B (or 1.82B) URLs listed for the crawl. That seems like too big a difference to be accounted for by content type filters (or is my perception wrong?). Is it known what factors contribute to this delta?
  • The paper says that there were only ~1% duplicates in the Common Crawl, but the Phase 2 reducer (exact duplicates filter) appears to have only output 39% of the input records (i.e. it filtered 60%+). Am I misunderstanding the stats, or is this the actual number of exact duplicates?
  • The Phase 1 stats seem to indicate that a significant amount (40%) of time was spent in the shuffle phase, but it doesn't look like the reducer actually does anything. Could Phase 1 be implemented as a map only job? Conversely, could Phase 1 & Phase 2 be merged so that the reducer actually does useful work?
  • The Phase 3 Step 3 stats for Tuples Creation (36 hrs, 7104 normalized instance hours) seem to indicate that very few instances were used for this phase. Is that an accurate observation? Would more instances reduce the elapsed time?
  • Are there stats on how many near-duplicate documents were eliminated in Phase 3/4?

Thanks for any answers/insights you can offer!

tfmorris avatar Mar 22 '16 01:03 tfmorris

Why are the CleanEval results different for the Java & Python implementations if it's the same algorithm?

  • As far as I remember, Python has a library for parsing HTML which has no direct equivalent in Java; so it's only an approximation (maybe @OmniaZayed can say a word?)

The Phase 1 stats are inconsistent. The text says 22 hours, but the pasted log says 10.5 hrs. The Phase 1 log says there were 34901 map tasks, which is suspiciously close to the number of files in the CC-MAIN-2016-07 crawl, not the 2015-48 crawl. Are these stats for a different crawl than the others? Phase 1 mapper output records is 1.1 billion which is significantly lower than the 1.73B (or 1.82B) URLs listed for the crawl. That seems like too big a difference to be accounted for by content type filters (or is my perception wrong?). Is it known what factors contribute to this delta?

  • I was copying and pasting output from the 2016-07 (!!) crawl processing; it will be cleared up later (I had some other article deadlines but was curious to run it on the new crawl). I don't have Hadoop logs for the 2015 crawl reported in the article.

The paper says that there were only ~1% duplicates in the Common Crawl, but the Phase 2 reducer (exact duplicates filter) appears to have only output 39% of the input records (i.e. it filtered 60%+). Am I misunderstanding the stats, or is this the actual number of exact duplicates?

  • I have to investigate that deeper...

The Phase 1 stats seem to indicate that a significant amount (40%) of time was spent in the shuffle phase, but it doesn't look like the reducer actually does anything. Could Phase 1 be implemented as a map only job? Conversely, could Phase 1 & Phase 2 be merged so that the reducer actually does useful work?

  • Yep, the reducer currently does nothing in Phase 1. Phase 1 & 2 can easily be merged (the work is done in the mapper in Phase 1 and duplicates are removed in the reducer in Phase 2). We wanted to keep them separate, though, to get statistics on exact matches.

The Phase 3 Step 3 stats for Tuples Creation (36 hrs, 7104 normalized instance hours) seem to indicate that very few instances were used for this phase. Is that an accurate observation? Would more instances reduce the elapsed time?

  • As I experimented with different numbers of instances and different spot-instance prices, this involved a bit of fiddling around... I might do a proper measurement for the 2016-07 crawl.

Are there stats on how many near-duplicate documents were eliminated in Phase 3/4?

  • Not yet, maybe later for the 2016-07 data.

habernal avatar Mar 22 '16 12:03 habernal

Why are the CleanEval results different for the Java & Python implementations if it's the same algorithm?

Yes, as Ivan said, but the difference is not only due to the library used to parse the HTML page into a DOM, but also due to the library used to divide the page into paragraphs.

The Python implementation uses a library called "lxml.sax" to convert an HTML page, represented as a DOM object, into a list of paragraphs. This library controls how the HTML page is segmented into paragraphs. Since Java doesn't have a comparably accurate library, we had to implement our own way of traversing the HTML page and rendering the paragraphs. These paragraphs are then classified as good or bad text, and each paragraph's classification takes into account the classification of its neighbouring paragraphs.
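For illustration only, and not the project's actual code: a minimal Java sketch of this kind of block-level paragraph segmentation, assuming Jsoup as the HTML parser (our real implementation uses its own traversal), might look like this:

```java
// Hypothetical sketch of block-level paragraph segmentation using Jsoup.
// The actual project implements its own DOM traversal; this only illustrates
// the general idea of turning an HTML page into a list of text paragraphs.
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParagraphSegmenter {

    public static List<String> segment(String html) {
        Document doc = Jsoup.parse(html);
        List<String> paragraphs = new ArrayList<>();
        // Treat common block-level elements as paragraph boundaries.
        for (Element block : doc.body().select("p, h1, h2, h3, h4, h5, h6, li, blockquote, pre")) {
            String text = block.text().trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }
}
```

Each extracted paragraph would then be fed to the good/bad classifier, with the labels of neighbouring paragraphs influencing the final decision.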

OmniaZayed avatar Mar 22 '16 12:03 OmniaZayed

Thanks for the quick answers! I'll leave this open to learn the results of the exact duplicates investigation, but I'm happy with everything else. One of the reasons I'm interested in the times is that time/cost is one of the questions that people always ask about CommonCrawl, so it's nice to have some example data points that people can use.

A couple suggestions:

  • Converting the Phase 1 job to a map-only job would allow you to keep things separate without wasting the 4+ hrs that the shuffle phase required. That would cut processing time almost in half.
  • If you do decide to combine Phase 1 & Phase 2, you can use Hadoop counters to output interesting stats, like the number of exact matches: context.getCounter(EXACT_MATCHES).increment(1L). (A rough sketch of both ideas follows below.)
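To make the suggestions concrete, here is a minimal sketch, purely illustrative and not your actual code (class, key, and counter names are made up), of a merged Phase 1+2 reducer that keeps one document per exact-content hash and counts the duplicates it drops via a Hadoop counter; the map-only variant of Phase 1 is just a matter of calling job.setNumReduceTasks(0) in the driver:

```java
// Hypothetical sketch -- not the DKPro C4Corpus code. Class, key, and counter
// names are invented for illustration.
//
// Map-only Phase 1: in the job driver, job.setNumReduceTasks(0) writes mapper
// output straight to HDFS and skips the shuffle/sort entirely.
//
// Merged Phase 1+2: key the mapper output by an exact-content hash and let the
// reducer keep one document per key, counting the duplicates it drops so the
// exact-match statistics still show up in the job counters.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ExactDuplicateReducer extends Reducer<Text, Text, Text, Text> {

    public enum DedupCounters { EXACT_MATCHES }

    @Override
    protected void reduce(Text contentHash, Iterable<Text> docs, Context context)
            throws IOException, InterruptedException {
        boolean first = true;
        for (Text doc : docs) {
            if (first) {
                // keep the first document seen for this content hash
                context.write(contentHash, doc);
                first = false;
            } else {
                // every further document with the same hash is an exact duplicate
                context.getCounter(DedupCounters.EXACT_MATCHES).increment(1L);
            }
        }
    }
}
```

The counter value appears in the job summary, so the exact-match statistics wouldn't require a separate phase.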

Thanks again for the fast response.

tfmorris avatar Mar 22 '16 15:03 tfmorris

In case it's helpful, here are some stats from a sample segment that I was testing hashing code on:

  • 154,000 total WARC records
  • 32,695 HTTP responses with non-empty text after boilerplate extraction (21% of total)
  • 17,284 empty responses after extraction (not included above)
  • 1,442 with non-HTML/text content type (not included above)
  • 19,704 exact duplicates (60% of texts) in 6,216 sets; 139 pages in the biggest set of duplicates, 63 in the 2nd biggest
  • 1,236 near duplicates (Hamming dist < 3)

Even though I'm only testing with a single segment, which could obviously be skewed, the numbers track fairly closely with those from the full 2016-07 crawl.

One thing I still need to verify is that the "exact duplicates" from the Simhash comparisons really are identical to make sure there are no errors in my hashing.
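The check I have in mind is roughly the following, a self-contained sketch rather than my actual test code: group the extracted texts by their SimHash value and confirm that every text within a group also shares a strong content hash.

```java
// Illustrative sketch only: verify that documents grouped under the same
// SimHash value are genuinely identical by comparing a strong hash (SHA-256)
// of their extracted text. The data structures here are placeholders.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ExactDuplicateCheck {

    static String sha256(String text) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(text.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** True if every group of texts sharing a SimHash value is truly identical. */
    static boolean allGroupsIdentical(Map<Long, List<String>> textsBySimHash)
            throws NoSuchAlgorithmException {
        for (List<String> group : textsBySimHash.values()) {
            Set<String> strongHashes = new HashSet<>();
            for (String text : group) {
                strongHashes.add(sha256(text));
            }
            if (strongHashes.size() > 1) {
                // same SimHash, different content: not a true exact duplicate
                return false;
            }
        }
        return true;
    }
}
```

If any group fails the check, either the hashing or the pre-hash text normalization would be the thing to look at.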

tfmorris avatar Mar 27 '16 22:03 tfmorris

Interesting findings, Tom, thanks! It shows that exact duplicates after boilerplate removal are more common (60%), while near duplicates are a minor problem (about 10% of the remaining texts). Still lots of duplicate texts, though.

habernal avatar Mar 28 '16 08:03 habernal