
Host counts can be off by +1 (higher than actual)

Open handecelikkanat opened this issue 4 months ago • 1 comments

cc-crawl-statistics sometimes reports a host count that is one higher than the actual number. The behavior is sporadic and does not affect every domain.

Example:

In domains-top-500.csv for CC-MAIN-2025-30:

| domain | actual host count | host count reported by cc-crawl-statistics |
|---|---|---|
| wikipedia.org | 671 | 672 (buggy report) |
| pinterest.com | 50 | 50 (correctly reported) |

handecelikkanat avatar Aug 15 '25 14:08 handecelikkanat

To count hosts efficiently, the assumption is made that all URLs belonging to a single host are in the same input shard. The input consists of the 300 CDX index shards, which are totally sorted by SURT URL. The sort order ensures that all URLs of one host (and likewise of one pay-level domain or TLD) form one contiguous range of the index. However, that contiguous range may be spread over multiple shards. In that case, URLs from the same host are in two shards and the host is counted twice. This explains why the host counts for some domains are off by +1. A host with many captures is more likely to be affected, because its range is more likely to cross a shard boundary.
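To illustrate the effect, here is a minimal Python sketch. It is not the cc-crawl-statistics code: the helper names (`host_of`, `domain_of`, `count_hosts_per_shard`, `count_hosts_globally`) and the sample keys are made up for this example, and `domain_of` simply takes the first two SURT labels instead of consulting a public suffix list. It sums distinct hosts shard by shard and compares the result with an exact global count.

```python
from collections import Counter

def host_of(surt_key: str) -> str:
    # SURT keys look like "org,wikipedia,en)/wiki/a"; the host precedes ")".
    return surt_key.split(")", 1)[0]

def domain_of(host: str) -> str:
    # Simplifying assumption: the domain is the first two reversed labels.
    return ",".join(host.split(",")[:2])

def count_hosts_per_shard(shards):
    # Each shard is processed independently and the per-shard counts are
    # summed, so a host present in two shards is counted twice.
    totals = Counter()
    for shard in shards:
        for host in {host_of(key) for key in shard}:
            totals[domain_of(host)] += 1
    return totals

def count_hosts_globally(shards):
    # Exact count: deduplicate hosts across all shards first.
    all_hosts = {host_of(key) for shard in shards for key in shard}
    return Counter(domain_of(host) for host in all_hosts)

# Hypothetical, totally sorted SURT keys.
sorted_keys = [
    "com,pinterest)/",
    "com,pinterest,www)/pin/1",
    "org,wikipedia,de)/wiki/a",
    "org,wikipedia,en)/wiki/a",
    "org,wikipedia,en)/wiki/b",
    "org,wikipedia,fr)/wiki/a",
]

# The shard boundary falls inside the range of host "org,wikipedia,en".
shards = [sorted_keys[:4], sorted_keys[4:]]

per_shard = count_hosts_per_shard(shards)
exact = count_hosts_globally(shards)

print(per_shard["org,wikipedia"], exact["org,wikipedia"])   # 4 3
print(per_shard["com,pinterest"], exact["com,pinterest"])   # 2 2

# Each shard boundary can split the range of at most one host, so the
# total overcount is bounded by the number of boundaries.
overcount = sum(per_shard.values()) - sum(exact.values())
assert overcount <= len(shards) - 1
```

In this toy input only the host whose URLs straddle the boundary is counted twice, which inflates the count of its domain by one and leaves the other domain untouched.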

With 300 shards there are 299 shard boundaries, so at most 299 host counts can be off. About 50 million hosts are counted every month:

  • It's a marginal issue if at most 299 counts are off by +1.
  • A fix would be difficult to implement and computationally expensive.

sebastian-nagel avatar Aug 18 '25 13:08 sebastian-nagel