Host counts can be off by +1 (higher than actual)
cc-crawl-statistics sometimes reports a host count that is one higher than the actual number. This behavior is sporadic and doesn't always happen.
Example:
In domains-top-500.csv for CC-MAIN-2025-30:
| domain | actual host count | host count as reported by cc-crawl-statistics |
|---|---|---|
| wikipedia.org | 671 | 672: buggy report |
| pinterest.com | 50 | 50: correctly reported |
To count hosts efficiently, the assumption is made that all URLs belonging to a single host are in the same input shard. The input consists of the 300 CDX index shards, which are totally sorted by SURT URL. This ordering guarantees that all URLs of one host (and likewise of one pay-level domain or TLD) form one contiguous range of the index. However, that contiguous range may be spread over multiple shards. In this case, URLs from the same host end up in two shards and the host is counted twice. This explains why the counts for some hosts are off by +1. By chance, a host with many captures is more likely to be affected, because its range is longer and therefore more likely to cross a shard boundary.
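The effect can be reproduced with a toy example. The following is a minimal sketch, not the actual cc-crawl-statistics code: the helper names `domain_of` and `hosts_per_domain_sharded` are hypothetical, the pay-level domain is approximated by the last two host labels, and plain host names stand in for SURT-sorted CDX records.

```python
from collections import Counter


def domain_of(host):
    # Simplification (assumption): the last two labels form the pay-level domain.
    return ".".join(host.split(".")[-2:])


def hosts_per_domain_sharded(shards):
    """Count distinct hosts per domain, looking at each shard in isolation."""
    counts = Counter()
    for shard in shards:
        # Deduplication only happens within a single shard.
        for host in set(shard):
            counts[domain_of(host)] += 1
    return counts


# One entry per capture, already sorted so each host is a contiguous range.
captures = ["a.example.org"] * 2 + ["b.example.org"] * 5 + ["c.example.org"] * 1

# Two shards of equal size: the boundary falls inside the range of b.example.org.
shards = [captures[:4], captures[4:]]

sharded = hosts_per_domain_sharded(shards)    # b.example.org is seen in both shards
exact = hosts_per_domain_sharded([captures])  # single shard, no boundary

print(sharded["example.org"], exact["example.org"])  # prints "4 3"
```

The host with the most captures (`b.example.org`) is the one that straddles the boundary, which matches the observation that hosts with many captures are more likely to be affected.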
With 300 shards, at most 299 host counts can be off. About 50 million hosts are counted every month:
- It's a marginal issue if at most 299 counts are off by +1.
- A fix would be difficult to implement and computationally expensive.
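To put the bound in proportion, here is a small back-of-the-envelope calculation. It only uses the two figures given above (300 shards, roughly 50 million hosts); the variable names are illustrative.

```python
NUM_SHARDS = 300
TOTAL_HOSTS = 50_000_000  # approximate monthly figure quoted above

# 300 shards have 299 boundaries, and each boundary can split at most one host.
max_affected = NUM_SHARDS - 1
max_fraction = max_affected / TOTAL_HOSTS

print(f"at most {max_affected} of {TOTAL_HOSTS:,} hosts ({max_fraction:.6%})")
```

Under these numbers, fewer than one host in 100,000 can be affected.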