httparchive.org
Reduce number of unidentified resource types
About 44M resources (4%) in the latest crawl were unidentified, i.e. assigned the type `other`. I'd like to investigate how many of these could be mapped to known types using the available metadata.
| type | freq | pct |
|---|---|---|
| image | 468,022,575 | 42.53% |
| script | 319,511,063 | 29.03% |
| css | 114,376,167 | 10.39% |
| html | 87,514,158 | 7.95% |
| font | 53,919,832 | 4.90% |
| other | 44,428,455 | 4.04% |
| text | 7,024,852 | 0.64% |
| video | 3,568,673 | 0.32% |
| audio | 1,115,662 | 0.10% |
| xml | 1,089,915 | 0.10% |
-- Count requests per client (desktop/mobile table suffix) and resource type.
SELECT
  _TABLE_SUFFIX AS client,
  type,
  COUNT(0) AS freq
FROM
  `httparchive.summary_requests.2020_08_01_*`
GROUP BY
  client,
  type
ORDER BY
  freq DESC
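The query returns raw counts only, while the pct column in the table is each type's share of all requests. A minimal sketch of that calculation, assuming the freq values from the table are the complete set of types (the variable names are illustrative):

```python
# Request counts per type, copied from the table above.
freq = {
    "image": 468_022_575,
    "script": 319_511_063,
    "css": 114_376_167,
    "html": 87_514_158,
    "font": 53_919_832,
    "other": 44_428_455,
    "text": 7_024_852,
    "video": 3_568_673,
    "audio": 1_115_662,
    "xml": 1_089_915,
}

# Each type's share of all requests, in percent.
total = sum(freq.values())
pct = {resource_type: 100 * n / total for resource_type, n in freq.items()}

for resource_type, share in pct.items():
    print(f"{resource_type:<7}{freq[resource_type]:>13,}{share:>8.2f}%")
```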
About half of the `other` requests are 302 redirects (nearly 60% when 301, 303, and 307 are counted as well), and another 20% are 204s (i.e. no content).
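-- Break down the `other` requests by HTTP status, per client.
SELECT
  _TABLE_SUFFIX AS client,
  status,
  COUNT(0) AS freq,
  -- Share of this status among the client's `other` requests.
  COUNT(0) / SUM(COUNT(0)) OVER (PARTITION BY _TABLE_SUFFIX) AS pct,
  -- Up to five example URLs per status for manual inspection.
  ARRAY_TO_STRING(ARRAY_AGG(DISTINCT url LIMIT 5), ' ') AS sample_urls
FROM
  `httparchive.summary_requests.2021_12_01_*`
WHERE
  type = 'other'
GROUP BY
  client,
  status
ORDER BY
  freq DESC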
| status | desktop | mobile |
|---|---|---|
| 302 | 52% | 50% |
| 204 | 20% | 21% |
| 200 | 17% | 17% |
| 307 | 3% | 3% |
| 303 | 3% | 3% |
| 0 | 2% | 3% |
| 101 | 2% | 2% |
| 301 | 1% | 1% |
| 304 | 0% | 0% |
| 202 | 0% | 0% |
| 206 | 0% | 0% |
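To make the redirect share explicit, the rounded percentages from the table can be summed per client. This is only a sketch: the set of statuses treated as redirects (301, 302, 303, 307, 308) is an assumption made here, and the inputs are the rounded table values, so the result is approximate.

```python
# Share of `other` requests by HTTP status (percent), copied from the table above.
shares = {
    "desktop": {302: 52, 204: 20, 200: 17, 307: 3, 303: 3, 0: 2,
                101: 2, 301: 1, 304: 0, 202: 0, 206: 0},
    "mobile": {302: 50, 204: 21, 200: 17, 307: 3, 303: 3, 0: 3,
               101: 2, 301: 1, 304: 0, 202: 0, 206: 0},
}

# Assumption: these statuses count as redirects (304 Not Modified is excluded).
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_share(by_status):
    """Sum the percentage shares of all redirect statuses."""
    return sum(pct for status, pct in by_status.items()
               if status in REDIRECT_STATUSES)


for client, by_status in shares.items():
    print(client, "redirects:", redirect_share(by_status), "%",
          "| 204:", by_status[204], "%",
          "| 200:", by_status[200], "%")
```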
That leaves about 17% (the 200s) that should be classifiable. From a spot check of a few of them, about two thirds are requests where the server does not return a response type and the URL has no file extension, so they really are difficult to classify.
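The "difficult to classify" condition can be sketched as a small check: a request offers a usable hint only if it has a non-empty response type or a file extension in the URL path. The function names below are hypothetical, and this is not how the crawl itself assigns types.

```python
from posixpath import splitext
from urllib.parse import urlsplit


def url_extension(url):
    """Return the lowercased file extension of the URL path, or '' if none."""
    path = urlsplit(url).path  # drops the query string and fragment
    return splitext(path)[1].lstrip(".").lower()


def has_type_hint(content_type, url):
    """True if the response type or the URL extension could guide classification."""
    # Strip parameters such as "; charset=utf-8" from the response type.
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    return bool(mime) or bool(url_extension(url))


# Hypothetical examples: no response type and no extension gives no hint.
print(has_type_hint(None, "https://example.com/track"))
print(has_type_hint("application/pdf", "https://example.com/report"))
print(has_type_hint("", "https://example.com/files/report.pdf?v=2"))
```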
The other third are a mixture of octet streams, PDFs, and OCSP responses. These could be classified, but we are getting down to small amounts (33% of 17% of 4% ≈ 0.22% of total requests), so doing so is unlikely to make a material difference.
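The estimate is a product of three rounded shares, which can be checked directly. The variable names are illustrative and the inputs are the rounded figures quoted above:

```python
other_share = 0.04         # `other` requests as a share of all requests
ok_share = 0.17            # 200 responses as a share of `other` requests
classifiable_share = 0.33  # 200s with a usable hint (octet streams, PDFs, OCSP)

# Share of all requests that could realistically be reclassified.
share_of_total = classifiable_share * ok_share * other_share
print(f"{share_of_total:.2%}")  # 0.22%
```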