httparchive.org

Reduce number of unidentified resource types

Open rviscomi opened this issue 5 years ago • 1 comment

44M (4%) of the resources in the latest crawl were unidentified, having the type other. I'd like to investigate how many of these could be mapped to known types using the available metadata.

| type | freq | pct |
|---|---|---|
| image | 468,022,575 | 42.53% |
| script | 319,511,063 | 29.03% |
| css | 114,376,167 | 10.39% |
| html | 87,514,158 | 7.95% |
| font | 53,919,832 | 4.90% |
| other | 44,428,455 | 4.04% |
| text | 7,024,852 | 0.64% |
| video | 3,568,673 | 0.32% |
| audio | 1,115,662 | 0.10% |
| xml | 1,089,915 | 0.10% |

SELECT
  _TABLE_SUFFIX AS client,
  type,
  COUNT(0) AS freq
FROM
  `httparchive.summary_requests.2020_08_01_*`
GROUP BY
  client,
  type
ORDER BY
  freq DESC

rviscomi, Sep 11 '20 18:09

About half of them are 302 redirects (closer to 60% if the 301s, 303s, and 307s are counted too), and another 20% are 204s (i.e. no content).

SELECT
  _TABLE_SUFFIX AS client,
  status,
  COUNT(0) AS freq,
  COUNT(0) / SUM(COUNT(0)) OVER (PARTITION BY _TABLE_SUFFIX) AS pct, 
  ARRAY_TO_STRING(ARRAY_AGG(DISTINCT url LIMIT 5), ' ') AS sample_urls
FROM
  `httparchive.summary_requests.2021_12_01_*`
WHERE
  type = 'other'
GROUP BY
  client,
  status
ORDER BY
  freq DESC
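
| status | desktop | mobile |
|---|---|---|
| 302 | 52% | 50% |
| 204 | 20% | 21% |
| 200 | 17% | 17% |
| 307 | 3% | 3% |
| 303 | 3% | 3% |
| 0 | 2% | 3% |
| 101 | 2% | 2% |
| 301 | 1% | 1% |
| 304 | 0% | 0% |
| 202 | 0% | 0% |
| 206 | 0% | 0% |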
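
To make the grouping explicit, here is a rough Python sketch that sums the status breakdown above into a few classes. The percentages are copied from the breakdown and are already rounded, so the totals are approximate. Which codes count as redirects is my assumption (304 is deliberately left out), and the class names are made up for illustration.

```python
# Percent of type = 'other' requests per status, as (desktop, mobile),
# copied from the breakdown above. Values are rounded in the source.
SHARES = {
    302: (52, 50),
    204: (20, 21),
    200: (17, 17),
    307: (3, 3),
    303: (3, 3),
    0: (2, 3),
    101: (2, 2),
    301: (1, 1),
    304: (0, 0),
    202: (0, 0),
    206: (0, 0),
}

# Assumption: only these codes count as redirects; 304 is left out on purpose.
REDIRECTS = {301, 302, 303, 307, 308}


def status_class(status: int) -> str:
    """Map a status code to a coarse, illustrative class name."""
    if status in REDIRECTS:
        return "redirect"
    if status == 204:
        return "no content"
    if status == 200:
        return "ok"
    return "everything else"


# Accumulate the desktop and mobile shares per class.
totals = {}
for status, (desktop, mobile) in SHARES.items():
    name = status_class(status)
    d, m = totals.get(name, (0, 0))
    totals[name] = (d + desktop, m + mobile)

for name, (desktop, mobile) in totals.items():
    print(f"{name}: desktop {desktop}%, mobile {mobile}%")
```

With these numbers, redirects come to roughly 59% on desktop and 57% on mobile, which leaves the 200s as the only sizeable group with a body to classify.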

That leaves about 17% (the 200s) that should be classified. Checking out a few of them, about two thirds are ones where the server does not return a response type and the URL doesn't have an extension, so they really are difficult to classify.
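
To illustrate why those are hard, here is a minimal Python sketch of a two-signal fallback: try the response's content type first, then the URL extension. This is not how the crawler actually assigns types; the function name `guess_type` and both lookup tables are hypothetical and deliberately tiny.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical lookup tables, for illustration only.
TYPE_BY_MIME = {
    "text/css": "css",
    "text/html": "html",
    "application/javascript": "script",
    "image/png": "image",
}
TYPE_BY_EXT = {".css": "css", ".html": "html", ".js": "script", ".png": "image"}


def guess_type(url: str, content_type: Optional[str]) -> str:
    """Return a known type, or "other" when neither signal is usable."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in TYPE_BY_MIME:
            return TYPE_BY_MIME[mime]
    # Fall back to the extension of the URL path (the query string is ignored).
    ext = PurePosixPath(urlsplit(url).path).suffix.lower()
    return TYPE_BY_EXT.get(ext, "other")


print(guess_type("https://example.com/a.css", None))
print(guess_type("https://example.com/track?id=1", None))
```

The second call has no content type and no extension, so there is nothing left to go on and it stays `other`.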

The other third are a mixture of octet streams, PDFs, and OCSP responses. These could be classified, but we're getting down to small amounts (33% of 17% of 4% ≈ 0.22% of total requests), so it's unlikely to make a material difference.
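
For anyone who wants to check the arithmetic, a quick Python sketch using the rounded shares quoted in this thread (the variable names are mine):

```python
# Rounded shares quoted in this thread.
share_other = 0.04         # "other" as a share of all requests
share_200 = 0.17           # 200 responses as a share of "other"
share_classifiable = 0.33  # roughly one third of those 200s

# Multiply the nested shares to get a share of all requests.
share_of_total = share_classifiable * share_200 * share_other
print(f"{share_of_total:.2%}")  # prints 0.22%
```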

tunetheweb, Jan 03 '22 13:01