httparchive.org Reduce number of unidentified resource types

Open rviscomi opened this issue 5 years ago • 1 comments

44M (4%) resources in the latest crawl were unidentified, having the type other. I'd like to investigate how many of these could be mapped to known types using the available metadata.

type	freq	pct
image	468,022,575	42.53%
script	319,511,063	29.03%
css	114,376,167	10.39%
html	87,514,158	7.95%
font	53,919,832	4.90%
other	44,428,455	4.04%
text	7,024,852	0.64%
video	3,568,673	0.32%
audio	1,115,662	0.10%
xml	1,089,915	0.10%

SELECT
  _TABLE_SUFFIX AS client,
  type,
  COUNT(0) AS freq
FROM
  `httparchive.summary_requests.2020_08_01_*`
GROUP BY
  client,
  type
ORDER BY
  freq DESC
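
The query returns raw counts per client, while the table above also shows a pct column. As a rough sketch of how that column follows from the counts, the Python below recomputes each type's share of all requests. It assumes the freq values are copied by hand from the table (already summed over clients); the names `freq` and `pct_by_type` are illustrative and not part of any HTTP Archive tooling.

```python
# Counts copied from the table above, as reported for the crawl.
freq = {
    "image": 468_022_575,
    "script": 319_511_063,
    "css": 114_376_167,
    "html": 87_514_158,
    "font": 53_919_832,
    "other": 44_428_455,
    "text": 7_024_852,
    "video": 3_568_673,
    "audio": 1_115_662,
    "xml": 1_089_915,
}


def pct_by_type(counts):
    """Return each type's share of all requests, as a percentage."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


pct = pct_by_type(freq)
for name, share in pct.items():
    print(f"{name}\t{share:.2f}%")
```

In BigQuery the same share could presumably be computed with a window sum over the grouped counts, as the later query in this thread does for status codes.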

Sep 11 '20 18:09 rviscomi

Half of them are 302 redirects, and another 20% are 204s (i.e. no content).

SELECT
  _TABLE_SUFFIX AS client,
  status,
  COUNT(0) AS freq,
  COUNT(0) / SUM(COUNT(0)) OVER (PARTITION BY _TABLE_SUFFIX) AS pct, 
  ARRAY_TO_STRING(ARRAY_AGG(DISTINCT url LIMIT 5), ' ') AS sample_urls
FROM
  `httparchive.summary_requests.2021_12_01_*`
WHERE
  type = 'other'
GROUP BY
  client,
  status
ORDER BY
  freq DESC

status	desktop	mobile
302	52%	50%
204	20%	21%
200	17%	17%
307	3%	3%
303	3%	3%
0	2%	3%
101	2%	2%
301	1%	1%
304	0%	0%
202	0%	0%
206	0%	0%
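
To see how the rows above add up per status class, here is a small Python sketch. It assumes the rounded percentages are copied by hand from the table, so the totals are only as precise as that rounding; `shares` and `by_class` are illustrative names.

```python
from collections import defaultdict

# Rounded percentages copied from the table above: status -> (desktop, mobile).
shares = {
    302: (52, 50), 204: (20, 21), 200: (17, 17), 307: (3, 3), 303: (3, 3),
    0: (2, 3), 101: (2, 2), 301: (1, 1), 304: (0, 0), 202: (0, 0), 206: (0, 0),
}


def by_class(table):
    """Sum the shares per status class; status 0 stays its own bucket."""
    totals = defaultdict(lambda: [0, 0])
    for status, (desktop, mobile) in table.items():
        label = f"{status // 100}xx" if status else "0"
        totals[label][0] += desktop
        totals[label][1] += mobile
    return {label: tuple(pair) for label, pair in totals.items()}


classes = by_class(shares)
print(classes)
```

With these rounded inputs, the 3xx responses add up to 59% on desktop and 57% on mobile, and the 2xx responses to 37% and 38%.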

That leaves about 17% that should be classified (200s). Checking a few of them, about two thirds are ones where the server does not return a content type and the URL doesn't have an extension, so they really are difficult to classify.
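
The two conditions described above (nothing returned by the server to indicate the type, and no extension in the URL) can be checked mechanically. The following is a minimal Python sketch, not the classification logic HTTP Archive actually uses; `available_signal` and its arguments are hypothetical names, and the example URLs are made up.

```python
import posixpath
from urllib.parse import urlsplit


def available_signal(content_type, url):
    """Say which metadata, if any, could be used to classify a response."""
    if content_type and content_type.strip():
        return "content-type"
    # Look at the path only, so a query string cannot hide or fake an extension.
    ext = posixpath.splitext(urlsplit(url).path)[1]
    if ext:
        return "extension"
    return "none"


# Hypothetical examples of the three cases.
print(available_signal("application/pdf", "https://example.com/doc"))
print(available_signal("", "https://example.com/a/b.pdf?x=1"))
print(available_signal(None, "https://example.com/collect?id=1"))
```

Requests that fall into the last case are the ones with no usable metadata at all.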

The other third are a mixture of octet streams, PDFs, and OCSP responses. These could be classified, but that gets down to small amounts (33% of 17% of 4% = 0.22% of total requests), so it is unlikely to make a material difference.
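
The estimate above is a product of three shares. A quick Python check, using the rounded figures quoted in this thread (the variable names are illustrative):

```python
other_share = 0.04    # "other" requests as a share of all requests
ok_share = 0.17       # 200 responses as a share of "other" requests
classifiable = 0.33   # share of those 200s that carry usable metadata

gain = classifiable * ok_share * other_share
print(f"{gain:.2%} of total requests")
```

Because every input is rounded, the result should be read as an order of magnitude rather than an exact figure.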

Jan 03 '22 13:01 tunetheweb
