wappalyzer
Focus on high scale technologies and improve rules confidence
@rviscomi @pmeenan @tunetheweb to wrap up the topic of maintenance efforts... Would this idea be helpful?
Is your feature request related to a problem? Please describe.
The list of technologies has grown to more than 3K entries.
To keep improving the scale and quality of the insights provided by HTTP Archive crawls, it may be better to focus on the most impactful technologies.
Describe the solution you'd like
- The HTTP Archive team could define the profile of technologies that serves the goals of the crawls.
- Here is an example of an analysis that could be quickly verified on a monthly basis:
  - changes in tech popularity (possibly due to the tech lifecycle, or to degraded rule freshness, quality, or completeness),
  - no rules are maintained for technologies that fall below a popularity threshold (deprecate them?).
E.g.:
WITH tech_report AS (
  -- Distinct root pages per technology in each of the two crawls
  SELECT
    tech.technology,
    COUNT(DISTINCT IF(date = "2024-03-01", root_page, NULL)) AS pages_20240301,
    COUNT(DISTINCT IF(date = "2024-04-01", root_page, NULL)) AS pages_20240401
  FROM `httparchive.all.pages` AS t
  CROSS JOIN UNNEST(t.technologies) AS tech
  WHERE
    date IN ("2024-03-01", "2024-04-01") -- scan only the two crawls being compared
    AND client = 'desktop'
    AND is_root_page
  GROUP BY 1
),
tech_list AS (
  -- Technologies that currently have detection rules
  SELECT DISTINCT
    name AS technology
  FROM `max-ostapenko.wappalyzer.apps` -- to migrate to httparchive project
)
SELECT
  COALESCE(tech_list.technology, tech_report.technology) AS technology,
  pages_20240301,
  pages_20240401,
  -- Change between the crawls, relative to the April count (a fraction, not a percentage)
  ROUND(1 - SAFE_DIVIDE(pages_20240301, pages_20240401), 2) AS diff_perc,
  -- NULL means the technology is in the list but was never detected
  (pages_20240401 <= 100 OR pages_20240401 IS NULL) AS low_reach
FROM tech_list
FULL OUTER JOIN tech_report
  ON tech_list.technology = tech_report.technology
ORDER BY
  pages_20240301 DESC,
  pages_20240401 ASC
Obviously, the final reports should be actionable (example), and they should probably be extended to cover 3-4 months to increase confidence.
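To illustrate how a 3-4 month window could raise confidence, here is a minimal Python sketch. It flags a technology as low reach only if it stays at or below the threshold in every month, and flags a drop when the last month is well below the first. The 100-page threshold matches the query above; the 50% drop threshold, the function name, and the sample counts are hypothetical and only there for illustration.

```python
from typing import Dict, List, Optional

LOW_REACH_THRESHOLD = 100  # same cut-off as the query above
DROP_THRESHOLD = 0.5       # hypothetical: flag a loss of 50% or more


def classify(counts: List[Optional[int]]) -> Dict[str, bool]:
    """Classify one technology from monthly page counts, oldest first.

    None means the technology was not detected at all in that month.
    """
    filled = [c or 0 for c in counts]
    # Low reach only if every month in the window is at or below the threshold.
    low_reach = all(c <= LOW_REACH_THRESHOLD for c in filled)
    first, last = filled[0], filled[-1]
    # Relative loss between the first and last month of the window.
    dropped = first > 0 and (first - last) / first >= DROP_THRESHOLD
    return {"low_reach": low_reach, "dropped": dropped}


# Made-up counts for four consecutive monthly crawls.
monthly_pages = {
    "Tech A": [12000, 11800, 12100, 12050],
    "Tech B": [900, 700, 300, 120],
    "Tech C": [40, None, 35, 38],
}

report = {name: classify(counts) for name, counts in monthly_pages.items()}
```

In this sample only "Tech B" is flagged as dropped and only "Tech C" as low reach, so a single noisy month does not trigger either flag on its own.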
Additional context
- Assists with analysis of particularly noticeable web trends.
- Makes issues more visible.
- Slightly faster tech detection in crawls.
- The BQ tech list table can be updated on PR merge.
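As a rough sketch of the last point, a step run on PR merge could flatten the technology definitions into newline-delimited JSON, a format that BigQuery load jobs accept. This assumes the definitions are available as a JSON object keyed by technology name; only the `name` column comes from the query above, while the other columns, the function names, and the sample entries are hypothetical. The actual upload step is left out.

```python
import json
from typing import Dict, Iterator


def to_rows(technologies: Dict[str, dict]) -> Iterator[dict]:
    """Yield one flat row per technology, sorted by name."""
    for name, definition in sorted(technologies.items()):
        yield {
            "name": name,  # column used by the tech_list CTE above
            "cats": definition.get("cats", []),      # hypothetical column
            "website": definition.get("website"),    # hypothetical column
        }


def to_ndjson(technologies: Dict[str, dict]) -> str:
    """Serialize the rows as newline-delimited JSON, one row per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in to_rows(technologies))


# Made-up definitions standing in for the merged rule files.
sample = {
    "Example CMS": {"cats": [1], "website": "https://example.com"},
    "Example Analytics": {"cats": [10]},
}

ndjson = to_ndjson(sample)
```

Each line of `ndjson` is one table row, so the same output can be regenerated in full on every merge and used to replace the table contents.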