Focus on high scale technologies and improve rules confidence

Open max-ostapenko opened this issue 9 months ago • 0 comments

@rviscomi @pmeenan @tunetheweb to wrap up the topic of maintenance efforts: would this idea be helpful?

Is your feature request related to a problem? Please describe.

Currently the list of technologies has grown to more than 3K entries.

To keep improving the scale and quality of the insights provided by HTTP Archive crawls, it may be better to focus on the most impactful technologies.

Describe the solution you'd like

  1. The HTTP Archive team could define a profile of the technologies that serve the goals of the crawls.

  2. Here is an example of analysis that could be quickly verified on a monthly basis:

  • changes in technology popularity (which may be due to the technology lifecycle or to degraded rule freshness, quality, or completeness),
  • no rules are maintained for technologies below a popularity threshold (deprecated?).

E.g.:

-- Page counts per technology for the two crawls being compared
WITH tech_report AS (
  SELECT
    tech.technology,
    COUNT(DISTINCT IF(date = "2024-03-01", root_page, NULL)) AS pages_20240301,
    COUNT(DISTINCT IF(date = "2024-04-01", root_page, NULL)) AS pages_20240401
  FROM `httparchive.all.pages` AS t
  CROSS JOIN UNNEST (t.technologies) AS tech
  WHERE
    -- scan only the two crawls that are counted, not every later one
    date IN ("2024-03-01", "2024-04-01")
    AND client = 'desktop'
    AND is_root_page = TRUE
  GROUP BY 1
),
-- Every technology that currently has detection rules
tech_list AS (
  SELECT
    DISTINCT name AS technology
  FROM `max-ostapenko.wappalyzer.apps` -- to migrate to httparchive project
)

SELECT
  COALESCE(tech_list.technology, tech_report.technology) AS technology,
  pages_20240301,
  pages_20240401,
  -- change between the crawls as a fraction of the April count
  ROUND(1 - SAFE_DIVIDE(pages_20240301, pages_20240401), 2) AS diff_perc,
  -- technologies with rules but no detections have NULL counts
  COALESCE(pages_20240401, 0) <= 100 AS low_reach
FROM tech_list
FULL OUTER JOIN tech_report
ON tech_list.technology = tech_report.technology
ORDER BY
  pages_20240301 DESC,
  pages_20240401 ASC

The final reports should of course be actionable (example), and should probably cover 3-4 months to increase confidence.
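As a rough illustration of the multi-month idea, the following Python sketch classifies one technology from its monthly page counts once they have been exported from a query like the one above. The function name `classify`, the `max_drop` parameter, and the label names are hypothetical; only the threshold of 100 pages comes from the query.

```python
from typing import List, Optional

LOW_REACH_THRESHOLD = 100  # same cut-off as the low_reach column in the query


def classify(counts: List[Optional[int]], max_drop: float = 0.5) -> str:
    """Label a technology from its page counts, oldest month first.

    None means the technology was not detected in that month.
    max_drop is an assumed tolerance: the relative drop between the
    first and last month that is treated as worth reviewing.
    """
    filled = [c or 0 for c in counts]

    # Below the threshold in every month: candidate for deprecation.
    if all(c <= LOW_REACH_THRESHOLD for c in filled):
        return "low_reach"

    # Large drop across the window: rules may be stale or incomplete.
    first, last = filled[0], filled[-1]
    if first > 0 and last < first * (1 - max_drop):
        return "declining"

    return "ok"


if __name__ == "__main__":
    print(classify([50, None, 80]))     # low in all three months
    print(classify([1000, 800, 400]))   # dropped by more than half
    print(classify([1000, 950, 990]))   # stable
```

Requiring the condition to hold over the whole window, rather than in a single month-to-month comparison, is what reduces noise from one unusual crawl.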

Additional context

  1. Assists with analysis of particularly noticeable web trends.
  2. Makes issues more visible.
  3. Slightly faster technology detection in crawls.
  4. The BQ tech list table can be updated on PR merge.
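For the last point, a merge-triggered job could regenerate the tech list from the rule files. The Python sketch below assumes the definitions are JSON objects keyed by technology name and stored as `*.json` files in one directory; the helper names, the directory argument, and the single `name` column are illustrative assumptions, and the actual upload to BigQuery is left out.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def load_definitions(directory: Path) -> Dict[str, dict]:
    """Merge all *.json files in a directory into one name -> definition map.

    Assumes each file is a JSON object keyed by technology name.
    """
    merged: Dict[str, dict] = {}
    for path in sorted(directory.glob("*.json")):
        merged.update(json.loads(path.read_text(encoding="utf-8")))
    return merged


def tech_rows(definitions: Dict[str, dict]) -> List[dict]:
    """One row per technology, matching the `name` column the query reads."""
    return [{"name": name} for name in sorted(definitions)]


def to_ndjson(rows: Iterable[dict]) -> str:
    """Serialize rows as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


if __name__ == "__main__":
    sample = {"jQuery": {"cats": [59]}, "WordPress": {}}
    print(to_ndjson(tech_rows(sample)))
```

Newline-delimited JSON is used here because it is a format BigQuery load jobs accept; the table name and load step would depend on how the project is set up.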

max-ostapenko, May 12 '24 22:05