
Privacy 2024 queries

Open max-ostapenko opened this issue 10 months ago • 5 comments

Analysis plan details

Queries

Bounce tracking:

  • [x] number_of_websites_with_bounce_tracking.sql

CNAME:

  • [x] most_common_cname_domains.sql

IAB consent frameworks:

  • [x] most_common_countries_for_iab_tcf_v2.sql
  • [x] most_common_referrer_policy.sql
  • [x] most_common_strings_for_iab_usp.sql
  • [x] number_of_websites_with_iab.sql

GPC prevalence:

  • [x] number_of_websites_with_gpc.sql

CMPs presence:

  • [x] most_common_cmps_for_iab_tcf_v2.sql

ads.txt & sellers.json:

  • [x] ads_and_sellers_graph.sql
  • [x] ads_lines_amount.sql
  • [x] ads_seller_accounts_by_type.sql
  • [x] common_ads_variables.sql
  • [x] top_direct_sellers.sql

Privacy Sandbox:

  • [x] number_of_websites_with_related_origin_trials.sql
  • [x] privacy-sandbox-adoption-by-third-parties-by-publishers.sql
  • [x] number_of_privacy_sandbox_attested_domains.sql
  • [x] number_of_ara_destinations_registered_by_third_parties_and_publishers.sql
  • [x] top_ara_destinations_registered_by_most_publishers.sql
  • [x] top_ara_destinations_registered_by_most_third_parties.sql

CCPA:

  • [x] ccpa_most_common_phrases.sql
  • [x] ccpa_prevalence.sql

Fingerprinting:

  • [x] fingerprinting_most_common_apis.sql
  • [x] fingerprinting_most_common_scripts.sql
  • [x] fingerprinting_script_count.sql

Cookies:

  • [x] cookies_top_first_party.sql
  • [x] cookies_top_third_party.sql

Other:

  • [x] number_of_websites_with_dnt.sql
  • [x] most_common_client_hints.sql
  • [x] number_of_websites_per_tracking_technology.sql
  • [x] number_of_websites_with_client_hints.sql
  • [x] number_of_websites_with_privacy_service.sql
  • [x] number_of_websites_with_referrerpolicy.sql
  • [x] number_of_websites_with_related_origin_trials.sql
  • [x] number_of_websites_with_whotracksme_trackers.sql
  • [x] easylist_tracker_detection.sql

Functions

  • [x] httparchive.fn.DECODE_ORIGIN_TRIAL
  • [x] httparchive.fn.PARSE_ORIGIN_TRIAL

Scripts

  • [x] ads_parser.py - Parses and evaluates Google's ads.txt, which weighs >=100 MB
  • [x] populate_easylist_adserver.py
  • [x] whotracksme_trackers.py updated

max-ostapenko, May 03 '24 14:05

@max-ostapenko

  • Yeah, this probably should’ve had a limit on it. For visualization, though, we can just take the top 5-10 from each grouping to display.
  • Do you have any ideas for figuring out which URL that happens on? Not sure how that could happen unless the custom_metrics object is malformed, but I could just add a try/catch to ignore those cases.

bstandaert-wustl, Aug 17 '24 12:08

> Do you have any ideas for figuring out which URL that happens on? Not sure how that could happen unless the custom_metrics object is malformed, but I could just add a try/catch to ignore those cases.

Return an error description from the catch block in the UDF. You'll then be able to see the scale of the issue in the query results, and debug individual URLs by filtering on the error description strings:

SELECT client, fingerprinting_type, page
FROM pages
WHERE fingerprinting_type LIKE '%Error%'
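
As a rough illustration of the suggestion above, the JavaScript body of such a UDF might look something like the sketch below. The function name `getFingerprintingType`, the `fingerprinting` field inside custom_metrics, and the `Error: ` prefix are assumptions made for this example and are not taken from the actual Almanac queries.

```javascript
// Hypothetical UDF body: the function name, the `fingerprinting` field and the
// "Error: " prefix are illustrative assumptions, not the real query code.
function getFingerprintingType(customMetricsJson) {
  try {
    const metrics = JSON.parse(customMetricsJson);
    const value = metrics.fingerprinting;
    if (value === undefined || value === null) {
      // Valid JSON, but the expected field is absent.
      return 'Error: missing fingerprinting field';
    }
    return String(value);
  } catch (e) {
    // Return the failure as data instead of throwing, so the row survives
    // and can be found later with LIKE '%Error%'.
    return 'Error: ' + e.message;
  }
}
```

Because failures come back as ordinary strings, counting the rows that match `'%Error%'` shows how widespread the problem is, and the `page` column of those rows points at the individual URLs to debug.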

max-ostapenko, Aug 17 '24 13:08

@hadiamjad could you please add the query you used to create the Disconnect reports? Did you update easylist-tracker-detection.sql for this?

max-ostapenko, Oct 01 '24 18:10

@max-ostapenko RE your review request, I don't have time to review all the queries; is there something specific you want me to look at?

bstandaert-wustl, Oct 01 '24 19:10

@bstandaert-wustl sure, please take a look at bounce tracking, CNAME and something from Privacy Sandbox.

max-ostapenko, Oct 01 '24 19:10