whotracks.me
Clarification on how to get a list of domains associated with fingerprinting
I've been working through the various data options trying to compile a list of domains that are classified as fingerprinting. I'm getting mixed results and was wondering if you could clarify what you consider the canonical approach.
Apologies if I'm just misreading the documentation. I'm happy to submit a PR to docs if you think it would be useful.
I can use the DataSource and get a list of tracker IDs as follows:
from whotracksme.data.loader import DataSource

fp_trackers = set()
regions = {'de', 'eu', 'fr', 'global', 'us'}
for region in regions:
    who_tracks_data = DataSource(region=region)
    # Trackers whose fingerprinting score (bad_qs) crosses the 0.1 threshold in this region.
    who_tracks_fp = who_tracks_data.trackers.df[who_tracks_data.trackers.df.bad_qs > 0.1]
    fp_trackers.update(list(who_tracks_fp.tracker.values))
This gives me 193 trackers. I can then map these to domains using the mapping produced by create_tracker_map:
could_not_find = []
domains = set()
# tracker_info is the mapping produced by create_tracker_map (see above).
for tracker in fp_trackers:
    try:
        domains.update(tracker_info['trackers'][tracker]['domains'])
    except KeyError:
        could_not_find.append(tracker)
This will give me 326 domains.
If I take a different route and read in all the domains.csv files under the assets folders, I can get a list of domains like this:
import pandas as pd

domains_df = pd.concat([
    pd.read_csv(file, parse_dates=['month'])
    for file in asset_paths['domains']  # I have previously assembled all the paths
])
fingerprinting_trackers = domains_df[domains_df.bad_qs > 0.1].host_tld.unique()
But this gives me a list of 292 domains.
I can think of an explanation for this: not every host_tld may have a bad_qs that meets the threshold, but they have been added to the tracker map for other reasons.
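To sanity-check this, I can diff the two sets directly. A rough sketch, reusing the domains set and the fingerprinting_trackers array from above, and assuming both lists are at the same hostname granularity:

# Compare the 326 domains from the tracker-map route with the
# 292 hosts from the domains.csv route.
mapped_domains = set(domains)
csv_domains = set(fingerprinting_trackers)
print(len(mapped_domains - csv_domains), len(csv_domains - mapped_domains))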
However, given that the other csv files may also be relevant, I was starting to lose confidence and so wanted to check in.
Many thanks in advance for your help.
The domains.csv and trackers.csv files represent different aggregations of the same data. If we consider the fingerprinting case:
- domains.csv counts the proportion of times each hostname (at the TLD+1 level) was seen sending a fingerprint (or suspected fingerprint) in a third-party context on a page.
- trackers.csv counts that proportion across any of the hostnames associated with a tracker, based on the mapping in the tracker database.
For the majority of trackers the relationship between domains and trackers is one-to-one. For others, the domains file shows which domains fingerprinting data is sent to, while the trackers view gives a more aggregated picture of what the tracker is doing.
For example, Facebook uses facebook.net as a CDN, and the stats show little evidence of tracking on this domain. The tracking requests are aimed at facebook.com, where they have the user's login cookie. In the tracker view we report both domains together, which gives an aggregate view of Facebook's third-party traffic.
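To make the relationship concrete, here is a rough sketch of how you could roll the domain-level view up to a tracker-level view using the tracker map from your snippet. It is only illustrative: it assumes the tracker_info mapping and domains_df dataframe from your snippets above, and takes an unweighted mean of bad_qs, whereas the published trackers.csv figures are aggregated from the underlying data rather than by averaging per-domain proportions.

# Invert the tracker map: hostname -> tracker id.
domain_to_tracker = {
    domain: tracker_id
    for tracker_id, info in tracker_info['trackers'].items()
    for domain in info['domains']
}

# Attach a tracker id to each host in the domain-level data
# (hosts not in the map become NaN and are dropped by groupby).
domains_df['tracker'] = domains_df.host_tld.map(domain_to_tracker)

# Illustrative tracker-level roll-up: unweighted mean of bad_qs across a tracker's domains.
per_tracker = domains_df.groupby('tracker').bad_qs.mean()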
I hope that clears things up a little for you. From your use case it looks like the domains.csv data view would fit better.
Hi @birdsarah, many thanks for the PR and issues raised. domains.csv is currently not exposed via the API. If you'd find this useful, you can add this to loader.py, extending our API. Here's one way to do it:
class Domains(PandasDataLoader):
    def __init__(self, data_months, region="global"):
        super().__init__(data_months, name="domains", region=region)
Then add this to the DataSource class, still in loader.py:
...
self.domains = Domains(
    data_months=self.data_months,
    region=region
)
Now you can consume domains via the DataSource:
data = DataSource(region="global")
domains = data.domains.df
where domains would be a pandas DataFrame of all months for which domains.csv is available.
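As a quick illustration, the filtering from your first snippet then becomes a one-liner against this new view, using the same bad_qs > 0.1 threshold:

# Hostnames whose fingerprinting score crosses the threshold.
fp_domains = domains[domains.bad_qs > 0.1].host_tld.unique()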
Thanks so much for this feedback @sammacbeth @ecnmst. This is extremely helpful. I'll leave this open and plan to make the addition to loader.py that @ecnmst proposes.
But if, on reflection, you don't want the update to loader.py, feel free to close the issue.