whotracks.me
Clarification on how to get a list of domains associated with fingerprinting
I've been working through the various data options trying to compile a list of domains that are classified as fingerprinting. I'm getting mixed results and was wondering if you could clarify what you consider the canonical approach.
Apologies if I'm just misreading the documentation. I'm happy to submit a PR to docs if you think it would be useful.
I can use the DataSource and get a list of tracker IDs as follows:
from whotracksme.data.loader import DataSource

fp_trackers = set()
regions = {'de', 'eu', 'fr', 'global', 'us'}
for region in regions:
    who_tracks_data = DataSource(region=region)
    # Trackers whose fingerprinting score (bad_qs) crosses the 0.1 threshold in this region.
    who_tracks_fp = who_tracks_data.trackers.df[who_tracks_data.trackers.df.bad_qs > 0.1]
    fp_trackers.update(list(who_tracks_fp.tracker.values))
This gives me 193 trackers. I can then map these to domains using the mapping produced by create_tracker_map:
could_not_find = []
domains = set()
# tracker_info is the mapping produced by create_tracker_map (see above).
for tracker in fp_trackers:
    try:
        domains.update(tracker_info['trackers'][tracker]['domains'])
    except KeyError:
        could_not_find.append(tracker)
This will give me 326 domains.
If I take a different route and read in all the domains.csv files under the assets folders, I can get a list of domains like this:
import pandas as pd

domains_df = pd.concat([
    pd.read_csv(file, parse_dates=['month'])
    for file in asset_paths['domains']  # I have previously assembled all the paths
])
fingerprinting_trackers = domains_df[domains_df.bad_qs > 0.1].host_tld.unique()
But this gives me a list of 292 domains.
I can think of an explanation for this: not every host_tld may have a bad_qs that meets the threshold, but they have been added to the tracker map for other reasons.
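To sanity-check this, I can diff the two sets directly. A rough sketch, reusing the domains set and the fingerprinting_trackers array from above, and assuming both lists are at the same hostname granularity:

# Compare the 326 domains from the tracker-map route with the
# 292 hosts from the domains.csv route.
mapped_domains = set(domains)
csv_domains = set(fingerprinting_trackers)
print(len(mapped_domains - csv_domains), len(csv_domains - mapped_domains))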
However, given that the other csv files may also be relevant, I was starting to lose confidence and so wanted to check in.
Many thanks in advance for your help.
The domains.csv and trackers.csv files represent different aggregations of the same data. If we consider the fingerprinting case:
- domains.csv counts the proportion of times each hostname (at the TLD+1 level) was seen sending a fingerprint (or suspected fingerprint) in a third-party context on a page.
- trackers.csv counts that proportion across any of the hostnames associated with a tracker, based on the mapping in the tracker database.
For the majority of trackers the relationship between domains and trackers is one-to-one. For others, the domains file shows which domains fingerprinting data is sent to, while the trackers view gives a more aggregated picture of what the tracker is doing.
For example, Facebook uses facebook.net as a CDN, and the stats show little evidence of tracking on this domain. The tracking requests are aimed at facebook.com, where they have the user's login cookie. In the tracker view we report both domains together, which gives an aggregate view of Facebook's third-party traffic.
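To make the relationship concrete, here is a rough sketch of how you could roll the domain-level view up to a tracker-level view using the tracker map from your snippet. It is only illustrative: it assumes the tracker_info mapping and domains_df dataframe from your snippets above, and takes an unweighted mean of bad_qs, whereas the published trackers.csv figures are aggregated from the underlying data rather than by averaging per-domain proportions.

# Invert the tracker map: hostname -> tracker id.
domain_to_tracker = {
    domain: tracker_id
    for tracker_id, info in tracker_info['trackers'].items()
    for domain in info['domains']
}

# Attach a tracker id to each host in the domain-level data
# (hosts not in the map become NaN and are dropped by groupby).
domains_df['tracker'] = domains_df.host_tld.map(domain_to_tracker)

# Illustrative tracker-level roll-up: unweighted mean of bad_qs across a tracker's domains.
per_tracker = domains_df.groupby('tracker').bad_qs.mean()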
I hope that clears things up a little for you. From your use case it looks like the domains.csv data view would fit better.
Hi @birdsarah, many thanks for the PR and issues raised. domains.csv is currently not exposed via the API. If you'd find this useful, you can add this to loader.py, extending our API. Here's one way to do it:
class Domains(PandasDataLoader):
    def __init__(self, data_months, region="global"):
        super().__init__(data_months, name="domains", region=region)
Then add this to the DataSource class, still in loader.py:
...
self.domains = Domains(
    data_months=self.data_months,
    region=region
)
Now you can consume domains via the DataSource:
data = DataSource(region="global")
domains = data.domains.df
where domains would be a pandas DataFrame of all months for which domains.csv is available.
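As a quick illustration, the filtering from your first snippet then becomes a one-liner against this new view, using the same bad_qs > 0.1 threshold:

# Hostnames whose fingerprinting score crosses the threshold.
fp_domains = domains[domains.bad_qs > 0.1].host_tld.unique()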
Thanks so much for this feedback @sammacbeth @ecnmst. This is extremely helpful. I'll leave this open and plan to make the addition to loader.py that @ecnmst proposes.
But if, on reflection, you don't want the update to loader.py, feel free to close the issue.