tracker-radar-collector
Early browser API accesses and function calls are missed
Hi! While running some pilot crawls for our current study, we found that the TRC doesn’t collect function calls or property accesses when the call/access occurs immediately after page load. Perhaps APICallCollector doesn’t have enough time to register its breakpoints before the page’s scripts run. To test this issue, we have created two test pages that
- Access window.devicePixelRatio
- Call the toDataURL method of an HTML5 canvas element
We’ve visited the test pages using the latest version of TRC without any modification.
- Test page 1: The script is run 1000 ms after the page load.
- Command:
npm run crawl -- -u "https://homes.esat.kuleuven.be/~asenol/fp-test-with-timeout/" -o ./data/ -v -f -d 'apis'
- In this case, the TRC correctly intercepts the API call and property access.
- Test page 2: The script is run immediately after the page load.
- Command:
npm run crawl -- -u "https://homes.esat.kuleuven.be/~asenol/fp-test-without-timeout/" -o ./data/ -v -f -d 'apis'
- In this case, the TRC couldn’t intercept the API call and the property access.
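For reference, here is a minimal sketch of what the script on the two test pages does. It is a reconstruction for illustration only, not the exact code served by the pages: the function names `runFingerprintProbe` and `scheduleProbe` are made up, and `win`/`doc` are passed in as parameters only so the snippet can also be exercised outside a browser.

```javascript
// Hypothetical reconstruction of the test-page script; the real pages may differ.

// Performs the two operations the crawl is expected to record.
function runFingerprintProbe(win, doc) {
    const ratio = win.devicePixelRatio;          // property access
    const canvas = doc.createElement('canvas');
    const dataUrl = canvas.toDataURL();          // method call
    return { ratio, dataUrl };
}

// delayMs = 1000 mimics test page 1, delayMs = 0 mimics test page 2.
function scheduleProbe(win, doc, delayMs) {
    if (delayMs > 0) {
        win.setTimeout(() => runFingerprintProbe(win, doc), delayMs);
    } else {
        runFingerprintProbe(win, doc);
    }
}

// Only runs in a browser, where window and document exist.
if (typeof window !== 'undefined' && typeof document !== 'undefined') {
    scheduleProbe(window, document, 1000);
}
```

The only difference between the two pages is the delay before the probe runs.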
I hope this helps. If you need any other info, just let me know.
Hey @asumansenol, thanks for bringing this up!
I observed the same with our API collection integration test (https://github.com/duckduckgo/tracker-radar-collector/blob/main/tests/integration/apiCollection.test.js), which is somewhat flaky because of this issue.
I suspect a race condition between the API collection script setting things up (https://github.com/duckduckgo/tracker-radar-collector/blob/main/collectors/APICalls/TrackerTracker.js#L126) and scripts on the page that are already running.
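To illustrate the kind of race I mean, here is a toy model. It is not TRC code: `simulateCrawl`, the fake `page` object, and the timings are all hypothetical, and the "instrumentation" is a plain function wrapper rather than the breakpoint mechanism the collector actually uses. It only shows that a call made before the instrumentation is installed goes unrecorded.

```javascript
// Toy model of the suspected race; all names and timings are hypothetical.
// scriptAtMs:         when the page script calls the API
// collectorReadyAtMs: when the collector finishes installing its instrumentation
function simulateCrawl(scriptAtMs, collectorReadyAtMs) {
    const recorded = [];
    const page = { toDataURL: () => 'data:,' };

    const events = [
        {
            at: collectorReadyAtMs,
            run: () => {
                // Wrap the API so that later calls are recorded.
                const original = page.toDataURL;
                page.toDataURL = (...args) => {
                    recorded.push('toDataURL');
                    return original(...args);
                };
            },
        },
        {
            at: scriptAtMs,
            run: () => { page.toDataURL(); },
        },
    ];

    // Replay the two events in chronological order.
    events.sort((a, b) => a.at - b.at);
    for (const event of events) {
        event.run();
    }
    return recorded;
}

console.log(simulateCrawl(1000, 50)); // script runs after setup: call is recorded
console.log(simulateCrawl(0, 50));    // script runs before setup: call is missed
```

The first case corresponds to the test page with the 1000 ms timeout, the second to the page without it.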
This is not a huge issue for the DDG use case: in most cases everything is ready before third-party requests load and execute, and we operate on a huge sample of sites. Still, I can see how this is not precise enough for other use cases.
I suspect this is fixable - I'll give it a shot next week and let you know.
Sorry, still no solution to this. @muodov is updating APICollector for better attribution (https://github.com/duckduckgo/tracker-radar-collector/pull/90), but it doesn't seem to have an effect on this issue. I suspect the solution here is to block scripts from running before all collectors are fully set up. This can be done e.g. via Debugger.pause as soon as the page starts loading.
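A rough sketch of that idea, to make it concrete. This is untested against a real browser and is not a confirmed fix. It assumes `cdp` is a puppeteer-style CDP session exposing `send(method)` and `on(event, handler)`; `initCollectorsWhilePaused` and the `initCollectors` callback are hypothetical names, not existing TRC functions.

```javascript
// Sketch only, under the assumptions stated above.
// Requests a debugger pause, sets up collectors, then lets scripts continue.
async function initCollectorsWhilePaused(cdp, initCollectors) {
    let ready = false;
    let paused = false;

    cdp.on('Debugger.paused', () => {
        paused = true;
        // If the requested pause only lands after setup has finished,
        // there is nothing left to wait for, so resume right away.
        // Note: this sketch resumes ANY pause that arrives after setup;
        // code that relies on real pauses would need to tell them apart.
        if (ready) {
            paused = false;
            cdp.send('Debugger.resume').catch(() => {});
        }
    });
    cdp.on('Debugger.resumed', () => { paused = false; });

    await cdp.send('Debugger.enable');
    // Debugger.pause asks the debugger to stop on the next JavaScript statement.
    await cdp.send('Debugger.pause');

    try {
        await initCollectors(cdp);
    } finally {
        ready = true;
        // Resume only if execution is actually paused at this point.
        if (paused) {
            await cdp.send('Debugger.resume');
        }
    }
}
```

Whether the pause takes effect early enough on every new page, and how it interacts with the breakpoints the API collector sets, is exactly what would need to be verified.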
There seems to be a problem with RequestCollector and the latest Chromium as well. I'm currently investigating, but I don't have a concrete solution yet.
I think this is basically the same problem as described in https://github.com/puppeteer/puppeteer/issues/8507. This was fixed in puppeteer last year, but unfortunately it is incompatible with our current CDP usage, as I mentioned in https://github.com/duckduckgo/tracker-radar-collector/issues/84#issuecomment-1452230159. We're exploring different options to fix this at the moment.