
Duplicate request_id

Open nrllh opened this issue 3 years ago • 9 comments

I noticed in my dataset that the same request_id was assigned to different requests (although it's rare). As a result, a request_id in the callstacks table cannot be unambiguously matched to a single request.

It is particularly important that I find the right request_id for call stacks. Depending on the timestamp, I could take the first request (after the last request in the callstack), but I'm not sure if it's a reliable solution. Do you have an idea how I can work around the problem?

Here is an example I have in my dataset:

site_id subpage_id url top_level_url method referrer headers is_XHR is_third_party_channel is_third_party_to_top_window resource_type time_stamp is_websocket body etld content_hash is_tracker is_background_req in_scope window_id tab_id frame_id parent_frame_id frame_ancestors request_id triggering_origin loading_origin loading_href req_call_stack post_body post_body_raw url_scope global_uniq_id
47 0 https://contextual.media.net/cksync.php?cs=1&type=vzn&ovsid={{APID}}&redirect=https%3A%2F%2Fpixel.advertising.com%2Fups%2F58222%2Fsync%3F_origin%3D1%26uid%3D%24UID https://www.msn.com/de-de/ GET https://contextual.media.net/checksync.php?&vsSync=1&cs=1&hb=1&cv=37&ndec=1&cid=8HBSKZM1Y&prvid=77%2C117%2C184%2C188%2C203%2C226%2C246%2C2030%2C2033%2C3018&itype=HB-CM&rtime=9&https=1&gdpr=1&gdprconsent=1&usp_status=0&usp_consent=1&dcfp=gdpr,usp [["Host","contextual.media.net"],["User-Agent","Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0"],["Accept","image/avif,image/webp,/"],["Accept-Language","en-US,en;q=0.5"],["Accept-Encoding","gzip, deflate, br"],["Referer","https://contextual.media.net/checksync.php?&vsSync=1&cs=1&hb=1&cv=37&ndec=1&cid=8HBSKZM1Y&prvid=77%2C117%2C184%2C188%2C203%2C226%2C246%2C2030%2C2033%2C3018&itype=HB-CM&rtime=9&https=1&gdpr=1&gdprconsent=1&usp_status=0&usp_consent=1&dcfp=gdpr,usp"],["Connection","keep-alive"],["Cookie","hbcm_sd=1%7C1646673074314; visitor-id=2896746747280784000V10"],["Sec-Fetch-Dest","image"],["Sec-Fetch-Mode","no-cors"],["Sec-Fetch-Site","same-origin"]] 0 1 null image 2022-03-07T19:11:14.410000 0 null media.net null null null null 1 1 2147483652 2147483649 [{"frameId":2147483649,"url":"https://contextual.media.net/medianet.php?cid=8CUT39MWR&crid=715624197&size=306x271&https=1"},{"frameId":0,"url":"https://www.msn.com/de-de/"}] 129 https://contextual.media.net https://contextual.media.net https://contextual.media.net/checksync.php?&vsSync=1&cs=1&hb=1&cv=37&ndec=1&cid=8HBSKZM1Y&prvid=77%2C117%2C184%2C188%2C203%2C226%2C246%2C2030%2C2033%2C3018&itype=HB-CM&rtime=9&https=1&gdpr=1&gdprconsent=1&usp_status=0&usp_consent=1&dcfp=gdpr,usp null null null https://contextual.media.net/cksync.php 192876
47 0 https://ups.analytics.yahoo.com/ups/58222/sync?_origin=1&uid=0000EEA&apid=UP9841187a-9e39-11ec-a345-061779e0c7c0 https://www.msn.com/de-de/ GET https://contextual.media.net/ [["Host","ups.analytics.yahoo.com"],["User-Agent","Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0"],["Accept","image/avif,image/webp,/"],["Accept-Language","en-US,en;q=0.5"],["Accept-Encoding","gzip, deflate, br"],["Referer","https://contextual.media.net/"],["Connection","keep-alive"],["Cookie","A3=d=AQABBLI8JmICEPL2EPXsDfBFliWLBa28-40FEgEBAQGOJ2IwYgAAAAAA_eMAAAcIsjwmYq28-40&S=AQAAAkXJG3i7bt2vymX74kfQ1VQ; B=8rutsllh2cf5i&b=3&s=rs; IDSYNC=18xa~23mh"],["Sec-Fetch-Dest","image"],["Sec-Fetch-Mode","no-cors"],["Sec-Fetch-Site","cross-site"]] 0 1 null image 2022-03-07T19:11:14.939000 0 null yahoo.com null null null null 1 1 2147483652 2147483649 [{"frameId":2147483649,"url":"https://contextual.media.net/medianet.php?cid=8CUT39MWR&crid=715624197&size=306x271&https=1"},{"frameId":0,"url":"https://www.msn.com/de-de/"}] 129 https://contextual.media.net https://contextual.media.net https://contextual.media.net/checksync.php?&vsSync=1&cs=1&hb=1&cv=37&ndec=1&cid=8HBSKZM1Y&prvid=77%2C117%2C184%2C188%2C203%2C226%2C246%2C2030%2C2033%2C3018&itype=HB-CM&rtime=9&https=1&gdpr=1&gdprconsent=1&usp_status=0&usp_consent=1&dcfp=gdpr,usp null null null https://ups.analytics.yahoo.com/ups/58222/sync 193101
47 0 https://pixel.advertising.com/ups/58222/sync?_origin=1&uid=0000EEA https://www.msn.com/de-de/ GET https://contextual.media.net/ [["Host","pixel.advertising.com"],["User-Agent","Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0"],["Accept","image/avif,image/webp,/"],["Accept-Language","en-US,en;q=0.5"],["Accept-Encoding","gzip, deflate, br"],["Referer","https://contextual.media.net/"],["Connection","keep-alive"],["Sec-Fetch-Dest","image"],["Sec-Fetch-Mode","no-cors"],["Sec-Fetch-Site","cross-site"]] 0 1 null image 2022-03-07T19:11:14.585000 0 null advertising.com null null null null 1 1 2147483652 2147483649 [{"frameId":2147483649,"url":"https://contextual.media.net/medianet.php?cid=8CUT39MWR&crid=715624197&size=306x271&https=1"},{"frameId":0,"url":"https://www.msn.com/de-de/"}] 129 https://contextual.media.net https://contextual.media.net https://contextual.media.net/checksync.php?&vsSync=1&cs=1&hb=1&cv=37&ndec=1&cid=8HBSKZM1Y&prvid=77%2C117%2C184%2C188%2C203%2C226%2C246%2C2030%2C2033%2C3018&itype=HB-CM&rtime=9&https=1&gdpr=1&gdprconsent=1&usp_status=0&usp_consent=1&dcfp=gdpr,usp null null null https://pixel.advertising.com/ups/58222/sync 192941
47 0 https://pixel.advertising.com/ups/58222/sync?_origin=1&uid=0000EEA&verify=true https://www.msn.com/de-de/ GET https://contextual.media.net/ [["Host","pixel.advertising.com"],["User-Agent","Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0"],["Accept","image/avif,image/webp,/"],["Accept-Language","en-US,en;q=0.5"],["Accept-Encoding","gzip, deflate, br"],["Referer","https://contextual.media.net/"],["Connection","keep-alive"],["Cookie","APID=UP9841187a-9e39-11ec-a345-061779e0c7c0"],["Sec-Fetch-Dest","image"],["Sec-Fetch-Mode","no-cors"],["Sec-Fetch-Site","cross-site"]] 0 1 null image 2022-03-07T19:11:14.759000 0 null advertising.com null null null null 1 1 2147483652 2147483649 [{"frameId":2147483649,"url":"https://contextual.media.net/medianet.php?cid=8CUT39MWR&crid=715624197&size=306x271&https=1"},{"frameId":0,"url":"https://www.msn.com/de-de/"}] 129 https://contextual.media.net https://contextual.media.net https://contextual.media.net/checksync.php?&vsSync=1&cs=1&hb=1&cv=37&ndec=1&cid=8HBSKZM1Y&prvid=77%2C117%2C184%2C188%2C203%2C226%2C246%2C2030%2C2033%2C3018&itype=HB-CM&rtime=9&https=1&gdpr=1&gdprconsent=1&usp_status=0&usp_consent=1&dcfp=gdpr,usp null null null https://pixel.advertising.com/ups/58222/sync 193016

PS: global_uniq_id is my internal row number.

nrllh avatar Apr 13 '22 08:04 nrllh

Hey, this might be due to these requests being part of a redirect chain. IIRC, the HTTP channel gets reused during a redirect, so all of these requests might indeed be triggered by a single call. Try looking at the response_status in http_responses and see if that brings up anything.
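A minimal sketch of that response_status check, assuming http_responses exposes visit_id, request_id, url and response_status columns. The in-memory table and its rows are made up for illustration and are not taken from the dataset in this issue:

```python
import sqlite3

# Hypothetical miniature of http_responses: only the columns used here are
# created, and the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE http_responses ("
    "visit_id INTEGER, request_id INTEGER, url TEXT, response_status INTEGER)"
)
conn.executemany(
    "INSERT INTO http_responses VALUES (?, ?, ?, ?)",
    [
        (1, 129, "https://a.example/start", 302),
        (1, 129, "https://b.example/hop", 302),
        (1, 129, "https://c.example/final", 200),
        (1, 130, "https://d.example/single", 200),
    ],
)


def redirect_request_ids(conn):
    """Return (visit_id, request_id, n_responses) for ids with any 3XX response."""
    return conn.execute(
        "SELECT visit_id, request_id, COUNT(*) AS n_responses "
        "FROM http_responses "
        "GROUP BY visit_id, request_id "
        "HAVING SUM(response_status BETWEEN 300 AND 399) > 0 "
        "ORDER BY visit_id, request_id"
    ).fetchall()


redirected = redirect_request_ids(conn)
print(redirected)
```

If a request_id that shows up several times in http_requests is listed here, the repeated rows are most likely hops of one redirect chain rather than unrelated requests.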

vringar avatar Apr 13 '22 09:04 vringar

The http_redirects table might be outdated or no longer needed.

vringar avatar Apr 13 '22 09:04 vringar

Hey, thanks! Yes, that's the case. All of them are redirects. However, I still wonder what this means for callstacks. Is the request_id in the callstacks table a reference to the last such request or the first one, based on timestamp?

nrllh avatar Apr 13 '22 10:04 nrllh

It's a reference to the entire request chain. The script creates the first request, which then returns with a redirect status code and kicks off the second request. So indirectly the script is responsible for both requests, even though it only directly started the first one. So based on timestamp it directly caused the first one but for analysis purposes it might be helpful to create a mapping from callstack to ordered list of redirects.

When we did this kind of analysis, we called them request chains.
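One way to build such a mapping is to group rows by (visit_id, request_id) and order each group by time_stamp. The sketch below uses the four rows from the example above, abbreviated to the URL without its query string. Using 47 as the visit_id and the name build_request_chains are assumptions for illustration:

```python
from collections import defaultdict
from datetime import datetime

# (visit_id, request_id, time_stamp, url without query string), abbreviated
# from the rows posted in this issue. The visit_id value 47 is an assumption.
rows = [
    (47, 129, "2022-03-07T19:11:14.410000", "https://contextual.media.net/cksync.php"),
    (47, 129, "2022-03-07T19:11:14.939000", "https://ups.analytics.yahoo.com/ups/58222/sync"),
    (47, 129, "2022-03-07T19:11:14.585000", "https://pixel.advertising.com/ups/58222/sync"),
    (47, 129, "2022-03-07T19:11:14.759000", "https://pixel.advertising.com/ups/58222/sync"),
]


def build_request_chains(rows):
    """Map (visit_id, request_id) to its URLs ordered by time_stamp."""
    grouped = defaultdict(list)
    for visit_id, request_id, time_stamp, url in rows:
        grouped[(visit_id, request_id)].append((datetime.fromisoformat(time_stamp), url))
    # Sorting the (timestamp, url) tuples puts the hops in chronological order.
    return {key: [url for _, url in sorted(hops)] for key, hops in grouped.items()}


chains = build_request_chains(rows)
for key, urls in chains.items():
    print(key, urls)
```

The first URL in each list is the request the script started directly, and the rest are the redirects that followed from it. A callstack row with the same (visit_id, request_id) can then be attached to the whole list.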

vringar avatar Apr 13 '22 10:04 vringar

Thank you very much, it helped to solve my issue. So I'm closing the issue.

nrllh avatar Apr 13 '22 15:04 nrllh

@vringar sorry for the spam, but I didn't want to create a new issue for that since it's potentially related to this issue:

Problem 1: As far as I can see, it's not possible to correlate the entries in the call_stack column (in the callstacks table) directly with an ID. I guess the only option is to compare strings and hope to get the right request_id. If there are multiple records with the same request URL, it's very hard to find the right request_id for the requests that appear in call_stack.

Problem 2: Another problem I face is how to determine which request triggered the next one. As far as I could observe, the sequence of requests is either top-down or bottom-up. Here is an example:

 instrumentFunction/<@https://space.bilibili.com/7584632:362:25;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:30329;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:23700;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:23299;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:22815;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:22575;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:100310;null
o@https://s1.hdslb.com/bfs/static/jinkela/space/space.ff495225cc805974552c20fc851f8da0f2cd085a.js:1:51142;null
videoExposureReport@https://s1.hdslb.com/bfs/static/jinkela/space/11.space.ff495225cc805974552c20fc851f8da0f2cd085a.js:1:27800;null
770/mounted/</<@https://s1.hdslb.com/bfs/static/jinkela/space11.space.ff495225cc805974552c20fc851f8da0f2cd085a.js:1:27070;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:23700;null
sentryWrapped@https://s1.hdslb.com/bfs/static/jinkela/long/js/sentry/sentry-5.2.1.min.js:2:37520;null

Problem 3: As you can see the URL https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js appears in different sequences. How should I interpret that?

Thank you very much in advance!

nrllh avatar Apr 21 '22 11:04 nrllh

Hey,

  1. I'm sorry, I don't quite understand this problem. Which ID do you want to correlate it to? The request_id? You can use the request_id to correlate with a redirect chain. The first element in the redirect chain is the URL originally called, and the last one is the URL that returned a status code other than 3XX. What other correlation do you want?
  2. I don't think you can determine which request triggered which other one. The callstack is bottom-to-top, so the first function called is sentryWrapped, which then calls value, which calls 770/mounted/</< (however that name came about).
  3. This is because the script is calling other functions in the same script. I'm assuming they are all called value because they are all function objects or whatever the minifier produced. Calls from other code might also end up back in the script due to callbacks or something similar.
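To make the bottom-to-top reading concrete, here is a rough sketch that parses a stack string into frames and reverses it into call order. It assumes each line has the shape name@url:line:column;asyncCause, as in the example above. Frame and parse_call_stack are hypothetical names, and the sample is a four-line excerpt of that stack:

```python
from collections import Counter
from typing import List, NamedTuple


class Frame(NamedTuple):
    func_name: str
    script_url: str
    line: int
    column: int
    async_cause: str


def parse_call_stack(call_stack: str) -> List[Frame]:
    """Parse lines shaped like 'name@url:line:column;asyncCause' (assumed format)."""
    frames = []
    for raw in call_stack.splitlines():
        raw = raw.strip()
        if not raw:
            continue
        location, _, async_cause = raw.rpartition(";")
        func_name, _, position = location.partition("@")
        # The URL itself contains colons, so split line/column off from the right.
        script_url, line, column = position.rsplit(":", 2)
        frames.append(Frame(func_name, script_url, int(line), int(column), async_cause))
    return frames


# Excerpt of the stack posted above (first two, one middle and the last frame).
stack = """instrumentFunction/<@https://space.bilibili.com/7584632:362:25;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:30329;null
value@https://s1.hdslb.com/bfs/seed/log/report/log-reporter.js:1:23700;null
sentryWrapped@https://s1.hdslb.com/bfs/static/jinkela/long/js/sentry/sentry-5.2.1.min.js:2:37520;null"""

frames = parse_call_stack(stack)
call_order = list(reversed(frames))  # outermost caller first
frames_per_script = Counter(frame.script_url for frame in frames)

print([frame.func_name for frame in call_order])
print(frames_per_script.most_common(1))
```

Counting frames per script URL shows the same script contributing several frames, which is just one script calling its own functions and not several requests.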

vringar avatar May 05 '22 15:05 vringar

1/ I think the level of tracing you want to do is just not possible with the instrumentation we have in place right now. The stacks we save come directly from the browser; we don't have a way to label which script URL listed in the stack corresponds to which webRequest ID. That would require a bunch of plumbing throughout the browser to trace properly. Note that if you link a call stack table row back to a web request, then you know which JS context that call is executing in. So this is only a problem when there are multiple copies of a script executing in the exact same context (which does happen).

2&3/ It sounds like you might be confusing the call stack with HTTP redirects. As Stefan mentions, the call stack shows calling relationships between scripts that are executing in the same JS context, not a series of requests. So scripts can call into each other (or use methods defined in one another).

englehardt avatar May 18 '22 03:05 englehardt

Thank you very much. I had some difficulty understanding the callstacks, but now it's clear.

I'm not sure whether I should create a new issue, but I can't see DNS responses for all HTTP redirects. It seems we only have the DNS response for the final request of each request chain. That probably means we are missing some data for redirect chains in the dns_responses table.

nrllh avatar May 19 '22 11:05 nrllh

I noticed that DNS issue myself and filed #1020 for it.

englehardt avatar Jan 12 '23 02:01 englehardt

Hi there! I am an undergraduate researching browser fingerprinting. So, ultimately,

  1. What is the difference between id and request_id?
  2. How are request_id grouped?

wesley-tan avatar Jun 10 '24 03:06 wesley-tan

  1. The id is the row number, which increases independently of request_id or visit_id. The request_id is the ID of HTTP requests, and it resets after each visit.
  2. The data is grouped by visit_id.
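Because request_id resets after each visit, joins between tables probably need visit_id as part of the key. Here is a small sketch with an in-memory SQLite database. The two tables are cut down to a few assumed columns, and the rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, trimmed-down versions of the two tables; rows are invented.
conn.executescript(
    """
    CREATE TABLE http_requests (
        id INTEGER PRIMARY KEY, visit_id INTEGER, request_id INTEGER, url TEXT);
    CREATE TABLE callstacks (
        id INTEGER PRIMARY KEY, visit_id INTEGER, request_id INTEGER, call_stack TEXT);
    INSERT INTO http_requests (visit_id, request_id, url) VALUES
        (1, 5, 'https://a.example/script-request'),
        (2, 5, 'https://b.example/other-request');
    INSERT INTO callstacks (visit_id, request_id, call_stack) VALUES
        (1, 5, 'f@https://a.example/app.js:1:1;null');
    """
)

# Joining on request_id alone also matches the unrelated request from visit 2.
naive = conn.execute(
    "SELECT r.url, c.call_stack FROM callstacks c "
    "JOIN http_requests r ON r.request_id = c.request_id "
    "ORDER BY r.id"
).fetchall()

# Adding visit_id to the join key keeps the match inside one visit.
scoped = conn.execute(
    "SELECT r.url, c.call_stack FROM callstacks c "
    "JOIN http_requests r "
    "ON r.visit_id = c.visit_id AND r.request_id = c.request_id "
    "ORDER BY r.id"
).fetchall()

print(len(naive), len(scoped))
```

The row id is not useful as a join key here, since it only numbers the rows within each table.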

nrllh avatar Jun 10 '24 07:06 nrllh