Tracking Redirect Chains
Helloo, I was involved in a project couple of years ago where we analyzed redirect chains. Back then, we could use old_channel_id and new_channel_id in order to track the redirect chains. I have some free time currently and wanted to catch up on some work I did back then. Now, I realized that the database changed and you are supposed to track down the chains via old_request_id and new_request_id. But, the feature seems not to be implemented currently (see here). There is also a reference in the comments (see here) to a quite old stack overflow post how you are supposed to implement the feature. Now, I was wondering if there is a chance that this will be implemented in the near future?
Currently, I also have some time and could maybe invest some days to try and fix it. But, I am not sure if I am able to or if it is possible to implement the feature again currently.
Can you provide me with any information on this?
Any help is appreciated. :)
And great work btw. :)
Best, Beni
Maybe I was to quick with my request above. Am I right that I can track down the redirect chains as follows:
Each redirect chain can be identified by visit_id and old_request_id. So, for a redirect chain I get all http_requests by the visit_id and request_id. Then, I can order the http_redirects and http_requests by event_ordinal.
To verify the chain is correct, I can test that:
-
event_ordinalincrease monotonically and between twohttp_requeststhere should be ahttp_redirectwithevent_ordinalbetween the twohttp_requests -
old_request_urlof thishttp_redirectmatches the url of the firsthttp_requestandnew_request_urlmatches the url of the secondhttp_request
Thanks for coming back and documenting this. Did you check the documentation beforehand if so where would you expect this information?
Thanks a lot for the reply.
Unfortunately, I couldn't find the infomation in the docs. Could you please point out to me where to find how to extract redirect chains from the docs? Apologies, if it is written somewhere I couldn't find it.
But, I figured out that throughout triggered redirects, every new request triggered by a redirect seems to keep the same request_id. So, currently, I work with the following query to extract what I need:
-- Now, just group by old_request_id and aggregate result.
-- The resulting string of each row is supposed to contain the redirect chain originating from old_request_id.
SELECT browser_id, visit_id, old_request_id, string_agg(concat(id, '|||', event_ordinal, '|||', url), '-->' order by event_ordinal)
FROM
(
-- Join http_redirects with http_requests and sort by http_redirect.event_ordinal.
-- Sort by http_redirect.event so that redirects originating from the same request
-- are in the correct order
SELECT hr.browser_id, hr.visit_id, old_request_id, id, event_ordinal, url
FROM
-- Select unique request_ids from http_redirects table.
(
SELECT DISTINCT browser_id, visit_id, old_request_id
FROM http_redirects
) AS distinct_redirects
JOIN http_requests hr ON distinct_redirects.old_request_id = hr.request_id and distinct_redirects.browser_id = hr.browser_id and distinct_redirects.visit_id = hr.visit_id
ORDER BY old_request_id, hr.event_ordinal ASC
) AS ordered_redirects
GROUP BY ordered_redirects.browser_id, ordered_redirects.visit_id, ordered_redirects.old_request_id
Inside columnn redirect_chain of the result, the query is supposed to return tuples of (http_request.id, http_requests.event_ordinal, http_requests.url) separated by an arrows '-->' containing all redirects originating from the first tuple.
I plan to write an extra query which recursively constructs the chains from old_request_url and new_request_url of the http_redirects table to verify the above query.
But, in case you can confirm that my query above extracts redirect chains it would give me some extra security.
Many thanks, Beni
Unfortunately, I couldn't find the infomation in the docs. Could you please point out to me where to find how to extract redirect chains from the docs? Apologies, if it is written somewhere I couldn't find it.
Sorry for being too terse. It's not documented anywhere, I want to add your comment to an appropriate place and was wondering where you as a user woud look first.
Iirc the methodology you proposed in your second comment is the one we used but unfortunately some of our analysis remained proprietary and I can't find the code to confirm it in OpenWPM-utils. We used Spark and would transform the redirect chain into a list similar to what you proposed with the recursive query.
I can't verify the SQL query right now but if this implements the logic previously described I'd give it tentative approval.
No worries, thanks a lot.
From the current documentation, I would expect the information to be somewhere in the "Schema Documentation" part.