pywb
Confusing presentation of redirected captures
Problem
The presentation of redirected captures can be confusing. This problem has been discussed in Slack.
Situation
A WARC file contains a response for http:example.com which has a 301/302 status and redirects to https:example.com. The same WARC file also contains a response for https:example.com with a 200 status.
Search
If you then search with pywb for example.com, you get hits for both responses in the search results, with different timestamps. This happens regardless of whether you specify http, https or neither in the search string. The links of both hits go (despite being different URLs) to a page showing the https page.
This is confusing for the user. It looks like there are two captures of the same page, with just hours in between, and they have exactly the same content and appearance.
Solution ideas
- All redirected URL hit items in the list could be visually distinguished from the others.
- Collapsing rules were discussed: collapsing hit items with timestamps that are less than a configurable amount of time apart.
- If the 'redirect' value in CDX* data were used, hit items that are redirected to other hit items could be removed from the list.
- Some other solution which reduces confusion.
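The third idea above could look roughly like the sketch below. It is only an illustration: it assumes each hit is a plain dict with hypothetical `url`, `timestamp`, `status` and `redirect` keys, and `drop_redirects_to_listed_hits` is a made-up helper, not part of pywb's index API.

```python
def drop_redirects_to_listed_hits(hits):
    """Remove 3xx hits whose redirect target is itself one of the hits.

    Assumption: every hit is a dict with 'url', 'status' and 'redirect'
    keys. This layout is hypothetical, not pywb's real record format.
    """
    listed_urls = {hit["url"] for hit in hits}
    kept = []
    for hit in hits:
        is_redirect = str(hit.get("status", "")).startswith("3")
        target = hit.get("redirect")
        # Skip a redirect only when its target is shown as a separate hit.
        if is_redirect and target in listed_urls and target != hit["url"]:
            continue
        kept.append(hit)
    return kept


hits = [
    {"url": "http://example.com/", "timestamp": "20200101100000",
     "status": "301", "redirect": "https://example.com/"},
    {"url": "https://example.com/", "timestamp": "20200101120000",
     "status": "200", "redirect": None},
]
filtered = drop_redirects_to_listed_hits(hits)
```

With the two sample hits, only the https capture would remain in the list. A redirect whose target is not among the hits is kept, so no capture disappears without a replacement being shown.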
One solution to this that @ldko suggested on today's OH-SOS call was to follow practice in Internet Archive's Wayback of color coding captures in the calendar. Here the green indicates that the results are for a 3XX redirect:
Yes, that would be good. But the colour difference is hard to see in the timestamps when you click, so some extra marking is needed. And also (from Slack):
... when a redirected page is shown, I think. The user should be alerted that it is not what he/she selected that is displayed, but something else, from another date/time.
And then maybe some configurable filtering of the index query results, e.g. if a 200 result and a 3xx result have a timestamp difference < X minutes and are "canonically equal" (differing only in http/https or www), the 3xx result should not be presented. (A common case, I suppose.)
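A minimal sketch of such a filter, to make the rule concrete. Everything here is an assumption for illustration: hits are plain dicts with hypothetical `url`, `timestamp` (14-digit) and `status` keys, `canonical_key` is a simplified stand-in for real URL canonicalization, and the 10-minute default is arbitrary.

```python
from datetime import datetime, timedelta
from urllib.parse import urlsplit


def canonical_key(url):
    """Simplified canonical form: ignore scheme and a leading 'www.'."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return (host, parts.path or "/", parts.query)


def parse_ts(ts):
    """Parse a 14-digit timestamp such as 20200101100000."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def hide_nearby_redirects(hits, max_gap=timedelta(minutes=10)):
    """Drop 3xx hits that have a canonically equal 200 hit within max_gap."""
    ok_hits = [h for h in hits if h["status"] == "200"]
    kept = []
    for hit in hits:
        if hit["status"].startswith("3"):
            shadowed = any(
                canonical_key(ok["url"]) == canonical_key(hit["url"])
                and abs(parse_ts(ok["timestamp"]) - parse_ts(hit["timestamp"])) < max_gap
                for ok in ok_hits
            )
            if shadowed:
                continue
        kept.append(hit)
    return kept


sample = [
    {"url": "http://example.com", "timestamp": "20200101100000", "status": "301"},
    {"url": "https://www.example.com/", "timestamp": "20200101100500", "status": "200"},
]
visible = hide_nearby_redirects(sample)
```

In the sample, the two captures are five minutes apart and differ only in scheme and `www`, so the 301 hit is hidden under the default threshold. Passing a smaller `max_gap`, such as `timedelta(minutes=1)`, would keep both hits.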