
Confusing presentation of redirected captures

Open petsva opened this issue 4 years ago • 2 comments

Problem

The presentation of redirected captures can be confusing. This problem has been discussed in Slack.

Situation

A WARC file contains a response for http:example.com with a 301/302 status that redirects to https:example.com. The same WARC file also contains a response for https:example.com with a 200 status.

Search

If you then search pywb for example.com, both responses appear in the search results, with different timestamps. This happens regardless of whether you specify http, https, or neither in the search string. Although the two hits have different URLs, both links lead to a page showing the https page.

This is confusing for the user. It looks like there are two captures of the same page, just hours apart, with exactly the same content and appearance.

Solution ideas

  • All redirected URL hits in the list could be shown in a visually different way.
  • Collapsing rules were discussed: collapsing hits whose timestamps are less than a configurable amount of time apart.
  • If the 'redirect' value in CDX* data were used, hits that redirect to other hits in the list could be removed.
  • Some other solution which reduces confusion.
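
As a rough illustration of the 'redirect' idea, here is a minimal Python sketch. It is not pywb code. It assumes each hit is a plain dict with a `url` key and a hypothetical `redirect` key holding the redirect target (or `-` when there is none), and that the index actually populates that value, which may not be the case in practice.

```python
def drop_redirected_hits(hits):
    """Remove hits whose redirect target is itself another hit in the list.

    `hits` is a list of dicts with "url" and (hypothetical) "redirect" keys.
    A missing redirect or "-" means the capture does not redirect.
    """
    urls = {h["url"] for h in hits}
    kept = []
    for h in hits:
        target = h.get("redirect")
        redirects_to_other_hit = (
            target not in (None, "-")
            and target != h["url"]
            and target in urls
        )
        if not redirects_to_other_hit:
            kept.append(h)
    return kept
```

Redirects whose target was never captured are kept, so the user can still see that the redirecting capture exists.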

petsva avatar Feb 15 '21 22:02 petsva

One solution to this that @ldko suggested on today's OH-SOS call was to follow practice in Internet Archive's Wayback of color coding captures in the calendar. Here the green indicates that the results are for a 3XX redirect:

Screenshot 2023-06-06 at 12 05 44 PM

edsu avatar Jun 06 '23 16:06 edsu

Yes, that would be good. But the colour difference is hard to see in the timestamps when you click, so some extra marking is needed. And also (from Slack):

... when a redirected page is shown, I think. The user should be alerted that it is not what he/she selected that is displayed, but something else, from another date/time.

And then maybe some configurable filtering of the index query results, e.g. if a 200 result and a 3xx result have a timestamp difference < X minutes and are "canonically equal" (differing only in http/https or www), the 3xx result should not be presented. (A common case, I suppose.)
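
A minimal Python sketch of that filtering rule, to make the idea concrete. It is not pywb code: the hit dicts with `url`, `timestamp` (14-digit `YYYYMMDDhhmmss`) and `status` keys, the function names, and the 5-minute default are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from urllib.parse import urlsplit


def canonical_key(url):
    """Reduce a URL so that http/https and a leading 'www.' do not matter."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return (host, parts.path or "/", parts.query)


def parse_ts(ts):
    """Parse a 14-digit timestamp such as '20210215100000'."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def filter_redirects(hits, max_gap=timedelta(minutes=5)):
    """Hide 3xx hits that have a canonically equal 200 hit within max_gap."""
    ok_hits = [
        (canonical_key(h["url"]), parse_ts(h["timestamp"]))
        for h in hits
        if h["status"] == "200"
    ]
    kept = []
    for h in hits:
        if h["status"].startswith("3"):
            key = canonical_key(h["url"])
            ts = parse_ts(h["timestamp"])
            if any(k == key and abs(t - ts) < max_gap for k, t in ok_hits):
                continue  # a nearby 200 capture already covers this redirect
        kept.append(h)
    return kept
```

With this rule, a 301 for http://example.com/ captured two seconds before a 200 for https://www.example.com/ would be hidden, while a redirect captured a day later would still be listed.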

petsva avatar Jun 07 '23 00:06 petsva