Pywb failing to handle self-redirects from OutbackCDX

Open obrienben opened this issue 1 year ago • 7 comments

Describe the bug

Pywb is throwing a LiveResourceException when receiving a self-redirect (3xx) from OutbackCDX. This results in Pywb displaying a blank page with the text "Not found".

Steps to reproduce the bug

An example WARC file is attached in this Slack thread: https://iipc.slack.com/archives/C2NR32PNF/p1691445882952669. Try accessing "http://2020.org.nz/" using the redirect record; Pywb displays the "Not found" message.

Expected behavior

Pywb should process the self-redirect record from OutbackCDX and load the record that the self-redirect points to.

Screenshots

[Image: Pywb logs] [Image: OutbackCDX logs]

Environment

  • Not browser specific
  • Occurs in Pywb v2.7.0 onwards
  • Does not occur before v2.7.0
  • OutbackCDX v0.11.0

obrienben, Sep 22 '23 00:09

Thank you for the excellent bug report, with the pywb version dependence.

I can't access the slack warc file because my org isn't a member. I only have guest access.

wumpus, Sep 22 '23 04:09

We experience the same issue with OutbackCDX v. 0.11.1 and PyWb v. 2.7.4. Redirects result in "Not found".

lasztoth, Oct 10 '23 08:10

@wumpus Unfortunately, the WARC was too big to attach here. Happy to share it with you another way if you'd like.

obrienben, Nov 21 '23 22:11

I see Ilya has been assigned by Tessa and I know he does have access to the IIPC Slack. So it's in good hands.

wumpus, Nov 22 '23 01:11

We also ran into this issue recently. We use PyWb 2.83. We checked with 2.6.9, and it worked there. Is there any plan to fix this? Thanks

andreas-koch, Jul 12 '24 13:07

Echoing same error as well with PyWb 2.83 and OutbackCDX 1.0.0

HeliosLHC, Jul 15 '24 01:07

OutbackCDX has a partial workaround for this. If you run it with the --omit-self-redirects command-line option (or pass omitSelfRedirects=true in the query string) it will try to use the CDX redirect field to detect self redirects and hide them.
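
As a rough illustration of the query-string form, here is a minimal Python sketch that builds such a request URL. The base URL (`http://localhost:8080/mycollection`) and the helper name `build_cdx_query` are hypothetical, and `url` is assumed to be the lookup parameter your OutbackCDX endpoint accepts; only `omitSelfRedirects=true` comes from the comment above.

```python
from urllib.parse import urlencode


def build_cdx_query(base_url, target_url, omit_self_redirects=True):
    """Build a CDX query URL for a single collection endpoint.

    base_url is a hypothetical OutbackCDX collection URL; adjust it
    to match your own deployment.
    """
    params = {"url": target_url}
    if omit_self_redirects:
        # Per-request equivalent of the --omit-self-redirects option.
        params["omitSelfRedirects"] = "true"
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_cdx_query("http://localhost:8080/mycollection",
                          "http://2020.org.nz/"))
```

Passing the flag per request may be convenient for testing whether the workaround helps before changing how the server is started.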

Unfortunately, pywb's cdx-indexer and webrecorder/cdxj-indexer don't populate the redirect field, so if you used them to build your indexes this workaround won't help you. Without the redirect field populated, there's no way for OutbackCDX to detect self-redirects.
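
To show why the redirect field matters, here is a simplified Python sketch of this kind of check. It is not OutbackCDX's actual implementation: the dictionary keys (`url`, `status`, `redirect`), the `-` placeholder for an empty field, and the `normalize` helper (a crude stand-in for real URL canonicalization) are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Crude stand-in for URL canonicalization: ignore the scheme,
    lowercase the host, and drop a leading "www."."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def is_self_redirect(capture):
    """Return True if a 3xx capture redirects to its own normalized URL."""
    status = str(capture.get("status", ""))
    redirect = capture.get("redirect", "-")
    if not status.startswith("3"):
        return False
    if redirect in ("", "-"):
        # Redirect field not populated by the indexer: nothing to compare.
        return False
    return normalize(redirect) == normalize(capture["url"])
```

With a populated field, a capture such as `{"url": "http://2020.org.nz/", "status": "301", "redirect": "https://2020.org.nz/"}` is flagged under these assumptions; the same capture with `"redirect": "-"` is not, which mirrors the limitation described above.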

(For reference we use jwarc for CDX indexing plus some weird extra logic to handle our legacy pre-WARC collections.)

ato, Jul 25 '24 06:07