pywb
Pywb failing to handle self-redirects from OutbackCDX
Describe the bug
Pywb is throwing a LiveResourceException when receiving a self-redirect (3xx) from OutbackCDX. This results in Pywb displaying a blank page with the text "Not found".
Steps to reproduce the bug
An example WARC file is attached in this Slack thread: https://iipc.slack.com/archives/C2NR32PNF/p1691445882952669 Try accessing "http://2020.org.nz/" via the redirect record; Pywb displays a "Not found" message instead of the page.
Expected behavior
Pywb should process the self-redirect record from OutbackCDX and load the record that the self-redirect points to.
Screenshots
Pywb logs
OutbackCDX logs
Environment
- Not browser specific
- Occurs in Pywb v2.7.0 onwards
- Does not occur before v2.7.0
- OutbackCDX v0.11.0
Thank you for the excellent bug report, with the pywb version dependence.
I can't access the WARC file on Slack because my org isn't a member. I only have guest access.
We experience the same issue with OutbackCDX v. 0.11.1 and PyWb v. 2.7.4. Redirects result in "Not found".
@wumpus unfortunately the warc was too big to attach here. Happy to share it with you another way if you'd like it
I see Ilya has been assigned by Tessa and I know he does have access to the IIPC Slack. So it's in good hands.
We also ran into this issue recently. We use PyWb 2.83. We checked with 2.6.9, and it worked there. Is there any plan to fix this? Thanks
Echoing the same error with PyWb 2.83 and OutbackCDX 1.0.0
OutbackCDX has a partial workaround for this. If you run it with the --omit-self-redirects command-line option (or pass omitSelfRedirects=true in the query string) it will try to use the CDX redirect field to detect self redirects and hide them.
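For anyone wanting to try the query-string variant, here is a minimal sketch of building such a query in Python. Only the `omitSelfRedirects=true` parameter comes from the comment above; the base address `http://localhost:8080`, the collection name `myindex`, and the helper name `outbackcdx_query_url` are assumptions for illustration, so adjust them to your own setup.

```python
from urllib.parse import urlencode


def outbackcdx_query_url(base, collection, url, omit_self_redirects=True):
    """Build an OutbackCDX query URL (hypothetical helper, not part of pywb)."""
    params = {"url": url}
    if omit_self_redirects:
        # Parameter name taken from the workaround described above.
        params["omitSelfRedirects"] = "true"
    return f"{base.rstrip('/')}/{collection}?{urlencode(params)}"


# Assumed local OutbackCDX instance and collection name.
query = outbackcdx_query_url("http://localhost:8080", "myindex", "http://2020.org.nz/")
print(query)
```

Fetching that URL with and without the parameter should show whether the self-redirect record is being hidden for your index.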
Unfortunately, pywb's cdx-indexer and webrecorder/cdxj-indexer don't populate the redirect field, so if you used them to build your indexes this workaround won't help you. Without the redirect field populated, there's no way for OutbackCDX to detect self-redirects.
(For reference we use jwarc for CDX indexing plus some weird extra logic to handle our legacy pre-WARC collections.)
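To illustrate why the redirect field matters, here is a rough sketch of what a self-redirect check might look like. This is not OutbackCDX's actual implementation: the record layout (a dict with `url`, `status`, and `redirect` keys), the function names, and the simplified URL normalisation (ignore scheme and a leading `www.`) are all assumptions, and it only handles absolute redirect targets.

```python
from urllib.parse import urlsplit


def rough_key(url):
    """Crude stand-in for CDX URL canonicalisation (assumption, not real SURT)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    return host + path + ("?" + parts.query if parts.query else "")


def is_self_redirect(record):
    """True if a 3xx record redirects to a URL with the same rough key."""
    status = str(record.get("status", ""))
    target = record.get("redirect")
    # With no redirect field ("-" or missing) nothing can be detected.
    if not status.startswith("3") or not target or target == "-":
        return False
    return rough_key(record["url"]) == rough_key(target)


with_field = {"url": "http://2020.org.nz/", "status": "301", "redirect": "https://2020.org.nz/"}
without_field = {"url": "http://2020.org.nz/", "status": "301", "redirect": "-"}
print(is_self_redirect(with_field), is_self_redirect(without_field))
```

The second record is what an index without the redirect field would effectively look like: the status says 3xx, but there is no target to compare against.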