CDX-index-based playback: redirect URLs with www clobbering ones without www in them

Open christianleger opened this issue 9 years ago • 0 comments

Running the 2.3.0 distribution of CDX-indexer and Openwayback.

When replaying a WARC file via a CDX index, I'm getting an 'unavailable' response to a particular URL request even though the record is definitely in the WARC file. A reference to the URL exists in the CDX index, which points to the correct WARC file, and the correct offset.

The URL in question is also the redirect target of an otherwise identical www-prefixed URL. In the WARC file, the response for the www-prefixed URL is a 301 (pointing to the URL without www), and the response for the non-www URL is a 200.

Specifically:

http://mgerc-ceegm.gc.ca/index-eng.html has a 200 response entry in the WARC file. http://www.mgerc-ceegm.gc.ca/index-eng.html has a 301 response entry, which points to http://mgerc-ceegm.gc.ca/index-eng.html.

The cause appears to lie in the CDX index, which contains two consecutive entries with the same key:

mgerc-ceegm.gc.ca/index-eng.html .... 200
mgerc-ceegm.gc.ca/index-eng.html .... 301
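One plausible way for both captures to end up under a single key is that the indexer drops a leading www when it builds the key. The sketch below only illustrates that assumption: `simple_urlkey` is a hypothetical, simplified stand-in, not the actual CDX-indexer canonicalization code.

```python
from urllib.parse import urlsplit

def simple_urlkey(url):
    """Hypothetical, simplified key builder; not the real OpenWayback code."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: a leading "www." is dropped during canonicalization.
    if host.startswith("www."):
        host = host[len("www."):]
    return host + parts.path

urls = [
    "http://mgerc-ceegm.gc.ca/index-eng.html",      # 200 in the WARC file
    "http://www.mgerc-ceegm.gc.ca/index-eng.html",  # 301 in the WARC file
]
keys = [simple_urlkey(u) for u in urls]
print(keys[0] == keys[1])
```

Under that assumption, the two distinct URLs become indistinguishable by key alone, which would match the pair of entries shown above.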

In debugging, I saw that the replay software finds both the 200 and 301, but always acts (to the webapp frontend) as if only the 301 exists. It finds it, then tries to redirect, then fails to display the 200 record - the result states that the requested page is unavailable.

Removing the line having the 301 result from the CDX index file allows me to view the record!
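For anyone who needs the same workaround on a larger index, here is a rough sketch that automates the manual edit. It assumes the common 11-field layout ` CDX N b a m s k r M S V g`, with the key in field 0 and the status code in field 4; adjust `key_idx` and `status_idx` if your index differs. The function name and the sample values (timestamp, offset, file name) are invented for illustration. Note that this discards the redirect captures, so it is a stopgap, not a fix.

```python
def drop_shadowing_redirects(lines, key_idx=0, status_idx=4):
    """Drop 3xx lines whose key also has a 200 entry (hypothetical helper)."""
    def is_data(line):
        # Skip blank lines and the " CDX ..." header line.
        return bool(line.strip()) and not line.lstrip().startswith("CDX")

    keys_with_200 = {
        line.split()[key_idx]
        for line in lines
        if is_data(line) and line.split()[status_idx] == "200"
    }
    kept = []
    for line in lines:
        if is_data(line):
            fields = line.split()
            if fields[status_idx].startswith("3") and fields[key_idx] in keys_with_200:
                continue  # redirect that shadows a 200 capture under the same key
        kept.append(line)
    return kept

# Invented sample data; only the URLs and status codes come from the report.
sample = [
    " CDX N b a m s k r M S V g",
    "mgerc-ceegm.gc.ca/index-eng.html 20160101000000 "
    "http://mgerc-ceegm.gc.ca/index-eng.html text/html 200 - - - - 0 example.warc.gz",
    "mgerc-ceegm.gc.ca/index-eng.html 20160101000000 "
    "http://www.mgerc-ceegm.gc.ca/index-eng.html text/html 301 - "
    "http://mgerc-ceegm.gc.ca/index-eng.html - - 0 example.warc.gz",
]
for line in drop_shadowing_redirects(sample):
    print(line)
```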

It would be nice to see both entries in the CDX index be findable, possibly by adjusting the CDX-index generation to something such as:

example.com .... 200
www.example.com .... 301
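To make the suggestion concrete, here is a minimal sketch of key generation that keeps the host as captured, so the two entries stay distinct and sort apart. `www_preserving_key` is a hypothetical name, and this is not how the CDX-indexer currently builds keys.

```python
from urllib.parse import urlsplit

def www_preserving_key(url):
    """Hypothetical key builder that lowercases the host but keeps any www prefix."""
    parts = urlsplit(url)
    return (parts.hostname or "").lower() + parts.path

captures = [
    ("http://www.example.com/", "301"),
    ("http://example.com/", "200"),
]
# A CDX index is sorted by key, so sort the generated entries the same way.
index = sorted((www_preserving_key(u), status) for u, status in captures)
for key, status in index:
    print(key, "....", status)
```

One trade-off with this approach: a request for the www form would no longer match captures of the non-www form by key alone, so replay would need to follow the recorded 301 to reach the 200 record.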

Many thanks!

christianleger, Feb 10 '16 15:02