Pywb not returning cdx results for URLs in harvest

Open obrienben opened this issue 3 years ago • 0 comments

Describe the bug

Multiple URLs from an NLNZ harvest are not displaying via Pywb. We have narrowed the issue down to Pywb excluding correct CDX records from the results it receives from OutbackCDX (or from its internally managed CDX index). In testing, I extracted one of the problem CDX records into its own CDX file; with that file, Pywb displays the result correctly in the UI. With the original index, if I run the same query against OutbackCDX that Pywb runs, curl 'http://localhost:8082/index?url=http://www.theaudience.co.nz/genre/alt-indie&closest=&sort=closest&matchType=prefix', I get 124 lines of CDX, including the correct line, but see no results in the Pywb UI.
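For anyone wanting to repeat the OutbackCDX check without reading 124 lines by eye, here is a minimal Python sketch. It rebuilds the query shown in the curl command and filters the response for one capture URL. The function names are hypothetical, and it assumes the response is plain space-separated CDX with the original URL in the third field; adjust the field index if your index uses a different layout.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_prefix_query(index_url, url):
    """Build the same prefix query as the curl command above."""
    params = {"url": url, "closest": "", "sort": "closest", "matchType": "prefix"}
    return index_url + "?" + urlencode(params)


def fetch_cdx(query_url):
    """Fetch the raw CDX text (only call this with OutbackCDX running)."""
    with urlopen(query_url) as resp:
        return resp.read().decode("utf-8")


def matching_lines(cdx_text, original_url):
    """Return CDX lines whose third field equals original_url.

    Assumption: space-separated CDX with the original URL in field 3.
    """
    hits = []
    for line in cdx_text.splitlines():
        fields = line.split(" ")
        if len(fields) >= 3 and fields[2] == original_url:
            hits.append(line)
    return hits


# Example usage against a local OutbackCDX (not run here):
# q = build_prefix_query("http://localhost:8082/index",
#                        "http://www.theaudience.co.nz/genre/alt-indie")
# target = "http://www.theaudience.co.nz/genre/alt-indie/?ajax=true&_=1507495059747"
# print(matching_lines(fetch_cdx(q), target))
```

If this prints the expected line while the Pywb UI still shows nothing, the record is present in the index and is being dropped somewhere after the CDX server responds.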

I have tested this with Pywb's inbuilt CDX management, and I get the same results.

Steps to reproduce the bug

To see an example with one URL from the problem harvest, use Pywb and OutbackCDX and load either of the attached CDX files into OutbackCDX. Then search for the following URL via the Pywb collection search UI: "http://www.theaudience.co.nz/genre/alt-indie/?ajax=true&_=1507495059747". Using test_successful.cdx, one result is returned. Using test_failing.cdx, no results are returned.

test_cdx.zip

Expected behavior

Pywb should return the same results when searching for the URL "http://www.theaudience.co.nz/genre/alt-indie/?ajax=true&_=1507495059747", whichever of the attached CDX files is used and whether it is loaded into OutbackCDX or into Pywb.

Screenshots

test_successful.cdx image

test_failing.cdx image

Environment

  • OS: Has occurred on RHEL7 and Ubuntu 20.

  • Browser: Not browser specific, but tested in Chrome and Firefox.

  • Version: Python v3.6.8 and v3.8.10. Pywb v2.6.7.

Additional context

If I search for an exact match of the problem URL, then I do see results in Pywb. During replay, however, the harvest does not request these URLs exactly because of the dynamic ID on the end, so the lookup must be going through the fuzzy matching logic in Pywb.
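To illustrate why an exact lookup is not enough here, the sketch below compares two requests that differ only in the dynamic `_` parameter. This is not Pywb's actual fuzzy matching code; `strip_cache_buster` is a hypothetical helper, and treating `_` as the only cache-busting parameter is an assumption made for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_cache_buster(url, names=("_",)):
    """Drop query parameters assumed to be cache busters (hypothetical rule)."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# URL as captured in the harvest (from this issue).
captured = "http://www.theaudience.co.nz/genre/alt-indie/?ajax=true&_=1507495059747"
# A replay-time request with a different, made-up dynamic ID.
requested = "http://www.theaudience.co.nz/genre/alt-indie/?ajax=true&_=1659400000000"

# The raw URLs differ, so an exact lookup misses the capture,
# but they agree once the dynamic ID is ignored.
same_after_strip = strip_cache_buster(captured) == strip_cache_buster(requested)
```

Any lookup that ignores the ID has to search a wider range of the index and then filter the candidates, which is the stage where the correct record appears to be lost.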

obrienben • Aug 02 '22 02:08