pywb

Rewrite any page URL extensions from .html to /

Open Sitetools opened this issue 7 months ago • 0 comments

Expected behavior

We are archiving a website that has had a few incarnations. The older archives have pages with a .html extension, and the newer archives use a trailing /. I want any pages whose URL paths are identical except for that ending (like the pair below) to be treated as if they were the same URL.

  • https://mydomain.com/segment-1/segment-2.html
  • https://mydomain.com/segment-1/segment-2/
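To make the expected equivalence concrete, here is a minimal sketch in plain Python. It is not pywb code or pywb API; the function name `canonical_page_url` is hypothetical and only illustrates the mapping being asked for, where both URLs above collapse to one key.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def canonical_page_url(url: str) -> str:
    """Map a trailing '.html' in the URL path to '/' (hypothetical helper)."""
    parts = urlsplit(url)
    # Only the end of the path is rewritten; the dot is escaped so that
    # it matches a literal '.' rather than any character.
    path = re.sub(r"\.html$", "/", parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


old_style = canonical_page_url("https://mydomain.com/segment-1/segment-2.html")
new_style = canonical_page_url("https://mydomain.com/segment-1/segment-2/")
```

With this mapping, `old_style` and `new_style` are the same string, which is the behavior wanted when looking up captures.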

What actually happened

If I navigate to https://mydomain.com/segment-1/segment-2.html, I am only able to see dates where this exact page URL was archived, and not captures of https://mydomain.com/segment-1/segment-2/ that were archived at a later date.

Things I have tried

I have tried many variations of filtering and fuzzy matching in config.yaml. I also added a rules.yaml file, but it was ignored.

default_filters:
  url_normalize:
    - match: '.html$'
      replace: '/'
      
rules:
  - url_prefix: 'com,mydomain)/'
    
    rewrite:
      fuzzy_lookup:
        - match: '.html'
          replace: '/'
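Separately from how pywb reads these keys, the patterns themselves behave differently as regular expressions. The sketch below assumes they would be applied as Python regexes to a SURT-style key, which is an assumption and not confirmed pywb behavior; the second sample key is made up for illustration.

```python
import re

# SURT-style key for the page in question (form taken from url_prefix above).
key = "com,mydomain)/segment-1/segment-2.html"
# Made-up key that contains 'html' in places where no rewrite is wanted.
other = "com,mydomain)/guide.html-old/xhtml"

# Escaped and anchored: only a literal '.html' at the very end is replaced.
anchored = re.sub(r"\.html$", "/", key)
anchored_other = re.sub(r"\.html$", "/", other)

# Unescaped and unanchored, as in the fuzzy_lookup attempt: '.' matches any
# character, so 'xhtml' is rewritten too, and matches occur mid-path.
loose_other = re.sub(r".html", "/", other)
```

Here `anchored` ends in `segment-2/`, `anchored_other` is left unchanged, and `loose_other` is rewritten in two places, so an escaped dot and a `$` anchor are probably the safer form whichever setting turns out to be the right one.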

It would be good to know if I'm in the right ballpark with how to resolve my issue.

Browser

I am running Chrome on Ubuntu, and I have also tested on Firefox.

Sitetools avatar Mar 17 '25 22:03 Sitetools