
Option to exclude failed requests from cache

Open MichaIng opened this issue 7 months ago • 3 comments

The cache is incredibly helpful for reducing repeated URL checks when running lychee, e.g. on PR and push events in GitHub Actions, and it also reduces the risk of running into rate limits. We do the following:

    - uses: actions/cache/restore@v4
      with:
        path: .lycheecache
        key: lycheecache-${{ github.run_id }}
        restore-keys: lycheecache-

    - run: ./lychee --cache --max-cache-age 2d --cache-exclude-status '201..'


    - uses: actions/cache/save@v4
      if: always()
      with:
        path: .lycheecache
        key: lycheecache-${{ github.run_id }}

This uses the GitHub Actions cache action, split into dedicated restore and save steps, so that the cache is saved even when the URL check fails.

Entries older than 2 days are ignored; this limit could of course be increased.

What I am trying to achieve with the exclude is to ensure that any previously failed request is checked again, instead of the failed result being taken from the cache. Especially after running into rate limits or temporary network issues on either side, a rerun can often turn things green. However, excluding HTTP response codes from 201 upwards only works if there actually was a response. In case of a network error there is no code, and the status field in the cache file is empty, so there is no way to exclude it (AFAIK). The cache entry also contains an empty response code when the fragment check fails.

Hence it would be great to have a way to exclude entries with an empty HTTP response code, e.g. due to network errors, from the cache as well.
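Until such an option exists, one possible workaround is to prune the cache file between the restore step and the lychee run. Below is a minimal sketch in Python. It assumes that `.lycheecache` is a headerless CSV with rows of the form `url,status,timestamp`; that layout is an assumption based on looking at the file, not a documented interface, and `drop_empty_status` is a hypothetical helper name.

```python
import csv
import io


def drop_empty_status(cache_text: str) -> str:
    """Return cache_text without rows whose status column is empty.

    Assumption: each row is `url,status,timestamp` with no header line.
    """
    kept = []
    for row in csv.reader(io.StringIO(cache_text)):
        # Rows with fewer than 3 fields (including blank lines) are dropped
        # too: a missing cache entry only means the URL is checked again.
        if len(row) >= 3 and row[1].strip():
            kept.append(row)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(kept)
    return out.getvalue()
```

In the workflow this would run as an extra step after the cache restore, reading `.lycheecache` and writing the filtered text back before lychee starts.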

Optionally, a shorthand such as "failed", which excludes all cache entries that would be treated as failures (taking --accept into account), could be handy. I think the most common use case is to avoid rechecking working links in quick succession while still allowing failed ones to be rechecked with the same workflow. It would not replace the empty response exclusion, though. E.g. we accept 429 as well, the GitHub rate limit response. This is needed because GitHub tokens work for api.github.com but not for github.com, and we have a LOT of GitHub repo, wiki and profile URLs in our documentation website. However, running the check multiple times, caching 200 responses but checking 429 ones again, allows lychee to get 200 responses from all GitHub URLs at some point. Not an awesome solution, but it works well enough.
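To illustrate the intended semantics of such a shorthand, here is a rough sketch in Python that filters the cache file outside of lychee. It is not how lychee works internally. It assumes headerless CSV rows of the form `url,status,timestamp` (an assumption, not a documented format), and `drop_failed`, `accepted` and `recheck` are hypothetical names. The `recheck` set covers the case above where 429 is accepted but should still be checked again.

```python
import csv
import io
from typing import AbstractSet


def drop_failed(
    cache_text: str,
    accepted: AbstractSet[int],
    recheck: AbstractSet[int] = frozenset(),
) -> str:
    """Keep only rows whose status is an accepted code not marked for recheck.

    Assumption: each row is `url,status,timestamp` with no header line.
    """
    kept = []
    for row in csv.reader(io.StringIO(cache_text)):
        if len(row) < 3:
            continue  # blank or malformed row: drop, the URL gets rechecked
        try:
            code = int(row[1])
        except ValueError:
            # Empty or non-numeric status (e.g. after a network error)
            # is treated like a failure here and dropped.
            continue
        if code in accepted and code not in recheck:
            kept.append(row)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(kept)
    return out.getvalue()
```

For the setup described here, a call like `drop_failed(text, accepted={200, 429}, recheck={429})` would keep cached 200 results and force every 429, failed or empty entry to be checked again on the next run.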

MichaIng avatar Jun 16 '25 18:06 MichaIng

Related: https://github.com/lycheeverse/lychee-action/issues/291

mre avatar Jun 20 '25 15:06 mre

First of all, thanks for lychee. Very neat tool!

We're running into the same issue :)
The cache is great, but it's pretty unintuitive that failed URLs are also cached.

This is especially true when checking URLs of a book or website one has full control over, since one can easily fix the issue and restore the URL. For this case, it would be great to have a config flag to not cache failed entries.

Even though I know of this behavior, I've repeatedly run into this and wondered why a URL isn't reachable 😀

Nukesor avatar Jul 24 '25 15:07 Nukesor

@charludo would you be interested in fixing this one as well? I know that you recently worked on some of the cache stuff. 😃

mre avatar Sep 12 '25 22:09 mre