Option to exclude failed requests from cache
The cache is incredibly helpful for reducing repetitive URL checks when running lychee, e.g. on PR and push events in GitHub Actions, and it also lowers the risk of running into rate limiting. We do the following:
```yaml
- uses: actions/cache/restore@v4
  with:
    path: .lycheecache
    key: lycheecache-${{ github.run_id }}
    restore-keys: lycheecache-
- run: ./lychee --cache --max-cache-age 2d --cache-exclude-status '201..'
- uses: actions/cache/save@v4
  if: always()
  with:
    path: .lycheecache
    key: lycheecache-${{ github.run_id }}
```
This makes use of the GitHub Actions cache action, split into dedicated restore and save steps, so that the cache is saved even when URL checks fail.
Entries older than 2 days are ignored; of course this could easily be increased.
What I am trying to achieve with the exclusion is to ensure that any previously failed request is checked again, instead of its failed result being taken from the cache. Especially when running into rate limits or temporary network issues on either side, a rerun can often turn things green. However, excluding HTTP response codes from 201 upwards only works if there actually was a response. In case of a network error there is no code, the status field in the cache file is empty, and there is no way to exclude it (AFAIK). The same applies when a fragment check fails: the cache entry contains an empty response code.
Hence it would be great to also have a way to exclude entries with an empty response code (e.g. from network errors) from the cache.
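Until such an option exists, one possible workaround is to prune the cache file between the restore step and the lychee run. This is only a sketch, not a lychee feature: it assumes .lycheecache is a comma-separated file with the status code in the last field, and the function name prune_lychee_cache is made up here. Verify the assumption against your actual cache file before using it.

```shell
# Hypothetical workaround, not a built-in lychee feature: drop cache entries
# whose last field is not a numeric status of at most 200, so that failed,
# rate-limited and empty-status (network error) entries get re-checked.
# Assumes the status code is the last comma-separated field of each line.
prune_lychee_cache() {
  local cache="${1:-.lycheecache}"
  [ -f "$cache" ] || return 0
  awk -F',' '$NF ~ /^[0-9]+$/ && $NF + 0 <= 200' "$cache" > "$cache.pruned"
  mv "$cache.pruned" "$cache"
}
```

Entries that are dropped are simply checked again on the next run, so the worst case is a few extra requests.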
Optionally, a shorthand status like failed, excluding all cache entries that would be treated as failures (taking --accept into account), could be handy. I think this is the most common use case: do not re-check working links in short succession, but allow failed ones to be re-checked by the same workflow. It would not replace the empty-response exclusion, though. For example, we also accept 429, the GitHub rate-limit response. This is needed because GitHub tokens work for api.github.com but not for github.com, and we have a LOT of GitHub repo, wiki and profile URLs on our documentation website. Running the check multiple times, caching 200 responses but re-checking 429 ones, eventually lets lychee get a 200 response from all GitHub URLs. Not an awesome solution, but it works well enough.
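For reference, the combination described above can be sketched as a single invocation. The docs/ path and the exact ranges are illustrative placeholders; --cache, --max-cache-age, --cache-exclude-status and --accept are the flags mentioned in this issue, but check your installed lychee version for the supported range syntax:

```shell
# Illustrative invocation combining the flags discussed above; docs/ is a
# placeholder path. 429 is accepted so rate-limited GitHub URLs do not fail
# the run, while --cache-exclude-status '201..' keeps them out of the cache
# so they are re-checked on the next run.
./lychee --cache --max-cache-age 2d \
  --cache-exclude-status '201..' \
  --accept '200..=204,429' \
  docs/
```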
Related: https://github.com/lycheeverse/lychee-action/issues/291
First of all, thanks for lychee. Very neat tool!
We're running into the same issue :)
The cache is great, but it's pretty unintuitive that failed URLs are also cached.
This is especially true when checking URLs to a book or website one has full control over, since one can easily fix the issue and restore the URL. For this case, it would be great to have a config flag to not cache failed entries.
Even though I know of this behavior, I've repeatedly run into it and wondered why a URL isn't reachable 😀
@charludo would you be interested in fixing this one as well? I know that you recently worked on some of the cache stuff. 😃