
Cache fails to invalidate when changing `--remap` options

Open katrinafyi opened this issue 4 months ago • 3 comments

Running lychee repeatedly in the same directory with different --remap options does not invalidate the cache when --remap changes. I ran into this bug while testing remaps to try to write a correct remap expression.

I think a contributing factor to this bug is that the .lycheecache entries are keyed by the original URL, before any remaps are applied.
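As a rough illustration of why that keying leads to stale hits, here is a toy model in Python. It is not lychee's code: `apply_remap`, `check`, and `fake_fetch` are hypothetical names, and treating a remap as a regex substitution is an assumption of this sketch.

```python
import re

def apply_remap(url, remaps):
    """Toy stand-in for --remap: apply the first matching (pattern, replacement) pair."""
    for pattern, replacement in remaps:
        if re.search(pattern, url):
            return re.sub(pattern, replacement, url, count=1)
    return url

def check(url, remaps, cache, fetch):
    """Toy checker that keys the cache by the ORIGINAL URL, before remaps."""
    if url in cache:
        # The lookup never sees the remap, so a changed remap cannot cause a miss.
        return cache[url], True  # (status, served_from_cache)
    status = fetch(apply_remap(url, remaps))
    cache[url] = status
    return status, False

def fake_fetch(url):
    # Hypothetical server for the sketch: only the bare root URL exists.
    return 200 if url == "https://google.com/" else 404

remap = [("https://google.com", "https://google.com/nonexistent")]
cache = {}
first = check("https://google.com/", [], cache, fake_fetch)      # real request, no remap
second = check("https://google.com/", remap, cache, fake_fetch)  # remap added, stale hit
print(first, second)
```

In this model the second call returns the cached 200 even though the remapped URL would give a 404, which mirrors the first half of the demo.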

These commands should demonstrate the problem.

echo 'https://google.com' > a.md
# run once. this succeeds
lychee a.md --cache -v
# introduce remap which should cause failure. still succeeds.
lychee a.md --cache -v --remap "https://google.com https://google.com/nonexistent"
cat .lycheecache

# try in other direction (with remap initially, then removing it)
rm .lycheecache
lychee a.md --cache -v --remap "https://google.com https://google.com/nonexistent"
# incorrectly fails
lychee a.md --cache -v
cat .lycheecache

An example log of the commands is below, with lychee 0.20.0.

$ echo 'https://google.com' > a.md
$ lychee a.md --cache -v
     [200] https://google.com/

🔍 1 Total (in 0s) ✅ 1 OK 🚫 0 Errors
$ lychee a.md --cache -v --remap "https://google.com https://google.com/nonexistent"
   [INFO ] Cache is recent (age: 40s, max age: 1d 0h 0m 0s). Using.
     [200] https://google.com/ | OK (cached)

🔍 1 Total (in 0s) ✅ 1 OK 🚫 0 Errors
$ cat .lycheecache
https://google.com/,200,1756040828
$ rm .lycheecache
$ lychee a.md --cache -v --remap "https://google.com https://google.com/nonexistent"
     [404] https://google.com/nonexistent/ | Rejected status code (this depends on your "accept" configuration): Not Found

Issues found in 1 input. Find details below.

[a.md]:
     [404] https://google.com/nonexistent/ | Rejected status code (this depends on your "accept" configuration): Not Found

🔍 1 Total (in 0s) ✅ 0 OK 🚫 1 Error
$ lychee a.md --cache -v
   [INFO ] Cache is recent (age: 1m 26s, max age: 1d 0h 0m 0s). Using.
     [404] https://google.com/ | Error (cached)

Issues found in 1 input. Find details below.

[a.md]:
     [404] https://google.com/ | Error (cached)

🔍 1 Total (in 0s) ✅ 0 OK 🚫 1 Error
$ cat .lycheecache
https://google.com/,404,1756040883

There might also be similar bugs for other flags which affect URL resolution. However, since the cache is only used for remote URLs (I think?), --fallback-extensions and --index-files do not seem to be affected by this bug.

katrinafyi avatar Aug 24 '25 13:08 katrinafyi

> There might also be similar bugs for other flags which affect URL resolution. However, since the cache is only used for remote URLs (I think?), --fallback-extensions and --index-files do not seem to be affected by this bug.

Correct, the cache is only used for remote URLs.

> I think a contributing factor to this bug is the .lycheecache entries are keyed by the original URL, before any remaps are applied.

Yes, we use the original URL. The reasoning was that remaps might change between runs, so the original URL would be the "source of truth" we could depend on.

By invalidating the cache, do you mean deleting the .lycheecache file if we detect remaps? An alternative would be to skip the cache entirely and print a warning when remaps are used. The best option might be to ignore cache entries only for URLs that get remapped.
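To make the last option concrete, here is a minimal Python sketch of bypassing the cache only for remapped URLs. The names are hypothetical and this is not how lychee is implemented. The sketch also assumes that remapped results are not written back under the original URL, since otherwise removing the remap later would hit a stale entry.

```python
import re

def apply_remap(url, remaps):
    """Toy stand-in for --remap: apply the first matching (pattern, replacement) pair."""
    for pattern, replacement in remaps:
        if re.search(pattern, url):
            return re.sub(pattern, replacement, url, count=1)
    return url

def check(url, remaps, cache, fetch):
    """Cache stays keyed by the original URL, but remapped URLs skip it entirely."""
    final_url = apply_remap(url, remaps)
    remapped = final_url != url
    if not remapped and url in cache:
        return cache[url], True  # (status, served_from_cache)
    status = fetch(final_url)
    if not remapped:
        # Assumption of this sketch: only un-remapped results are stored.
        cache[url] = status
    return status, False

def fake_fetch(url):
    # Hypothetical server for the sketch: only the bare root URL exists.
    return 200 if url == "https://google.com/" else 404

remap = [("https://google.com", "https://google.com/nonexistent")]
cache = {}
r1 = check("https://google.com/", [], cache, fake_fetch)     # fetched, then cached
r2 = check("https://google.com/", remap, cache, fake_fetch)  # remapped, cache bypassed
r3 = check("https://google.com/", [], cache, fake_fetch)     # remap removed, cache hit
print(r1, r2, r3)
```

The trade-off in this variant is that remapped URLs are never cached, so they are re-requested on every run.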

mre avatar Sep 09 '25 09:09 mre

By invalidate, I just mean that if I have a cached URL and then I change the remap, lychee should no longer re-use the old cache entry. With the fix, the second lychee invocation in the demo commands would fail and the fourth would succeed.

I have no strong opinion as to how this is implemented, but I think throwing away the entire cache would be a rather blunt approach. I think the best solution would be to change the cache keys to be the addresses of the HTTP requests which are actually sent (in effect, the URLs after remaps). The lychee cache records HTTP response codes, so it makes sense for it to be keyed by HTTP request URLs. I don't know if this would cause other problems.

> Yes, we use the original URL. The reasoning was that remaps might change between runs, so the original URL would be the "source of truth" we could depend on.

I don't really follow the reasoning here. Are you worried that if a remap has run-dependent info (like pwd?), then it might lead to cache misses?

katrinafyi avatar Sep 09 '25 10:09 katrinafyi

> I think the best solution would be to change the cache keys to be addresses of HTTP requests which are actually sent (in effect, this would be after remaps).

Agreed.

> I don't know if this would cause other problems.

If it causes any problems, that is likely because other parts of the workflow depend on the incorrect behavior, so it should be fine. 😊

> I don't really follow the reasoning here. Are you worried that if a remap has run-dependent info (like pwd?), then it might lead to cache misses?

Sorry, please ignore my earlier reasoning. I was initially thinking that the cache should always use the original URL as the key, regardless of any remaps applied. For example, if you have google.com and remap it to foo.com, I thought we should cache under google.com since that's the "source" URL.

But your proposal makes much more sense: we should cache using the final URL after remaps are applied (foo.com in this example). This is logical because the cache stores HTTP response codes, so it should be keyed by the actual URLs we request. We never actually check the original URL when remaps are involved, so there's nothing meaningful to cache there.
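A small Python sketch of that keying, using the google.com to foo.com example. This is a toy model under stated assumptions, not lychee's implementation: the function names, the regex-substitution model of a remap, and the status codes returned by `fake_fetch` are all made up for illustration.

```python
import re

def apply_remap(url, remaps):
    """Toy stand-in for --remap: apply the first matching (pattern, replacement) pair."""
    for pattern, replacement in remaps:
        if re.search(pattern, url):
            return re.sub(pattern, replacement, url, count=1)
    return url

def check(url, remaps, cache, fetch):
    """Key the cache by the URL that is actually requested, i.e. after remaps."""
    final_url = apply_remap(url, remaps)
    if final_url in cache:
        return cache[final_url], True  # (status, served_from_cache)
    status = fetch(final_url)
    cache[final_url] = status
    return status, False

def fake_fetch(url):
    # Hypothetical responses, chosen only so the two URLs are distinguishable.
    statuses = {"https://google.com/": 200, "https://foo.com/": 404}
    return statuses.get(url, 404)

remap = [("https://google.com", "https://foo.com")]
cache = {}
r1 = check("https://google.com/", [], cache, fake_fetch)     # requests google.com
r2 = check("https://google.com/", remap, cache, fake_fetch)  # requests foo.com, cache miss
r3 = check("https://google.com/", [], cache, fake_fetch)     # google.com entry still valid
r4 = check("https://google.com/", remap, cache, fake_fetch)  # foo.com entry reused
print(r1, r2, r3, r4)
```

Changing the remap changes the key, so an old entry is simply not found, and entries for both the original and the remapped URL can coexist in the same cache.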

I was just confused earlier. Let's go forward with your proposal.

mre avatar Sep 09 '25 10:09 mre