
[UX] Unclear how to use --base-url, --root-dir and --remap together to get a built folder checked

dkarlovi opened this issue 7 months ago

Ref #1606

Let's use this for an example:

mkdir -p public/assets public/download public/test1 public/test2
touch public/assets/app.css public/assets/app.js public/download/something.json public/test1/index.html

and the public/index.html file:

<html>
<head>
    <link rel="stylesheet" href="/sub/dir/another/assets/app.css">
    <link rel="stylesheet" href="https://example.com/sub/dir/another/assets/app.css">
</head>
<body>

<dl>
    <dt>index</dt>
    <dd><a href="/sub/dir/another">/sub/dir/another</a></dd>
    <dd><a href="/sub/dir/another/">/sub/dir/another/</a></dd>

    <dt>file</dt>
    <dd><a href="/sub/dir/another/download/something.json">/sub/dir/another/download/something.json</a></dd>

    <dt>homepage</dt>
    <dd><a href="/sub/dir/another/test1/">/sub/dir/another/test1/</a></dd>
    <dd><a href="/sub/dir/another/test2/">/sub/dir/another/test2/</a></dd>
</dl>

</body>
</html>

When I run Lychee like so

lychee --base-url https://example.com/sub/dir/another --root-dir `pwd`/public public/index.html --offline

I expect these things to happen:

  1. it checks everything (currently it excludes everything)
  2. it finds all the links are valid, except test2 because we didn't create an index.html file in it (I'm assuming I need to remap this?)
  3. it treats URLs prefixed with base-url as offline in this case (because they're equivalent to the others)

Am I missing something or should this already work as is? I'm using just-built master.

dkarlovi, May 27 '25

Hi! For the past week I've been interrogating lychee for this exact use case, and I agree it would be really nice to have some documentation for it!

In the meantime, I can offer some observations from what I've managed to piece together. Currently,

  • base-url is used as part of relative and absolute links, and
  • root-dir is used in absolute links only.

Specifically, when encountering a relative or absolute link, Lychee constructs the target URL in this way:

<a href="relative/1"> --> {parent(base-url)}/relative/1
<a href="/absolute/1"> --> {origin(base-url)}{root-dir}/absolute/1

where origin returns the origin of its argument, and parent returns the parent of the URL if it doesn't end in /, or the URL unchanged if it does.
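
Plugging this issue's values into those rules gives roughly the following (my reading of the observations above, not verified lychee output), with --base-url https://example.com/sub/dir/another and --root-dir $(pwd)/public:

<a href="relative/1">              --> https://example.com/sub/dir/relative/1
                                       (parent() drops "another" because base-url has no trailing /)
<a href="/sub/dir/another/test1/"> --> https://example.com$(pwd)/public/sub/dir/another/test1/
                                       (origin + root-dir + link, i.e. a remote URL with the local path embedded)

Either way, every link ends up as an https://example.com/... URL, which is why --offline then excludes all of them.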

I surmise that base-url is meant to emulate the links as if the local HTML file were accessed at the base-url address. However, this causes problems when checking multiple local HTML files: Lychee will assume every local file lives at that base-url and disregard the local folder structure when resolving relative links. If base-url is omitted, on the other hand, Lychee infers it from the file path, which leads to the desired behaviour.

Anyway, to check a directory $dir of local files, I have found that this almost works:

lychee $dir --root-dir $(realpath $dir) --remap "https://your-site\.github\.io file://$dir"

This will correctly resolve relative links by the folder structure and absolute links by the root-dir. It's just missing an awareness of index.html - I've submitted https://github.com/lycheeverse/lychee/pull/1777 to add this feature.
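
As a concrete sketch for this issue's public/ folder (untested, and assuming for a moment that the site were served at the domain root rather than under /sub/dir/another), the recipe would look like:

dir=public
lychee "$dir" --root-dir "$(realpath $dir)" --remap "https://example\.com file://$(realpath $dir)"

The subfolder case (/sub/dir/another) needs the extra fake-root step discussed further down.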

In reply to your numbered points, I would say

  1. All links are excluded because they are all being mapped to a remote https://example.com URL using the method described above. This can be seen by passing -v to Lychee.
  2. Historically, Lychee never attempted to find index files and took directory links as valid if the directory existed at all. This was very recently changed in https://github.com/lycheeverse/lychee/pull/1752 and my PR https://github.com/lycheeverse/lychee/pull/1777 adds a flag for configuring this.
  3. I think your expectation for base-url is the inverse of its actual behaviour. You expect URLs starting with base-url to be mapped to local files, but the actual behaviour is that local links are mapped to base-url. That said, I do think it would be useful to have a feature that maps certain remote URLs to local files. In the command above, this has to be done using the --remap argument.

All this is to say that I think the documentation around these flags is insufficient. In particular, it's missing this information about how they interact. I also think this use case is very common but it's missing clear documentation. If https://github.com/lycheeverse/lychee/pull/1777 is merged, I plan to add a "recipe" to the docs site about this use case.

Finally, these are only my observations and might be wrong. Also, this is not necessarily endorsing the current behaviour - I'm just writing about what it is at the moment.

katrinafyi, Aug 01 '25

@katrinafyi thank you so much for this writeup, it helps a lot because my understanding of the flags was so wrong! Arguably, this might mean the flags are not well named, so this might be seen as a UX bug (?)

I think your expectation for base-url is the inverse of its actual behaviour. You expect URLs starting with base-url to be mapped to local files, but the actual behaviour is local links are mapped to base-url.

Right, but since I've explicitly said my base URL is that, Lychee knows those base-url URLs are actually my local files and should be candidates to examine even in offline mode, no?

dkarlovi, Aug 01 '25

Maybe...? Lychee does have this information, but it doesn't currently use it in this way. I think Lychee's understanding of offline/online files is very simplistic at the moment, based only on file:// vs http[s]://.

For the time being, I think --remap would be the way to get this behaviour. Luckily, this only needs remapping of the beginning of URLs. Remapping the beginning is much more straightforward than remapping the end of URLs (like you would need for index files).
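
For this issue's example, that prefix remap might look roughly like this (untested sketch; it only covers the fully-qualified https://example.com/sub/dir/another/... links, while the site-root-relative ones are the separate subfolder problem discussed later):

lychee public/index.html --offline --root-dir "$(pwd)/public" \
  --remap "https://example\.com/sub/dir/another file://$(pwd)/public"
# e.g. https://example.com/sub/dir/another/assets/app.css should be rewritten to
# file://.../public/assets/app.css and checked as a local file despite --offline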

I'd agree that this is a UX and documentation problem :)

katrinafyi, Aug 01 '25

Maybe this use case needs more thought then, because, as you've noted yourself, it's IMO a very common use case to want to support.

For example:

  1. by adding --base-url, you've said what the files should resolve as, making all URLs which match it "offline"
  2. by adding --root-dir, you've explained how to resolve root-relative URLs, but this could arguably be omitted: if you pass a folder to examine, that folder is then the root dir by default
  3. by adding --index-files from your #1777, you're saying URLs pointing to local folders should work only if they contain one of those index files, and fail otherwise

So something like this

lychee --base-url https://example.com/dir/subdir --index-files index.html --offline public/ # note no --root-dir

should IMO be enough to get the desired behavior; it seems clear and robust to me, but I'm not sure how far Lychee currently is from that.

WDYT @katrinafyi @mre @thomas-zahner?

dkarlovi, Aug 01 '25

Hi again @dkarlovi, I think something like that would be really good! I'll try to re-phrase the behaviour to check my understanding and anticipate some implementation details. Let me know if this makes sense.

As always, I will say that I, personally, have no opinion about the default behaviour nor the naming of flags. To keep things simple, I'll use the flag names --new-base-url and --new-root-dir to differentiate from the current flags.

--new-root-dir will take a "root dir" argument which is a local directory, and --new-base-url will take a "base URL" (which might be a file:// URL). The big idea is that Lychee should resolve all links as if the root directory was uploaded and available at the base URL.

If we want to do this correctly and consistently in all cases, there are a few things to consider. I think it should work like this:

  • (1) When collecting links in local files: We first compute the file's "remote URL" by taking its local path relative to root-dir and joining this with base-url, then we use this remote URL to resolve relative links within the file.
    • (1.1) If the resolved link is a subpath of base-url, then we slice off base-url and re-base the link onto root-dir to obtain the local file to use.
    • (1.2) Otherwise, we take the resolved link as is and it will refer to a remote webpage somewhere outside of base-url.
  • (2) When collecting links from remote pages, we can use the remote page's URL as is and resolve links relative to this URL.
    • (2.1) If the resolved link is a subpath of base-url, then we slice off base-url and re-base the link onto root-dir to obtain the local file to use.
    • (2.2) Otherwise, we take the resolved link as is and it will refer to a remote webpage somewhere outside of base-url.
    • (2.3) Yes, (2.1) and (2.2) are identical to the (1.) cases. Having separate numbers will become useful later.

If it works, this outline is nice because most of the logic is the same regardless of where links are collected from. We just have to do the extra step of resolving the file's remote URL for links collected from local files (1); a small shell sketch of this follows below. Then, the base-url detection applies uniformly to all links.
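
To make that re-basing arithmetic concrete, here is a tiny shell sketch of cases (1) and (1.1) using this issue's example values. It's just string manipulation illustrating the rule, not lychee code, and the flag names are the hypothetical ones from above:

# inputs (hypothetical flags)
base='https://example.com/sub/dir/another'   # would be --new-base-url
root="$(pwd)/public"                         # would be --new-root-dir

# (1) remote URL of a local file: its path relative to root, joined onto base
file="$root/test1/index.html"
remote="$base/${file#"$root/"}"              # -> https://example.com/sub/dir/another/test1/index.html

# (1.1) a resolved link under base is re-based onto root to find the local file
link='https://example.com/sub/dir/another/download/something.json'
local_path="$root/${link#"$base/"}"          # -> .../public/download/something.json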

Implementation details:

  • What to do about path traversal like ../root-dir, which might traverse outside of root-dir and then re-enter it. Does this matter? Tbh idk if Lychee currently handles this sensibly anyway.
  • If a file:// URL is used as a base-url, this will need special handling. Because file:// URLs do not have a meaningful domain, a basic implementation would resolve any /-prefixed link to the root of the filesystem. For file:// base URLs, we probably want / to resolve to the base-url instead.

But what about remaps? — By throwing together the current behaviour of the flags and some remaps, we can certainly get most of the way. To try and mimic the behaviour of --new-base-url https://example.com/project1/ and --new-root-dir $dir, you would have to make a fake root and move $dir inside that at the correct subpath. Something like this:

mkdir root
cp -r $dir root/project1
lychee ./root --root-dir $(pwd)/root --remap "https://example\.com/project1 file://$(pwd)/root/project1"

By my reasoning, this will match the behaviour in cases (1.1), (2.1), and (2.2). The problem is with (1.2). If you had a link like /index.html inside $dir, Lychee would currently resolve that to $(pwd)/root/index.html which doesn't exist. Following the --new-base-url rules, this should actually become https://example.com/index.html. Remaps cannot handle this because it needs to match all files within $(pwd)/root which are not within $(pwd)/root/project1. The remap regex does not support this kind of negative lookahead (and even if it did, I think that would be unpleasant for other reasons).

What about the current --base-url? — I think I've talked about this enough, but the current behaviour of --base-url is not useful and should be avoided if you care at all about the directory structure -- and we do care about the directory structure. Honestly, reading the docs leaves me quite confused. The text and example commands (especially those checking **/*.html) are not consistent with the current --base-url behaviour I observe. I can't tell what's happened here. Is it a regression? Was the flag only ever checked with a single file as input? Or are the docs simply wrong?

Anyway, sorry for the long post. That's just what I would want to see from my ideal link checker.

katrinafyi, Aug 09 '25

@katrinafyi thank you very much for this detailed clarification, it's super helpful to have it all in one place to reason about:

What to do for path traversal like ../root-dir which might traverse outside of root-dir, then re-enter root-dir.

If we can reasonably resolve them into an absolute path without looking at the filesystem, it might make sense, but IMO the value here is very limited (since you can reason about a path outside the root dir in only an extremely limited way). I'd have no issue with these either being treated as out of scope or throwing errors (at least at first, until the use case is clarified via specific examples).

I think that the current behaviour of --base-url is unuseful and it should be avoided if you care at all about the directory structure

That's very unfortunate because it means the feature (which is IMO the basic requirement for offline / filesystem checks) is unusable. Would it be viable to detect that we're doing this specific use case and special-case the flag to opt in to the new, correct behavior?

dkarlovi, Aug 11 '25

the feature (which is IMO the basic requirement for offline / filesystem checks) is unusable.

Yeah it is unfortunate. That said, the root-dir + remap command does get you most of the way. We use that in our CI because we have no links which fall into the unsupported case - but that's really just a lucky coincidence. For other situations, I can imagine it would be difficult to use Lychee as it is right now.

Idk about detecting cases and changing the behaviour, even if it is to something "more" correct. Personally I would find it a little surprising.

Edit: maybe it makes sense to change behaviour when base-url is simultaneously given with root-dir?? That would require reinterpreting root-dir as a local filesystem directory rather than a remote path segment.

As long as the Lychee developers decide the feature is something they would want, I think the mechanism for enabling the behaviour can be whatever it needs to be. It can be decided down the line.

katrinafyi, Aug 11 '25

Edit: maybe it makes sense to change behaviour when base-url is simultaneously given with root-dir?

Yes, that's fuzzily the idea I had in mind: you're saying you're doing offline (filesystem-based) checking, so Lychee could switch to a "mode" which better supports that. I don't know how viable that is or how much such a mode differs from what happens currently; it might not even be necessary if --base-url has as many issues as you've concluded.

dkarlovi, Aug 11 '25

What is actually the purpose of checking links online when you have the files locally? It seems like unnecessary overhead and server load to me. Whether deployment/uploads happened correctly could be checked more easily than by checking each and every internal link.

So IMO --root-dir works pretty well the way it does, checking all internal links without explicit scheme://host locally, now also with configurable index files and fragment checking. And --base-url remains optional, to check those all online or on any remote host/with a custom scheme.

Relative local links are already resolved correctly within the directory structure, no need to use --base-url for this.

Since directories cannot be an input, there is not really a default possible for --root-dir. The current PWD maybe 🤔. One could think about allowing local dirs as input, but often there are not only HTML and/or Markdown files but also assets (images, fonts, CSS, JavaScript, ...) etc., which one does not want to check. And adding more logic or another option to filter those files/extensions seems to duplicate what one can already do by passing all needed file types with a recursive asterisk like build/**/*.html.

--offline is there to enforce everything being checked offline, though I'm not sure what it does with full-scheme URLs, whether it just skips them? Its purpose is, however, somewhat the opposite of what --base-url is commonly used for; both are more edge cases. I'd always prefer to check internal URLs locally and external URLs online, i.e. every URL in the most efficient way. We even compose our links like that: internal URLs without scheme are used only for those which can be resolved locally/within the same repo. For any other internal links which point to files in another repo or are handled by some proxy etc., even if on the same host, we use full-scheme URLs. So when building the (part of the) website locally, all links always work, and internal ones without scheme are always resolved locally.

And --remap is a very special thing for some rare workarounds which cannot be covered with the above combinations, like when some actually-internal links in the local files for whatever reason use full-scheme URLs, but you can and want to check them within the local dir structure. Or you have some remote URL(s) as input, but can and want to check their internal links locally. Or you need to use a proxy to bypass some rate limiting or ISP/firewall restrictions on the test system. EDIT: Or you have some internal links without scheme which nevertheless cannot be resolved within the local dir structure. But as said, to make the local build functional at all, I'd consistently use full-scheme URLs for those.
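
A rough sketch of that division of labour for a build/ output folder (untested; the host and paths are placeholders):

# internal links (no scheme://host) are checked against the local build output
lychee --offline --root-dir "$(pwd)/build" 'build/**/*.html'

# full-scheme links that are really internal can be remapped to local files as a workaround
lychee --offline --root-dir "$(pwd)/build" \
  --remap "https://example\.com file://$(pwd)/build" 'build/**/*.html'

# or resolve internal links against the live site and check them online instead
# (see the earlier caveats about --base-url and relative links across multiple files)
lychee --base-url https://example.com/ 'build/**/*.html'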

What is bad or unclear about these options?

EDIT: Okay I think I understand the issue now: https://github.com/lycheeverse/lychee/pull/1912#issuecomment-3649598964

I can only think of one reasonable behaviour for those two settings, the way I explained it in the PR: --base-url needs to be relative to --root-dir, basically replacing the file://... URL that is generated from the root dir with the remote URL.

MichaIng, Dec 13 '25

Participants in this thread may be interested in https://github.com/lycheeverse/lycheeverse.github.io/pull/140, which should document how to use lychee for this use case. It is a bit complicated (especially the subfolder case), but hopefully it is understandable.

Reviews and comments would be appreciated :)

katrinafyi, Dec 14 '25

Continuing discussion from #1912

@katrinafyi Oh okay, you thought about it the opposite way round. Well, the way it is used currently (even though it works for relative URLs only), and the way I remember the discussions when the original single --base option was split into --base-url and --root-dir, it really was meant to map internal URLs to an external domain, i.e. to check everything online.

As mentioned, I personally would just consistently use internal URLs (without scheme://host) in all links pointing to pages within the same repo, but of course sometimes this is not fully under the maintainer's control. So I agree that mapping a specific host to be resolved locally wouldn't be bad as a dedicated option either. But it should be a new option, instead of inverting the (sort of) intended behaviour of --base-url. So we'd have two options, for mapping a fixed remote host to the local dir structure and vice versa, plus --remap for finer-grained mappings.

@mre Since my memory is like a Swiss cheese at times, can you verify that --base-url was supposed to provide the target base URL to resolve to, instead of a source to map to local URLs? And would you agree to have a dedicated option to map the opposite way round? We'd really need to think about a sane name then 😅. --base-source-url 🙄. I think with the related --remap values as an equivalent, it should be easy to implement, or could even be translated internally into --remap options.

They could even be used in combination, just applied in the order the options are given, so that later-given options (including --remap) take precedence over earlier ones.

MichaIng, Dec 14 '25

the way it is used currently (even that it works for relative URLs only), ..., it really was meant to make internal URLs to an external domain, i.e. to check everything online.

Yeah, hmm. I wasn't around for those discussions so I don't know the use case. What is the use case? To me, it doesn't make sense to discover links within local files and then check them online. Won't there be problems with relative links to files that aren't online yet?

In any case, it should be easy to implement a feature that works in both directions, depending on the user's choice.

Maybe the flag could be named --url-map. It could use syntax like --remap and the direction would be chosen by the order of the two paths. This seems intuitive because the feature is essentially like remap but using the URL structure instead of a regex. It also emphasises that the local path and remote URL are paired together.
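
Purely as a strawman for that syntax (no such flag exists today; the names and values are made up):

# hypothetical --url-map flag: direction chosen by the order of the two values
lychee 'build/**/*.html' --url-map "https://example.com/project1 $(pwd)/build"   # remote prefix -> local dir
lychee 'build/**/*.html' --url-map "$(pwd)/build https://example.com/project1"   # local dir -> remote prefix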

katrinafyi, Dec 14 '25

it doesn't make sense to discover links within local files then check them online

I also don't use it, agreed in this regard, but I guess there is some use case if you want to ensure things work on the actual production webserver, or if the local repo contains only a part of the whole website and does not cover all local links. We also do not need the opposite feature, since all links which point to pages from the same repo are internal links without scheme://host, and we do not scan anything of our live websites as input either, hence --root-dir does all we need. So for me, both are needed only when there is no full control over all pages/links you want to test as target or origin.

If the new option required more than a single URL prefix as input, like a direction, then I actually do not see the benefit compared to just promoting/describing the --remap feature properly, which is not only named very similarly (to --url-map), but can do exactly the same things in both directions with 2 values. A new option IMO only makes sense if it can work with a single value, to make it the clear counterpart to --base-url. But we'd better get clarification on whether that one was really intended to work as I thought. Definitely possible that I am wrong 😅.

MichaIng, Dec 14 '25

We also do not need the opposite feature,

... is needed only when there is no full control of all pages/links

There are certain cases where internal links have to be fully-qualified including scheme://host, like in https://github.com/lycheeverse/lychee/issues/1918.

If you haven't read it yet, I try to motivate the use case in https://github.com/lycheeverse/lycheeverse.github.io/pull/140. I also mention the inconveniences and caveats of using root-dir and remap for this use case.

More broadly, I would not want things which are conceptually equivalent (when the site is deployed) to dictate link checking behaviour. Users shouldn't have to think about lychee when writing their HTML.

If the new option would require more than only a single URL

I imagined --url-map would take a URL prefix and a local directory path. With your terminology, I think these would be given separately as --base-source-url and --base-url, or maybe one of them is --root-dir. In any case, there are always two things which are needed: a remote URL and a local path.

But we should better get clarification

Agree!

katrinafyi, Dec 14 '25