lychee icon indicating copy to clipboard operation
lychee copied to clipboard

[Feature] allowlist of URL patterns

Open dkarlovi opened this issue 2 years ago • 8 comments

Expanding on https://github.com/lycheeverse/lychee/issues/452#issuecomment-1016220435, we just had a case where an URL was (by accident) generated as a full URL, something like

file://localhost/home/me/project/public/thing/index.html

This was caught by Lychee locally, but not on CI, not sure yet what's different.

In general, my proposal is to:

  1. treat all URLs without a hostname as if they're on the --base (and treat it as an offline link), as suggested in #452
  2. allow users to specify which URL patterns they allow / expect, so they can keep control of for example outgoing links, etc.

This would allow for catching simple errors link this, but also allow for checking what external links are used (don't link to shady sites), sort of link "link hygiene").

dkarlovi avatar Jan 31 '22 14:01 dkarlovi

This was caught by Lychee locally, but not on CI, not sure yet what's different.

Did you set a base in any of the two environments? Did you run lychee from the same base directory? You can try passing --verbose to see if the link gets detected.

Added some comments to https://github.com/lycheeverse/lychee/issues/452. About your second point, I like the idea in general. I wonder how often this is needed, though and how to prioritize this. We could have a --block <pattern> option, which immediately exits with an error if a link matching pattern is encountered. Similarly we could support --allow with the opposite effect. However, could both occur at the same time? How would these be handled? Would the order matter?

mre avatar Jan 31 '22 22:01 mre

Did you set a base in any of the two environments?

I did, it's hacking around the missing --root argument like shown in https://github.com/lycheeverse/lychee/issues/452#issuecomment-1016220435. Don't worry about that part, it's probably PEBKAC. :smile: I just wanted to give some context to the feature suggestion.

However, could both occur at the same time? How would these be handled? Would the order matter?

IMO, the simplest way is to have a single list with both allow and deny patterns (since they're patterns, they can be negated, think .gitignore). The order in the list is then significant, first match wins and allows the user to set it up in any way they see fit. The question is only if the list is "allow" or "deny" by default, which is more a question what the users are likely to want to do more.

dkarlovi avatar Feb 01 '22 08:02 dkarlovi

We have a .lycheeignore already for excluding patterns. Initially I thought about adding support for tags like DENY to this file, but I think it's a bad idea, because these tags could be part of a real pattern. Alternatively we could have a [deny] and an [allow] section in lychee.conf. My fear is that people would be confused about what comes first. Don't want to add .lycheedeny or another file to the mix to be honest. It would be hard to keep track of special files. Other ideas?

mre avatar Feb 01 '22 11:02 mre

If we go with a single list, I would vote for allow because --deny !foo bar can be confusing whereas --allow foo !bar is easier to parse for me.

mre avatar Feb 01 '22 11:02 mre

Before adding anything, though, I would like to know if this is a common enough use-case to warrant a new option.

mre avatar Feb 01 '22 11:02 mre

Sure, this might be a niche option indeed, but it does make sense to me since you might short-circuit some / all external HTTP requests because they're disallowed by your setup. It's like a difference between XML being valid, syntax vs semantics. Just because the link points to a thing which exists doesn't mean I want it in my docs. IMO it fits nicely with the tool goals, at least IIUC.

Alternatively we could have a [deny] and an [allow] section in lychee.conf. My fear is that people would be confused about what comes first.

Apache solved this with saying which rules are evaluated first with Order: Allow, Deny or vice versa. I'm not a big fan and IIRC it's deprecated now.

Don't want to add .lycheedeny or another file to the mix to be honest.

Agreed, from my POV most (all?) config should be passable via the command line because I might want to override something in the config file (for example, commit config for prod and dev, need an override on CI / test).

Since allow and deny are the same thing from a different POV (meaning, allow http://localhost is the same as deny all except http://localhost, I still think a unified list is a way to go, especially since then the user is able to solve really complex use cases just by reordering the items in the list (first match wins) and it reduces the number of things the user needs to understand.

Maybe your idea of adding tags to the existing file is the way to go, it's just a matter of finding tokens / syntax which can be safely used.

dkarlovi avatar Feb 01 '22 14:02 dkarlovi

As a temporary workaround, you could pass --dump and grep the output for the forbidden patterns. grep's exit code would indicate if there were any disallowed URIs.

mre avatar Feb 01 '22 16:02 mre

@mre that's actually a really nice idea! :100:

dkarlovi avatar Feb 01 '22 17:02 dkarlovi