Check fragments for remote URLs only for certain MIME types

Open MichaIng opened this issue 6 months ago • 2 comments

Currently, with fragment checking enabled, lychee tries to download every remote URL completely into RAM and passes the result to the fragment checker as if it were an HTML document. This causes unnecessary traffic and memory usage, and the fragment checker often fails on binary files.

#1733 aims to solve this for most cases by skipping fragment checking when the URL has no non-empty fragment. However, it cannot be ruled out that fragments are handled server-side, so valid URLs may carry intentional fragments while pointing to non-HTML files. For such cases, it would be great to additionally check the Content-Type of the HTTP response and invoke the fragment checker only if it can actually handle that type.

Additional condition to start with: https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/checker/website.rs#L100

response.headers().get("content-type").and_then(|v| v.to_str().ok()).is_some_and(|v| v.starts_with("text/html"))

text/markdown could be supported as well, by setting the file type for the fragment checker to file_type: crate::FileType::Markdown accordingly.

Since text/markdown does not seem to be widely used (it is not served automatically for .md file extensions by the latest Apache2, nor by GitHub), we could additionally accept text/plain if the URL path ends with .md.
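The rules proposed above could be collected in one small helper. The following is only a sketch under assumptions: `FragmentFileType` and `fragment_file_type` are hypothetical names that do not exist in lychee, the enum merely stands in for `crate::FileType`, and the Content-Type value is assumed to have already been read from the response headers as a string. Unlike the `starts_with` check above, it compares the media type exactly after stripping parameters such as `; charset=utf-8`.

```rust
/// Hypothetical stand-in for lychee's `FileType`, reduced to the two
/// variants the fragment checker is assumed to handle here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FragmentFileType {
    Html,
    Markdown,
}

/// Decide whether a remote response should be passed to the fragment
/// checker, and as which file type. Returns `None` when the fragment
/// check should be skipped.
///
/// `content_type` is the raw Content-Type header value, if present.
/// `url_path` is the path component of the requested URL.
pub fn fragment_file_type(
    content_type: Option<&str>,
    url_path: &str,
) -> Option<FragmentFileType> {
    // Drop parameters such as "; charset=utf-8" and normalise the case,
    // since media types are case-insensitive.
    let media_type = content_type?
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    match media_type.as_str() {
        "text/html" => Some(FragmentFileType::Html),
        "text/markdown" => Some(FragmentFileType::Markdown),
        // Many servers deliver Markdown as text/plain, so fall back to
        // the file extension in that case only.
        "text/plain" if url_path.to_ascii_lowercase().ends_with(".md") => {
            Some(FragmentFileType::Markdown)
        }
        // Anything else (binary files, JSON, a missing header, ...) is
        // not handed to the fragment checker.
        _ => None,
    }
}
```

With this shape, a caller would skip the fragment check whenever the helper returns `None`, for example for `application/pdf` or when the server sends no Content-Type at all.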

MichaIng, Jun 18 '25 15:06

Sounds like a plan. Want to create a PR for that? The issue description could serve as a nice comment for documenting the behavior.

mre, Jun 20 '25 14:06

> Want to create a PR for that?

After #1733 has been merged, because it affects the same code sections and tests.

MichaIng, Jun 20 '25 14:06