Check fragments for remote URLs only for certain MIME types
Currently, with fragment checking enabled, lychee tries to download every remote URL completely into RAM and passes it to the fragment checker as if it were an HTML document. This causes unnecessary traffic and memory usage, and often failures in the fragment checker for binary files.
#1733 aims to solve this for most cases by skipping fragment checking if the URL contains no non-empty fragment. However, it cannot be ruled out that fragments are handled server-side, so there may be valid URLs with intentional fragments pointing to non-HTML files. For such cases, it would be great to additionally check the Content-Type of the HTTP response and invoke the fragment checker only if it can actually handle that type.
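To illustrate the "no non-empty fragment" condition, here is a minimal sketch using only the standard library. The function name `has_non_empty_fragment` is hypothetical, and the sketch works on a plain string rather than a parsed URL type, so it is not necessarily how #1733 implements the check.

```rust
/// Returns true if the URL string carries a fragment with at least one
/// character after the `#`. Hypothetical helper for illustration only;
/// it does not parse or validate the URL.
pub fn has_non_empty_fragment(url: &str) -> bool {
    // Everything after the first `#` is the fragment.
    url.split_once('#')
        .is_some_and(|(_, fragment)| !fragment.is_empty())
}
```

With this check in place, a URL such as `https://example.com/page#` would be treated like one without any fragment, and the body would not need to be downloaded for fragment checking.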
Additional condition to start with: https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/checker/website.rs#L100
response.headers().get("content-type").is_some_and(|x| x.as_bytes().starts_with(b"text/html"))
text/markdown would also be possible, with the file type passed to the fragment checker adjusted to file_type: crate::FileType::Markdown accordingly.
Since text/markdown does not seem to be widely used (neither the latest Apache2 nor GitHub serves it automatically for the .md file extension), we could additionally accept text/plain if the URL path ends with .md.
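The decision described above could be sketched roughly as follows. This is only an illustration under assumptions: `FragmentFileType` and `fragment_file_type` are hypothetical names standing in for lychee's own `FileType` and whatever function ends up holding the logic, and the header value is assumed to have already been converted to a string.

```rust
/// Document kinds a fragment checker could handle.
/// Hypothetical stand-in for lychee's `FileType`.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum FragmentFileType {
    Html,
    Markdown,
}

/// Decide whether fragments should be checked, and as which file type,
/// from the `Content-Type` header value and the URL path.
/// Returns `None` if the fragment checker should be skipped.
pub fn fragment_file_type(
    content_type: Option<&str>,
    url_path: &str,
) -> Option<FragmentFileType> {
    // Drop parameters such as "; charset=utf-8" and normalise case,
    // since media type names are case-insensitive.
    let mime = content_type?
        .split(';')
        .next()?
        .trim()
        .to_ascii_lowercase();

    match mime.as_str() {
        "text/html" => Some(FragmentFileType::Html),
        "text/markdown" => Some(FragmentFileType::Markdown),
        // Fallback: many servers deliver Markdown files as text/plain.
        "text/plain" if url_path.to_ascii_lowercase().ends_with(".md") => {
            Some(FragmentFileType::Markdown)
        }
        // Anything else (PDF, images, archives, ...) is not checked.
        _ => None,
    }
}
```

For example, `fragment_file_type(Some("text/plain"), "/docs/README.md")` would select the Markdown checker, while a response with `application/pdf` would skip fragment checking altogether. Matching on the exact media type instead of a prefix avoids accepting unrelated types that merely start with the same characters.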
Sounds like a plan. Want to create a PR for that? The issue description could serve as a nice comment for documenting the behavior.
> Want to create a PR for that?
Yes, after #1733 has been merged, since it affects the same code sections and tests.