Add flag to force html mode
This is an attempt to address #671 by adding a flag --html to parse
the input as HTML. Otherwise, STDIN and local files without the .html
suffix are parsed as plain text.
Thanks for pushing this forward @nacnudus.
Some questions:
- What if a user wants to check different file types in a single run? E.g. one Markdown file and one HTML file. Would the
--htmlflag always override the file extension? - Are there any other tools that have a similar feature? I know that ripgrep has a
--typeoption, which allows a user to filter input files by type (but not parse/interpret the input files differently). That's the closest I could find. (A precedent would help us make the best design decision for lychee.)
We could also change the option to --input-type <format>, which is a bit more flexible. E.g. --input-type markdown would enforce the Markdown parser instead. It still would not work for different input types, though, but that might be fine.
Alternatively, we could just add a section to the TROUBLESHOOTING.md file with a workaround on how to enforce the correct file type by storing the input to a temporary file with the html extension. Not pretty, but we avoid maintaining the option.
Looking forward to your thoughts.
Hi, thanks for your reply, and apologies for my slowness responding.
- What if a user wants to check different file types in a single run? E.g. one Markdown file and one HTML file. Would the --html flag always override the file extension?
Yes, --html would always override the file extension. I really have unix pipes in mind, which don't have file extensions.
- Are there any other tools that have a similar feature?
Imagemagick requires an image format specifier to read an APNG image sequence, otherwise it assumes that it is PNG and reads only the first frame.
We could also change the option to
--input-type <format>
I agree, that would be better.
Alternatively, we could just add a section to the TROUBLESHOOTING.md file with a workaround on how to enforce the correct file type by storing the input to a temporary file with the html extension.
That wouldn't work for STDIN, but perhaps lychee isn't the best tool for that any, because pipes can't benefit from lychee's multithreading, as far as I know.
currently we relies on extension to determine the file format, and non supported format are treated as plaintext where we simply grab urls.
this process is VERY inefficient due to the underlying implementation of linkify.
instead, I think we can TRY to parse plaintext as html first, should that fail, we fallback to linkify.
this is helpful given the most likely use case of lychee is to validate links in html rather than any other formats.
this is helpful given the most likely use case of lychee is to validate links in html rather than any other formats.
I'm not sure if that's actually true. Most users use lychee-action and I wouldn't be surprised if the majority of links are checked in Markdown files.
this process is VERY inefficient due to the underlying implementation of linkify.
Not sure if parsing HTML is faster, either. We'd have to run a benchmark to see if trying HTML first would give us any significant performance gains.
plain text file are text/plain and markdown files are text/markdown so
I think we can TRY to parse plaintext as html first
doesn't apply to mark down files and as such
I wouldn't be surprised if the majority of links are checked in Markdown files
this is irrelevant.
Not sure if parsing HTML is faster, either. We'd have to run a benchmark to see if trying HTML first would give us any significant performance gains.
Parsing HTML is definitely faster than using linkify on normal sized html files. This is because a parsed html file is a structured data and links will only appear in certain tags.
My preference would be a ripgrep/fd style type option, specifically a --default-type option which allows the user to override the behaviour for a file without an extension, e.g. --default-type html.
I like that.