Treat llms.txt and llms-full.txt as markdown
These files are generated for LLMs, and while they don't have to be Markdown, nearly all the examples I've seen are. I'd like to scan them for broken links, because LLMs hallucinate, but to do that we need to ignore things like links inside code blocks, which lychee already does today for files it recognizes as Markdown.
I believe that the needed change is to special-case these two file names as known markdown.
Example Docs: https://llmstxt.org
It sounds tempting, but if you think about it, that will create a slippery slope of unexpected behavior.
Say someone happens to store research links on LLMs in a plaintext file called llms.txt. They would be surprised if that file was suddenly handled as a Markdown file. While that scenario is highly unlikely, special cases tend to cause problems like this down the line, and we'd be in a position where keeping the behavior predictable in the first place would have been the better strategy.
I'm surprised that they decided to use txt as the format when their examples clearly show Markdown syntax. That's probably to follow the naming pattern of other well-known files like robots.txt, but it's still an odd decision. They even mention sitemap.xml in the docs. There is an open issue about renaming the file to llms.md, which I believe would be a good place to leave a comment about the surprising behavior in combination with link checkers like lychee.
We might consider adding an --override-extension parameter, which takes a regex (or a glob?) and attempts to change the file extension on the fly. E.g., --override-extension llms.txt:md.
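To make the idea a bit more concrete, here is a rough sketch of how such a pattern:extension mapping could be resolved. This is not lychee code and not an existing lychee feature: the function names parse_override and effective_extension are hypothetical, and I'm assuming glob matching against the file name only (a regex variant would look similar).

```python
from fnmatch import fnmatchcase
from pathlib import PurePath


def parse_override(spec: str) -> tuple[str, str]:
    """Split a hypothetical 'PATTERN:EXT' spec such as 'llms.txt:md'."""
    pattern, sep, ext = spec.rpartition(":")
    if not sep or not pattern or not ext:
        raise ValueError(f"expected PATTERN:EXT, got {spec!r}")
    return pattern, ext.lstrip(".")


def effective_extension(path: str, overrides: list[tuple[str, str]]) -> str:
    """Return the extension a checker would treat the file as having.

    Assumption: patterns are globs matched against the file name only,
    and the first matching override wins.
    """
    name = PurePath(path).name
    for pattern, ext in overrides:
        if fnmatchcase(name, pattern):
            return ext
    # No override matched: fall back to the real extension.
    return PurePath(path).suffix.lstrip(".")


overrides = [parse_override("llms*.txt:md")]
print(effective_extension("docs/llms.txt", overrides))
print(effective_extension("docs/notes.txt", overrides))
```

The point of making the override explicit is that a plain llms.txt keeps being treated as plaintext unless the user opts in, so the default behavior stays predictable.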
As a workaround, you could run lychee twice for now: once excluding the llms.txt file, and a second time piping llms.txt in from stdin and explicitly setting the extension to md:
lychee --exclude-path llms.txt .
cat llms.txt | lychee --default-extension md -
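If you want to cover both llms.txt and llms-full.txt in one go, the two-pass workaround could be wrapped in a small script along these lines. This is only a sketch: build_commands and run_workaround are made-up helper names, it assumes lychee is on your PATH, and it assumes --exclude-path can be passed more than once.

```python
import subprocess

# Files that carry Markdown content despite the .txt extension.
MARKDOWN_LIKE = ["llms.txt", "llms-full.txt"]


def build_commands(root=".", files=MARKDOWN_LIKE):
    """Build the argument lists for both passes without running anything."""
    # Pass 1: check everything except the Markdown-like .txt files.
    # Assumption: --exclude-path may be repeated, once per file.
    first = ["lychee"]
    for name in files:
        first += ["--exclude-path", name]
    first.append(root)
    # Pass 2: one run per file, read from stdin and treated as Markdown.
    stdin_runs = [(name, ["lychee", "--default-extension", "md", "-"]) for name in files]
    return first, stdin_runs


def run_workaround(root="."):
    """Run both passes; return True only if every lychee run exits with 0."""
    first, stdin_runs = build_commands(root)
    ok = subprocess.run(first).returncode == 0
    for name, cmd in stdin_runs:
        with open(name, "rb") as handle:
            ok = subprocess.run(cmd, stdin=handle).returncode == 0 and ok
    return ok
```

Calling run_workaround(".") from the repository root runs all passes and reports whether any of them found a problem, which makes it easy to use as a single CI step.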
> While that scenario is highly unlikely, special cases tend to cause problems like this down the line

Fair

> I'm surprised that they decided to use txt as the format when their examples clearly show Markdown syntax.

100% agree

> We might consider adding an --override-extension parameter, which takes a regex (or a glob?) and attempts to change the file extension on the fly. E.g., --override-extension llms.txt:md.

Agreed that this would be a nice way to implement this

> As a workaround, you could run lychee twice for now

👍