lychee
include more information in the JSON output
Today, the JSON output only prints detailed information about the links that failed:
{
  "total": 68,
  "successful": 67,
  "failures": 1,
  "timeouts": 0,
  "redirects": 0,
  "excludes": 0,
  "errors": 0,
  "fail_map": {
    "docs/README.md": [
      {
        "url": "https://semaphoreci.com/ramitsurana/awesome-kubernetes",
        "status": "Failed: HTTP status client error (404 Not Found) for url (https://semaphoreci.com/ramitsurana/awesome-kubernetes)"
      }
    ]
  }
}
It would be better if the JSON output also included detailed information about the other links ("successful", "timeouts", etc.).
I'm not a Rust developer, so before I try to figure out how I can help with that: is this something that would be accepted if I open a PR?
Storing the status of successful links would consume more memory.
Probably only in combination with the --verbose/-v flag?
Thanks for the feature request @eyarz.
We can add that, although as @lebensterben said, it would increase memory usage, and for bigger websites and recursive requests the increase could be quite significant.
We could store it in verbose mode; however, I wonder if excluded or redirected URLs should also be added then.
I was wondering if we want to add multiple levels of verbosity at some point (like -vvv) and only include it on the highest level.
About the implementation, we'd have to create a success field in the ResponseStats struct here:
https://github.com/lycheeverse/lychee/blob/7e497723cb27252008d63892018987dcebbfe995/lychee-bin/src/stats.rs#L25-L35
save the URL in add here:
https://github.com/lycheeverse/lychee/blob/7e497723cb27252008d63892018987dcebbfe995/lychee-bin/src/stats.rs#L43
and skip the field's serialization with a serde attribute like this if the verbosity isn't high:
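// Use a function to decide whether the field should be skipped.
// serde expects the path to a fn(&T) -> bool here, so the list is simply
// left empty (and therefore skipped) unless the verbosity is high.
#[serde(skip_serializing_if = "Vec::is_empty")]
success: Vec<(String, String)>,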
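The bookkeeping described above could look roughly like the following. This is only a minimal sketch using the standard library: the struct is a simplified stand-in for ResponseStats, and the Status enum and the success_map and detailed names are hypothetical, not lychee's actual definitions. The serde part is left out; successful URLs are stored only when detailed stats are requested, so memory use stays unchanged otherwise.

```rust
use std::collections::HashMap;

/// Simplified outcome of a link check (hypothetical, not lychee's own type).
#[derive(Debug, Clone, PartialEq)]
pub enum Status {
    Ok,
    Failed(String),
}

/// Simplified stand-in for `ResponseStats`; field names are assumptions.
#[derive(Debug, Default)]
pub struct ResponseStats {
    pub total: usize,
    pub successful: usize,
    pub failures: usize,
    /// Input source -> failed URLs with their error message.
    pub fail_map: HashMap<String, Vec<(String, String)>>,
    /// Input source -> successful URLs; only filled when `detailed` is set.
    pub success_map: HashMap<String, Vec<String>>,
    /// Whether to keep successful URLs (for example in verbose mode).
    pub detailed: bool,
}

impl ResponseStats {
    pub fn new(detailed: bool) -> Self {
        Self { detailed, ..Self::default() }
    }

    /// Record one checked URL that was found in `source`.
    pub fn add(&mut self, source: &str, url: &str, status: Status) {
        self.total += 1;
        match status {
            Status::Ok => {
                self.successful += 1;
                // Only pay the memory cost when detailed output was requested.
                if self.detailed {
                    self.success_map
                        .entry(source.to_string())
                        .or_default()
                        .push(url.to_string());
                }
            }
            Status::Failed(reason) => {
                self.failures += 1;
                self.fail_map
                    .entry(source.to_string())
                    .or_default()
                    .push((url.to_string(), reason));
            }
        }
    }
}
```

With detailed set to false the counters still increase, but success_map stays empty, which is what makes an "is empty" check a natural condition for skipping the field during serialization.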
In my opinion, everything should be included in the JSON output (including redirects). @mre, I like the idea of multiple levels - this way, users can decide whether they want to "risk" high memory usage.
BTW, my use case is that we are using lychee with this Awesome Kubernetes project and I want to run another validation (that GitHub projects are not archived) on the passing links. Without some way to get the passing links from lychee, I need to extract them myself again (which makes no sense in this use case).
Did you see #271? It might be an alternative which could cover more use cases. In your situation you could filter the GitHub links with grep then and check if they are archived. Lychee exits with a non-zero exit code in case of errors, in which case your CI would fail as expected. So in the end it would be something like
lychee --output raw | grep github.com | <check github link>
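For the filtering step in that pipeline, a small sketch in Rust, assuming the raw output contains one URL per line (an assumption about a format that #271 only proposes). The function names github_repo and github_repos are made up for illustration, and the actual "is it archived" check is left out.

```rust
/// Extract `owner/repo` from a GitHub repository URL, or `None` for anything else.
pub fn github_repo(url: &str) -> Option<String> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let rest = rest.strip_prefix("www.").unwrap_or(rest);
    let path = rest.strip_prefix("github.com/")?;
    // Drop any query string or fragment before splitting the path.
    let path = path
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    let mut parts = path.split('/').filter(|p| !p.is_empty());
    let owner = parts.next()?;
    let repo = parts.next()?;
    Some(format!("{}/{}", owner, repo))
}

/// Keep only the lines that are GitHub repository links (one URL per line).
pub fn github_repos(raw_output: &str) -> Vec<String> {
    raw_output
        .lines()
        .map(str::trim)
        .filter_map(github_repo)
        .collect()
}
```

Unlike a plain grep for github.com, this skips links that merely mention the host somewhere in their path and normalizes deep links (issues, pull requests) to the repository they belong to.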
I guess this (#271) will also help me with my use case. It's up to you to decide what should be prioritized :)
Regarding extending the JSON format, would it be feasible to track the lines of each occurrence, like:
{
  "total": 68,
  "successful": 67,
  "failures": 1,
  "timeouts": 0,
  "redirects": 0,
  "excludes": 0,
  "errors": 0,
  "fail_map": {
    "docs/README.md": [
      {
        "url": "https://semaphoreci.com/ramitsurana/awesome-kubernetes",
        "status": "Failed: HTTP status client error (404 Not Found) for url (https://semaphoreci.com/ramitsurana/awesome-kubernetes)",
        "lines": [13]
      }
    ]
  }
}
?
I'm thinking of writing a GitHub Workflow that will open an issue listing every line that has a dead link (to maintain awesome-x repos). I'm trying to avoid writing another script just to extract the line numbers.
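Until something like a "lines" field exists, the extra script mentioned here can stay quite small. A minimal sketch of such a workaround, where lines_containing is a made-up helper name and not part of lychee:

```rust
/// Return the 1-based line numbers of every line in `text` that contains `url`.
pub fn lines_containing(text: &str, url: &str) -> Vec<usize> {
    text.lines()
        .enumerate()
        .filter(|(_, line)| line.contains(url))
        .map(|(i, _)| i + 1)
        .collect()
}
```

This is a plain substring match, so a longer URL that starts with the dead one is reported as well; for a rough list of places to look at in an issue that is usually acceptable.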
Line numbers cannot be easily added, since the AST strangely doesn't record the raw line/column number.
https://docs.rs/markup5ever_rcdom/latest/markup5ever_rcdom/enum.NodeData.html
If we use https://github.com/tree-sitter/tree-sitter-html, plus some more effort, we can have a highly efficient parser with line/column numbers though.
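Independent of which parser is used, line/column numbers can be derived once a byte offset for each link is known. The sketch below assumes the parser can report such an offset (an assumption, not a statement about any of the parsers mentioned here); LineIndex is a hypothetical helper name.

```rust
/// Byte offsets at which each line of a document starts.
pub struct LineIndex {
    line_starts: Vec<usize>,
}

impl LineIndex {
    /// Scan the text once and remember where every line begins.
    pub fn new(text: &str) -> Self {
        let mut line_starts = vec![0];
        for (i, b) in text.bytes().enumerate() {
            if b == b'\n' {
                line_starts.push(i + 1);
            }
        }
        Self { line_starts }
    }

    /// 1-based (line, column) for a byte offset; the column counts bytes.
    pub fn line_col(&self, offset: usize) -> (usize, usize) {
        // The number of line starts at or before `offset` is the 1-based line.
        let line = self.line_starts.partition_point(|&start| start <= offset);
        let col = offset - self.line_starts[line - 1] + 1;
        (line, col)
    }
}
```

Building the index is a single pass over the input, and each lookup is a binary search, so the cost stays small even for large documents with many links.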
That would be great. Unfortunately, I don't know Rust so I can't help you guys :/
I was just going to file an issue for reporting line numbers -- something I can click and open in a text editor, like an error in a Python program -- but I see this here. Line numbers would indeed be welcome.
See #480, which will be the first step towards making line numbers possible. @untitaker fyi.
I think we can build this into html5gum, and it's going to be easier than with html5ever (because I'm more familiar with the codebase). If we go with the external-stream hack instead, that gives you approximate line numbers and is probably just as easy with either html5gum or html5ever. But I haven't done much investigation on that front.
👍 Comment about the external-stream hack for future reference: https://github.com/lycheeverse/lychee/pull/480#issuecomment-1027301788
It took a while, but extended stats (successful and excluded requests) are in master now. You can enable them in verbose mode (-v). 🕺