include more information in the JSON output

Open eyarz opened this issue 3 years ago • 14 comments

Today, the JSON output prints detailed information only about the links that failed:

{
  "total": 68,
  "successful": 67,
  "failures": 1,
  "timeouts": 0,
  "redirects": 0,
  "excludes": 0,
  "errors": 0,
  "fail_map": {
    "docs/README.md": [
      {
        "url": "https://semaphoreci.com/ramitsurana/awesome-kubernetes",
        "status": "Failed: HTTP status client error (404 Not Found) for url (https://semaphoreci.com/ramitsurana/awesome-kubernetes)"
      }
    ]
  }
}

It would be better if the JSON output also included detailed information about the other links ("successful", "timeouts", etc.).

I'm not a Rust developer, so before I try to figure out how I can help with that: is this something that would be accepted if I open a PR?

eyarz avatar Jun 24 '21 12:06 eyarz

Storing the status of successful links as well would consume more memory.

lebensterben avatar Jun 24 '21 15:06 lebensterben

Probably only in combination with the --verbose/-v flag?

MichaIng avatar Jun 24 '21 15:06 MichaIng

Thanks for the feature request, @eyarz. We can add that, although, as @lebensterben said, it would increase memory usage, and for bigger websites and recursive requests the increase could be quite significant. We could store it in verbose mode; however, I wonder if excluded or redirected URLs should also be added then. I was wondering if we want to add multiple levels of verbosity at some point (like -vvv) and only include it at the highest level.

About the implementation: we'd have to create a success field in the ResponseStats struct here: https://github.com/lycheeverse/lychee/blob/7e497723cb27252008d63892018987dcebbfe995/lychee-bin/src/stats.rs#L25-L35 and save the URL in add here:

https://github.com/lycheeverse/lychee/blob/7e497723cb27252008d63892018987dcebbfe995/lychee-bin/src/stats.rs#L43

and skip the field serialization with a serde attribute like this if the verbosity isn't high:

    // `skip_serializing_if` takes the path of a function `fn(&T) -> bool`,
    // so leave the list empty unless verbosity is high and skip it when empty.
    #[serde(skip_serializing_if = "Vec::is_empty")]
    success: Vec<String>,
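
To make the memory trade-off concrete, here is a minimal sketch using only the standard library. It is not lychee's actual code: the struct below is a simplified, hypothetical stand-in for ResponseStats, and the `detailed` flag, the `success_map` field and the `add` signature are assumptions made for illustration. Successful URLs are kept only when detailed output was requested, so a normal run pays nothing extra.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for lychee's `ResponseStats`.
#[derive(Debug, Default)]
pub struct ResponseStats {
    pub total: usize,
    pub successful: usize,
    pub failures: usize,
    /// Failed URLs per input source; always recorded.
    pub fail_map: HashMap<String, Vec<String>>,
    /// Successful URLs per input source; only filled when `detailed` is true.
    pub success_map: HashMap<String, Vec<String>>,
    /// Assumed to be derived from the verbosity level.
    detailed: bool,
}

impl ResponseStats {
    pub fn new(detailed: bool) -> Self {
        Self { detailed, ..Self::default() }
    }

    /// Records one checked URL. Counters are always updated; the URL of a
    /// successful check is stored only in detailed mode.
    pub fn add(&mut self, source: &str, url: &str, ok: bool) {
        self.total += 1;
        if ok {
            self.successful += 1;
            if self.detailed {
                self.success_map
                    .entry(source.to_string())
                    .or_default()
                    .push(url.to_string());
            }
        } else {
            self.failures += 1;
            self.fail_map
                .entry(source.to_string())
                .or_default()
                .push(url.to_string());
        }
    }
}
```

With this shape, `success_map` stays empty in a normal run, so a serializer that skips empty collections would produce the same JSON as today.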

mre avatar Jun 24 '21 16:06 mre

In my opinion, everything should be included in the JSON output (including redirects). @mre, I like the idea of multiple levels: this way, users can decide whether they want to "risk" high memory usage.

BTW, my use case is that we are using lychee with this Awesome Kubernetes project, and I want to run another validation (that GitHub projects are not archived) on the passing links. Without some way to get the passing links from lychee, I need to extract them myself again (which makes no sense in this use case).

eyarz avatar Jun 27 '21 08:06 eyarz

Did you see #271? It might be an alternative which could cover more use cases. In your situation you could filter the GitHub links with grep then and check if they are archived. Lychee exits with a non-zero exit code in case of errors, in which case your CI would fail as expected. So in the end it would be something like

lychee --output raw | grep github.com | <check github link>
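
If the middle step needs more than grep, the filtering could also be a small program. The following is only a sketch under assumptions: it expects one URL per input line, which is not a documented lychee output format, and `github_repo` is a hypothetical helper name. It extracts the owner and repository name from github.com links and ignores everything else.

```rust
/// Extracts `(owner, repo)` from a `github.com` URL, or returns `None`
/// when the URL points elsewhere or has no repository component.
pub fn github_repo(url: &str) -> Option<(String, String)> {
    let url = url.trim();
    let rest = url
        .strip_prefix("https://github.com/")
        .or_else(|| url.strip_prefix("http://github.com/"))?;
    // Drop any query string or fragment before splitting the path.
    let path = rest
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    let mut parts = path.split('/').filter(|s| !s.is_empty());
    let owner = parts.next()?;
    let repo = parts.next()?;
    Some((owner.to_string(), repo.trim_end_matches(".git").to_string()))
}
```

Each resulting pair could then be handed to whatever performs the archived check; that lookup is deliberately left out here.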

mre avatar Jun 27 '21 10:06 mre

I guess this (#271) will also help me with my use case. It's up to you to decide what should be prioritized :)

eyarz avatar Jun 27 '21 10:06 eyarz

Regarding extending the JSON format, would it be feasible to track the lines of each occurrence, like:

{
  "total": 68,
  "successful": 67,
  "failures": 1,
  "timeouts": 0,
  "redirects": 0,
  "excludes": 0,
  "errors": 0,
  "fail_map": {
    "docs/README.md": [
      {
        "url": "https://semaphoreci.com/ramitsurana/awesome-kubernetes",
        "status": "Failed: HTTP status client error (404 Not Found) for url (https://semaphoreci.com/ramitsurana/awesome-kubernetes)",
        "lines": [13]
      }
    ]
  }
}

?

I'm thinking of writing a GitHub workflow that will open an issue listing every line that has a dead link (to maintain awesome-x repos). I'm trying to avoid writing another script just to extract the line numbers.

micalevisk avatar Dec 08 '21 00:12 micalevisk

Line numbers cannot easily be added, since the AST strangely doesn't record the raw line/column number.

https://docs.rs/markup5ever_rcdom/latest/markup5ever_rcdom/enum.NodeData.html

lebensterben avatar Dec 08 '21 00:12 lebensterben

If we use https://github.com/tree-sitter/tree-sitter-html, plus some more effort, we can have a highly efficient parser with line/column numbers, though.

lebensterben avatar Dec 08 '21 00:12 lebensterben

That would be great. Unfortunately, I don't know Rust, so I can't help you guys :/

micalevisk avatar Dec 08 '21 01:12 micalevisk

I was just going to file an issue for reporting line numbers (ones that I can click and open in a text editor, like an error in a Python program), but I see this here. Line numbers would indeed be welcome.

reagle avatar Jan 13 '22 20:01 reagle

See #480, which will be the first step towards making line numbers possible. @untitaker fyi.

mre avatar Feb 04 '22 11:02 mre

I think we can build this into html5gum, and it's going to be easier than with html5ever (because I'm more familiar with the codebase). But if we go with the external-stream hack, that gives you approximate line numbers and is probably just as easy with either html5gum or html5ever. I haven't done much investigation on that front, though.
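
For reference, one generic way to turn a byte offset into a line number is sketched below. This is not necessarily what the external-stream hack does, and neither html5gum nor html5ever is involved; it assumes the caller already knows the byte offset at which a link was found, and the names `line_starts` and `line_of` are made up for illustration.

```rust
/// Byte offsets at which each line of `text` starts (line 1 starts at 0).
pub fn line_starts(text: &str) -> Vec<usize> {
    let mut starts = vec![0];
    for (i, b) in text.bytes().enumerate() {
        if b == b'\n' {
            starts.push(i + 1);
        }
    }
    starts
}

/// 1-based number of the line that contains byte `offset`.
pub fn line_of(starts: &[usize], offset: usize) -> usize {
    match starts.binary_search(&offset) {
        // `offset` is exactly the first byte of line `idx + 1`.
        Ok(idx) => idx + 1,
        // `idx` line starts lie before `offset`, so it is on line `idx`.
        Err(idx) => idx,
    }
}
```

The index is built once per document, and each lookup is then a binary search, so reporting a line per link stays cheap even for large inputs.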

untitaker avatar Feb 04 '22 11:02 untitaker

👍 Comment about the external-stream hack for future reference: https://github.com/lycheeverse/lychee/pull/480#issuecomment-1027301788

mre avatar Feb 04 '22 11:02 mre

Took a while but extended stats (successful and excluded requests) are in master now. You can enable them in verbose mode (-v). 🕺

mre avatar Dec 20 '22 09:12 mre