action-validator

Fancy errors?

Open bcheidemann opened this issue 2 months ago • 7 comments

Summary

As noted by @dabrahams in https://github.com/mpalmer/action-validator/issues/72#issuecomment-2086780223, the output of the CLI does not make it easy to identify the location of the error. Because it only provides a path to the offending item, users must manually traverse the file to find the source of the error.

Proposal

Implement Rust-style diagnostic output.

Prototype

I've been experimenting with saphyr (which offers spanned yaml parsing) and annotate_snippets (which provides an ergonomic API for producing Rust-style diagnostic reports), and I've been able to hack together a (very messy) prototype:

[Image: screenshot of the prototype's diagnostic output]
diff --git a/src/lib.rs b/src/lib.rs
index 5d4f9d4..38a8bac 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -6,6 +6,7 @@ mod validation_error;
 mod validation_state;
 
 use config::{ActionType, RunConfig};
+use saphyr::{LoadableYamlNode, MarkedYamlOwned, SafelyIndex as _, ScalarOwned, YamlDataOwned};
 use std::path::PathBuf;
 use validation_error::ValidationError;
 use validation_state::ValidationState;
@@ -157,6 +158,7 @@ pub mod cli {
 
 fn run(config: &RunConfig) -> ValidationState {
     let file_name = config.file_name.unwrap_or("file");
+
     let doc = serde_yaml::from_str(config.src);
 
     let mut state = match doc {
@@ -190,6 +192,50 @@ fn run(config: &RunConfig) -> ValidationState {
     state.action_type = Some(config.action_type);
     state.file_path = config.file_path.map(|file_name| file_name.to_string());
 
+    let doc = saphyr::MarkedYamlOwned::load_from_str(config.src)
+        .unwrap()
+        .into_iter()
+        .next()
+        .unwrap();
+
+    for error in &state.errors {
+        let mut node = Some(&doc);
+        match error {
+            ValidationError::NoFilesMatchingGlob {
+                code,
+                detail,
+                path,
+                title,
+            } => {
+                for segment in path.split("/").skip(1).collect::<Vec<_>>() {
+                    let Some(prev_node) = node else {
+                        todo!();
+                    };
+                    node = prev_node.get(segment);
+                }
+                let node = node.unwrap();
+
+                let report = &[annotate_snippets::Level::ERROR
+                    .primary_title(title)
+                    .element(
+                        annotate_snippets::Snippet::source(config.src)
+                            .line_start(26)
+                            .path(config.file_path.unwrap_or(file_name))
+                            .annotation(
+                                annotate_snippets::AnnotationKind::Primary
+                                    .span(node.span.start.index()..node.span.end.index() - 2)
+                                    .label(detail.as_ref().unwrap_or(title)),
+                            ),
+                    )];
+
+                let renderer = annotate_snippets::Renderer::styled()
+                    .decor_style(annotate_snippets::renderer::DecorStyle::Unicode);
+                anstream::println!("{}", renderer.render(report));
+            }
+            _ => {}
+        }
+    }
+
     state
 }

Limitations

Unfortunately, valico requires a serde_json::Value as input. The serde_json::Value type is un-spanned, and so cannot provide the span information needed to produce the desired diagnostic output. For this reason, the above PoC parses the file twice: first with serde_yaml to produce errors via valico, then again with saphyr to retrieve the span information used to render the diagnostic messages.

Perhaps unsurprisingly, I was not able to find any JSON schema validators which accept a parsed YAML type with span information (such as that produced by saphyr) as input.

Obviously, double parsing isn't ideal because it...

  • Adds a performance overhead
  • Unnecessarily increases the file size of the binary
  • Increases the surface area for bugs

Possible Solutions

I've considered a few options...

1. Accept Defeat

Either temporarily or permanently, we could accept the downsides of double parsing the YAML file as an acceptable cost of improving the ergonomics of the tool.

Pros:

  • Easy - existing implementation is relatively simple
  • No need to incur performance overhead if no errors

Cons:

  • As mentioned above

2. Contribute a saphyr feature flag to valico

Because both saphyr and serde_json essentially describe the same kind of data, it should be possible to implement this in valico, e.g. behind a feature flag.

Pros:

  • No double parsing
  • No performance overhead

Cons:

  • Supporting two separate formats adds complexity for valico
  • Requires the upstream maintainers' blessing or a fork (the maintainers of valico may not want this feature)

3. Implement conversion from saphyr to serde_json::Value

We can convert saphyr::MarkedYamlOwned to serde_json::Value and pass that to valico for validation.

Pros:

  • No double parsing so less scope for bugs
  • Relatively simple to implement conversion (already did a PoC for this)

Cons:

  • Requires cloning the entire AST
  • Some performance impact due to conversion
  • Results in runtime overhead even if no errors are logged
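For concreteness, here is a minimal sketch of such a conversion. Note that `SpannedYaml` and `Json` below are simplified stand-in types invented for illustration (playing the roles of saphyr::MarkedYamlOwned and serde_json::Value respectively); they are not real APIs:

```rust
// Stand-in types (illustrative only): `SpannedYaml` plays the role of
// saphyr::MarkedYamlOwned and `Json` the role of serde_json::Value.
#[derive(Debug, Clone, Copy)]
struct Span {
    start: usize,
    end: usize,
}

#[derive(Debug)]
enum SpannedYaml {
    Null(Span),
    Bool(bool, Span),
    String(String, Span),
    Sequence(Vec<SpannedYaml>, Span),
    Mapping(Vec<(String, SpannedYaml)>, Span),
}

#[derive(Debug, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    String(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

// Walk the spanned tree, discarding spans and cloning the owned data.
// This traversal-plus-clone is exactly the overhead listed in the cons.
fn to_json(yaml: &SpannedYaml) -> Json {
    match yaml {
        SpannedYaml::Null(_) => Json::Null,
        SpannedYaml::Bool(b, _) => Json::Bool(*b),
        SpannedYaml::String(s, _) => Json::String(s.clone()),
        SpannedYaml::Sequence(items, _) => Json::Array(items.iter().map(to_json).collect()),
        SpannedYaml::Mapping(entries, _) => Json::Object(
            entries.iter().map(|(k, v)| (k.clone(), to_json(v))).collect(),
        ),
    }
}
```

The original spanned tree stays alive after conversion, so the error-reporting pass can still look spans up in it once the validator reports a path.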

4. Implement spanned parsing for serde_yaml / serde_json / serde

I looked into whether it was possible to add support for span information to our existing dependencies. Long story short, I'm not aware of any way to practically do this in serde at the moment (though I may have missed something) and even if there was a way, it would be necessary to contribute changes to both serde_yaml and serde_json.

Pros:

  • Able to use existing dependencies
  • Addresses all of the above mentioned limitations

Cons:

  • Likely not feasible

5. Implement a "lingua franca" for data formats

At present, there is no agreed-upon format for storing or working with unstructured data. serde_json::Value is used extensively, even in our case for non-JSON formats (we work with YAML). However, serde_json::Value owns its data, and cannot be used in cases where another type is required to own the data.

There are many use cases for working with data which do not depend on the specifics of the underlying type. For example, validation is one such use case. Validation requires only knowledge of the structure of the data that is being represented, not details of the types used to represent it.

A trait could be implemented which acts as a "lingua franca" for working with different data types:

enum Value<'data> {
    Null,
    String(&'data str),
    // ...
}

trait DataFormat<'data> {
    type Format: DataFormat<'data>;

    fn safe_get<TKey: AsRef<str>>(&self, key: TKey) -> Option<&Self::Format>;

    fn as_value(&self) -> Value<'data>;

    // ...
}

This can be implemented by any type, including serde_json::Value, serde_yaml::Value, saphyr::MarkedYamlOwned and others, and accepted by validation libraries such as valico.
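To make the idea concrete, here is a compilable, simplified variant of the sketch: the lifetime and associated type are dropped, `safe_get` returns an `Option` so missing keys are representable, and `Toy` is a made-up owned value type standing in for the likes of serde_json::Value (nothing here is from a real crate):

```rust
// Simplified "lingua franca" trait; `Toy` is an invented stand-in type.
enum Value<'data> {
    Null,
    String(&'data str),
    Mapping, // aggregate kinds elided for brevity
}

enum Toy {
    Null,
    String(String),
    Mapping(Vec<(String, Toy)>),
}

trait DataFormat {
    // Returns None for missing keys or non-mapping nodes.
    fn safe_get(&self, key: &str) -> Option<&Self>;
    // Exposes the node's structure without exposing the concrete type.
    fn as_value(&self) -> Value<'_>;
}

impl DataFormat for Toy {
    fn safe_get(&self, key: &str) -> Option<&Self> {
        match self {
            Toy::Mapping(entries) => entries
                .iter()
                .find(|(k, _)| k.as_str() == key)
                .map(|(_, v)| v),
            _ => None,
        }
    }

    fn as_value(&self) -> Value<'_> {
        match self {
            Toy::Null => Value::Null,
            Toy::String(s) => Value::String(s.as_str()),
            Toy::Mapping(_) => Value::Mapping,
        }
    }
}

// A validator written against the trait, as valico hypothetically could be,
// works unchanged for any implementing type.
fn require_string_field<T: DataFormat>(doc: &T, key: &str) -> Result<(), String> {
    match doc.safe_get(key).map(|v| v.as_value()) {
        Some(Value::String(_)) => Ok(()),
        Some(_) => Err(format!("`{key}` is not a string")),
        None => Err(format!("`{key}` is missing")),
    }
}
```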

This approach is inspired by standard-schema which takes a similar approach to solving an unrelated ecosystem fragmentation issue in the TypeScript space.

Pros:

  • No double parsing
  • No performance overhead
  • No increase in binary size
  • No hacky glue code required in action-validator

Cons:

  • Outside the scope of this project
  • Relies on upstream maintainers accepting this change (at least valico or another JSON schema validation crate would need to "buy in" to this approach)

I'm interested to hear thoughts on this and the above issues/solutions. If there's interest for this feature, I'd be happy to work towards this.

bcheidemann avatar Oct 14 '25 23:10 bcheidemann

Rust diagnostic output is extremely problematic for IDEs that try to handle diagnostics from a wide variety of sources (like Emacs). It requires a massive hack to get an acceptable user experience. I beg of you, please please use the GNU error formatting standard, which will already be supported everywhere.

dabrahams avatar Oct 16 '25 00:10 dabrahams

Rust diagnostic output is extremely problematic for IDEs that try to handle diagnostics from a wide variety of sources (like Emacs). It requires a massive hack to get an acceptable user experience. I beg of you, please please use the GNU error formatting standard, which will already be supported everywhere.

Hey @dabrahams, thanks for the valuable feedback!

I understand the concern, and your proposed alternative makes sense for IDEs. However, Rust-style errors, while difficult to parse, have several advantages for humans. Based on my reading of the linked page, the GNU error formatting standard doesn't allow for:

  1. Rich, multi-line errors
  2. Annotated code snippets
  3. Displaying messages for multiple related spans (e.g. "first used here")
  4. Contextual information such as hints or suggestions

That said, I know tools like GCC do output some of this information, but the first line of an error is always compliant. Is the output from e.g. GCC considered compliant with the standard? Does it cause similar headaches for IDEs like emacs?

Genuinely curious to hear your thoughts about this @dabrahams, as I'm no expert in IDEs or error formatting conventions.

I guess my thoughts are:

  1. Yes, we should take steps to ensure diagnostic information is widely usable (including by IDEs if there is a demand for that)
  2. We should not compromise on human readability

If those two goals are incompatible, we could go the way of other tools e.g.

$ action-validator --output=human ... # default
$ action-validator --output=gnu ...
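For what it's worth, the flag itself would be cheap to add. A hedged sketch of the parsing side, assuming a plain FromStr-based argument (the names are illustrative, not an existing action-validator API):

```rust
use std::str::FromStr;

// Hypothetical output-format flag; the variants mirror the example
// invocations above and are not part of any existing action-validator API.
#[derive(Debug, PartialEq)]
enum OutputFormat {
    /// Rust-style annotated snippets (the proposed default).
    Human,
    /// GNU-convention "file:line:column: message" lines for tools.
    Gnu,
}

impl FromStr for OutputFormat {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "human" => Ok(OutputFormat::Human),
            "gnu" => Ok(OutputFormat::Gnu),
            other => Err(format!("unknown output format: {other}")),
        }
    }
}
```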

bcheidemann avatar Oct 16 '25 10:10 bcheidemann

Some big and very awesome ideas here, @bcheidemann. Thank you for the write up. I've never been happy with how action-validator presents errors, but I've also never been sufficiently irritated by it to dedicate any time to digging in to fixing it. I'm glad (if that's the right word) that you have been sufficiently irritated. 😀

My "gut check" on each of the approaches you've suggested, based on no more than about 30 seconds' thought, is:

  1. "Accept defeat" and double-parse: yeah, not ideal, but a-v isn't a performance-critical tool (as far as I know), so this wouldn't be the end of the world.
  2. saphyr support in valico (via feature flag or otherwise): if valico were up for accepting such a thing, it would certainly seem to be a good approach. Presumably others could potentially benefit from spanned parsing, too, so it might not be a hard sell, but I'm sure it's a non-trivial amount of work. At the very least it seems worthwhile to reach out and take their temperature.
  3. conversion from saphyr to serde_json::Value: pros/cons seem similar to double-parsing; in both cases there's a performance hit, and which one is "worse" might not be obvious (or it might, I really don't know). The deciding factor between 1 and 3 is probably going to be as much about implementation effort as anything else.
  4. spanned parsing in serde: the completely impractical uber-nerd in me likes this as the "most correct" solution, but hoo boy it would be a huge amount of work, and it could take literal years to get serde to adopt it, if they ever would.
  5. common format for parsed data: this is a real "boil the ocean" approach, and touches on something that has always annoyed me about serde: every format provides its own Value type (or equivalent), when the parsed output should really be a generic serde-provided type. In any event, it feels like it'd be a huge amount of work, no matter where it landed, and yeah, it'd need buy-in from lots of people to make it valuable, which makes it a long-term approach, at best.

At a glance, if valico wanted saphyr support and it wasn't a lot of effort, that seems like the ideal approach; otherwise, whichever of 1 or 3 is easiest to do seems the best approach.

As far as error-reporting formats go, it would definitely be good to have IDE compatibility as far as possible, but as humans are the lowest common denominator as far as parsing capability goes, human-readability is really important (he says, having written something that produces absolutely incomprehensible error messages...). It seems like you've got a lot of domain expertise in tool compatibility, @dabrahams, which I (and by the look of it, Ben) don't have to nearly the same degree, so your input to keep things tool-friendly as well will be invaluable.

mpalmer avatar Oct 16 '25 22:10 mpalmer

  1. "Accept defeat" and double-parse: yeah, not ideal, but a-v isn't a performance-critical tool (as far as I know), so this wouldn't be the end of the world.

Agreed.

  2. saphyr support in valico (via feature flag or otherwise): if valico were up for accepting such a thing, it would certainly seem to be a good approach. Presumably others could potentially benefit from spanned parsing, too, so it might not be a hard sell, but I'm sure it's a non-trivial amount of work. At the very least it seems worthwhile to reach out and take their temperature.

I've had a look at the valico codebase and I think this would, unfortunately, be quite a significant undertaking while also maintaining support for serde_json::Value. I imagine dropping serde_json::Value support would not be an option, as it would be a huge breaking change.

I could see two possible approaches for adding saphyr support to valico:

  1. Feature flag - compile different code depending on a cargo feature.
  2. A trait which implements the operations required by valico and can be implemented by both serde_json::Value and saphyr - but this is only a small step away from option 5, which could be a big win for the ecosystem if it gains adoption.
  3. conversion from saphyr to serde_json::Value: pros/cons seem similar to double-parsing; in both cases there's a performance hit, and which one is "worse" might not be obvious (or it might, I really don't know). The deciding factor between 1 and 3 is probably going to be as much about implementation effort as anything else.

Broadly agree. The downside compared to option 1 is that we always incur the overhead of conversion. With option 1, we can incur that cost only if it's needed - i.e. if validation passes, we don't pay.

  4. spanned parsing in serde: the completely impractical uber-nerd in me likes this as the "most correct" solution, but hoo boy it would be a huge amount of work, and it could take literal years to get serde to adopt it, if they ever would.

Agreed.

  5. common format for parsed data: this is a real "boil the ocean" approach, and touches on something that has always annoyed me about serde: every format provides its own Value type (or equivalent), when the parsed output should really be a generic serde-provided type.

Yes, that's always bothered me too, in particular because serde_json::Value gets used as the de facto serde value type even when the use case is not inherently JSON-related. It would be nice if there were a serde::Value. However, I don't know if that's practical, because the wide scope of serde may make it impractical to agree on the details of that type; what is required for e.g. JSON may be vastly different from other data formats.

In any event, it feels like it'd be a huge amount of work, no matter where it landed, and yeah, it'd need buy-in from lots of people to make it valuable, which makes it a long-term approach, at best.

Yes, it's certainly quite a lot of work. That said, certainly less than option 4, and perhaps not so much more than option 2... After all, the hypothetical trait can be implemented without buy-in from serde_json or saphyr, and the implementation of the trait for something like serde_json or saphyr would be fairly trivial (I already did a prototype to see if it's feasible and to start feeling out the rough edges). The main challenges are (1) designing and implementing such a trait and an associated standard Value type, and (2) adding support to saphyr (which is about the same effort as option 2).

At a glance, if valico wanted saphy support and it wasn't a lot of effort, that seems like the ideal approach

I'd be interested to gauge valico's interest in options 2 and 5. If I were the maintainer of valico, I'd rather add support for a common trait which allows any unstructured data type to be validated than add support for saphyr, potentially open myself up to a litany of "add support for some_other_crate::Value" PRs/issues, and then be on the hook for maintaining those forever.

... otherwise whichever of 1 or 3 is easiest to do seems the best approach.

My gut feel is that options 1 and 3 are the only realistic options to get this done in a reasonable time frame, since they don't require changes to other crates. Both have pros and cons, and I don't have a strong opinion either way. I'd summarize the main trade-offs as:

  • Option 1 requires fewer resources in the success case, because spanned parsing can be skipped entirely when validation passes.
  • Option 3 requires fewer resources in the error case, because converting the spanned AST is cheaper than parsing the file a second time.

As far as error-reporting formats go, it would definitely be good to have IDE compatibility as far as possible, but as humans are the lowest common denominator as far as parsing capability goes, human-readability is really important (he says, having written something that produces absolutely incomprehensible error messages...).

Agreed on both counts.

It seems like you've got a lot of domain expertise in tool compatibility, @dabrahams, which I (and by the look of it, Ben) don't have to nearly the same degree, so your input to keep things tool-friendly as well will be invaluable.

Yes, please! It's a non-trivial change so the more feedback we can gather on the implementation and possible use cases, the better.

bcheidemann avatar Oct 17 '25 11:10 bcheidemann

Btw, valico appears to be unmaintained since 2023 (or at least there have been no commits or new versions published since then), so it might not be possible to get buy-in from the maintainer for any required upstream changes... There are other options for JSON schema validation, such as jsonschema, but adopting these would further increase the scope of this already quite significant change.

bcheidemann avatar Oct 17 '25 11:10 bcheidemann

  1. Rich, multi-line errors
  2. Annotated code snippets
  3. Displaying messages for multiple related spans (e.g. "first used here")
  4. Contextual information such as hints or suggestions

The GNU standard allows for all those things. The only thing it restricts is how you report line and column information and where the primary error message goes. You can put as much text after that line as you want, and several tools do. You can have supplementary notes with their own line and column information after the primary error. You can look at g++, for example, whose diagnostic requirements are at least as complex as Rust's. I believe clang also uses this format, at least given particular command-line flags. If you pass -diagnostic-style llvm to the Swift compiler you get the GNU format.
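To illustrate, a minimal sketch of the location-first line shape that GCC emits and tools like Emacs already recognize (the struct and its fields are invented for illustration, not taken from action-validator):

```rust
// Illustrative diagnostic record; not an action-validator type.
struct Diagnostic<'a> {
    path: &'a str,
    line: usize,
    column: usize,
    message: &'a str,
}

// Location first, then severity, then the message; richer multi-line
// detail (notes, snippets) can follow on subsequent lines.
fn gnu_format(d: &Diagnostic) -> String {
    format!("{}:{}:{}: error: {}", d.path, d.line, d.column, d.message)
}
```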

And let me add, the ability to quickly jump to and between errors while seeing the code they point to during development is a very human concern. Approximately no humans read error messages outside of an IDE. If there's truly some advantage to presenting humans a different view of that little bit of information that's specified by the GNU standard, give the IDEs the normalized format they can digest and allow IDE functionality/plugins to munge those standardized parts for presentation.

Follow-up: I can see how you might read the standard as restricting your ability to put error information outside that line, but their standard isn't saying you can't write “auxiliary information” to stderr. Regardless of whether we feel that's bending the rules too far, the point here is to be compatible with a format that tools already recognize, and they all recognize the output from those compilers.

dabrahams avatar Oct 17 '25 13:10 dabrahams

@dabrahams Thanks for clarifying! I'm sure we can at least wrangle annotate-snippets into producing a GNU-compliant first line, or find another suitable crate which allows both helpful terminal output and GNU-compliant error messages.

bcheidemann avatar Oct 20 '25 12:10 bcheidemann