toml
toml copied to clipboard
Parse mode that returns all errors and an as-valid-as-possible Document
Background
In configure_me I recently added support for very nice error messages using codespan-reporting. One of the nice features is ability to report multiple errors. However the requirement to not duplicate keys degrades the experience. Note that configure_me input is even more strict: it requires that keys are not duplicate among multiple items. This is detected manually and produces output like this:
error: the option `foo` appears more than once
┌─ unknown file:4:26
│
3 │ conf_file_param = "foo"
│ ----- the option was first defined here
4 │ conf_dir_param = "foo"
│ ^^^^^ the option is repeated here
·
13 │ [param.foo]
│ ^^^ ... and here
·
16 │ [switch.foo]
│ ^^^ ... and here
As you can see, this is very programmer-friendly. Note that this also outputs other errors such as invalid names of parameters. However if you duplicate a key within a single table you get a message like this:
┌─ unknown file:16:9
│
16 │ [switch.foo]
│ ^ redefinition of table `switch.foo` for key `switch.foo` at line 16 column 9
Not only does it limit the output to a single error, it's also missing information about which other place defines the same key and span points to a weird place as well.
Proposed solution
The easiest for me would be adding set_allow_duplicate_keys() method similar to the existing methods. Then the existing code would work just fine. The method should probably have documentation that users must reject duplicates themselves to adhere to Toml specification and the method is only provided for custom error handling.
Alternative
I would find it understandable if you think that despite warnings in the doc people would abuse it to implement "robustness principle" which is actually quite bad. If you'd rather not provide an abuse-able tool then at least improving the error type to contain enough information to show all duplicated fields within a single map would be nice.
If I understand correctly, and the proposed solution is to add support for a non-standard flavor of Toml which is disabled by default. And the alternative solution is to improve the error message on duplicate keys. I'd be in favor of the latter.
I'd rather we explore any other option before going down the route of set_allow_duplicate_keys.
We already need to add spans for fields. Once we have that, we could easily look up the first span and report both.
We could then collect errors as we go, accumulating them, rather than stopping immediately. Some questions or concerns I have that would need to be addressed
- How much does it impact performance of the success case?
- Which errors do we treat as recoverable / accumulatable? Only non-syntactic ones like duplicate keys?
- How do we track this information as we go?
How much does it impact performance of the success case?
For my projects it certainly won't be enough for me to care. Parsing would have to be maybe 1000x slower to be a problem. If it has to be behind a feature flag so that perf-sensitive people are not harmed, I will happily turn on the feature flag.
Only non-syntactic ones like duplicate keys?
I believe so. I really don't like how rustc tries to recover syntactic errors causing printing of more errors that were caused by the syntactic ones.
How do we track this information as we go?
In configure_me I used a special error type for duplicates and I track the spans of all occurrences in Vec<Span>. I don't see anything wrong with it.
For my projects it certainly won't be enough for me to care. Parsing would have to be maybe 1000x slower to be a problem. If it has to be behind a feature flag so that perf-sensitive people are not harmed, I will happily turn on the feature flag.
Cargo is the main performance-sensitive project that I'm aware of and we need to be sensitive to as it has to parse hundreds of toml files per invocation. I have been trying to nerd snipe the performance team into looking at this and seeing about doing a binary cache for immutable manifests (ie from a registry)
Also, I'll note that my main toml-related priority right now is moving toml-rs to be on top of toml_edit. Anything related to this comes after for me, including reviewing people making changes to toml-rs as that takes both review time and is throwaway.
Just noticed I forgot to say that returning error will prevent me from reporting all duplicates across different parts of the toml. So it's better than the current situation but worse than the ideal situation. And I guess adding logic to toml to handle this would be too much? I can't even imagine what it would look like.
Something I've been considering is to expose error recovery to the caller. meaning we'd have something like parser_recovery(data: &str) -> (Document, Vec<Errors>).
Sounds interesting, what would the Document type provide?
Document would be what toml_edit::Document we were able to load, despite the errors.
That sounds quite good, any reason to not return T: Deserialize instead?
I'm assuming you are referring to the toml crate. In #340 we'll be swapping out the parser for toml_edit and the lowest level place we'll need to handle error recovery will be in toml_edit. We'll then need to decide how much to automatically expose up the stack. Keep in mind that I'm hoping to see full error recovery, including from syntax errors. This will greatly diminish the value of directly supporting T: Deserialize as syntax errors will likely lead to schema errors. Most likely, you'll want to check the errors and decide if you want to still deserialize or report the errors as-is.
And to clarify, with us swapping tomls parser in #340, I'd like to avoid throw-away work before then, like implementing or reviewing someone else implementing error recovery in the existing toml crate.