Span tracking
Introducing optional span tracking (line, column, byte offset).
The feature is behind a new feature flag called spanned. It introduces
Spanned<T>, with wich you can get serde_json (when calling the appropriate deserializer) to inject span information into customDeserializetypes.SpannedValuea copy ofValuethat usesSpanned<T>to provide span information.
This PR still needs a bit of fixing up:
- [ ] clean out quick testing code
- [ ] Fix relying on
Read::positionas itscolumnisn't what we want and itslineis slow to compute- [ ] While doing so, make sure to fix spans including leading white space
And then a good chunk of polishing:
- [ ] DRY up the code (much of
SpannedValuefunctionality is 90% copy-pasta fromValue) - [ ] Write tests
- [ ] Update/write docs
- [ ] Maybe add the feature to the default features for the playground?
However, before I proceed with any of that, and since I've already put a good amount hours into this, I'd first like to get initial feedback on/review of the general design (API, outlined provided functionality).
Here's the gist of how it'd work
Sample code with output
#[cfg(test)]
mod test {
use crate::spanned::*;
use crate::*;
fn show_spanned<T: core::fmt::Debug>(s: &Spanned<T>, source: &str) {
use codespan_reporting::diagnostic::{Diagnostic, Label};
use codespan_reporting::files::SimpleFiles;
use codespan_reporting::term;
use codespan_reporting::term::termcolor::{ColorChoice, StandardStream};
let mut files = SimpleFiles::new();
let file_id = files.add("input", source);
let diagnostic = Diagnostic::note()
.with_message(std::format!("Look, it's a {}", std::any::type_name::<T>()))
.with_notes(std::vec![
std::format!("{:?}", s.as_ref()),
std::format!(
"From {}:{} ({}) to {}:{} ({})",
s.span().start.line,
s.span().start.column,
s.span().start.byte_offset,
s.span().end.line,
s.span().end.column,
s.span().end.byte_offset,
)
])
.with_labels(std::vec![Label::primary(file_id, s.byte_span())]);
let writer = StandardStream::stderr(ColorChoice::Always);
let config = codespan_reporting::term::Config::default();
let mut writer_lock = writer.lock();
term::emit(&mut writer_lock, &config, &files, &diagnostic).unwrap()
}
fn show_recursive(v: &Spanned<SpannedValue>, source: &str) {
show_spanned(v, source);
match v.get_ref() {
SpannedValue::Array(values) => values.iter().for_each(|v| show_recursive(v, source)),
SpannedValue::Object(map) => map.iter().for_each(|(k, v)| {
show_spanned(k, source);
show_recursive(v, source)
}),
_ => {}
}
}
#[test]
fn test_spanned_value() {
let data = r#"
{
"name": "John Doe",
"age": 42,
"phones": [
"+44 1234567",
"+44 2345678"
]
}"#;
let v: Spanned<SpannedValue> = from_str_spanned(data).unwrap();
show_recursive(&v, data);
}
}
Produces
note: Look, it's a serde_json::spanned::value::SpannedValue
┌─ input:1:1
│
1 │ ╭
2 │ │ {
3 │ │ "name": "John Doe",
4 │ │ "age": 42,
· │
8 │ │ ]
9 │ │ }
│ ╰─────────^
│
= Object {Spanned { span: Span { start: SpanPosition { line: 3, column: 13, byte_offset: 23 }, end: SpanPosition { line: 3, column: 18, byte_offset: 29 } }, value: "name" }: Spanned { span: Span { start: SpanPosition { line: 3, column: 20, byte_offset: 30 }, end: SpanPosition { line: 3, column: 30, byte_offset: 41 } }, value: String("John Doe") }, Spanned { span: Span { start: SpanPosition { line: 4, column: 13, byte_offset: 55 }, end: SpanPosition { line: 4, column: 17, byte_offset: 60 } }, value: "age" }: Spanned { span: Span { start: SpanPosition { line: 4, column: 19, byte_offset: 61 }, end: SpanPosition { line: 4, column: 21, byte_offset: 64 } }, value: Number(42) }, Spanned { span: Span { start: SpanPosition { line: 5, column: 13, byte_offset: 78 }, end: SpanPosition { line: 5, column: 20, byte_offset: 86 } }, value: "phones" }: Spanned { span: Span { start: SpanPosition { line: 5, column: 22, byte_offset: 87 }, end: SpanPosition { line: 8, column: 13, byte_offset: 167 } }, value: Array [Spanned { span: Span { start: SpanPosition { line: 6, column: 17, byte_offset: 109 }, end: SpanPosition { line: 6, column: 29, byte_offset: 122 } }, value: String("+44 1234567") }, Spanned { span: Span { start: SpanPosition { line: 7, column: 17, byte_offset: 140 }, end: SpanPosition { line: 7, column: 29, byte_offset: 153 } }, value: String("+44 2345678") }] }}
= From 1:1 (0) to 9:9 (177)
note: Look, it's a alloc::string::String
┌─ input:3:13
│
3 │ "name": "John Doe",
│ ^^^^^^
│
= "name"
= From 3:13 (23) to 3:18 (29)
note: Look, it's a serde_json::spanned::value::SpannedValue
┌─ input:3:20
│
3 │ "name": "John Doe",
│ ^^^^^^^^^^^
│
= String("John Doe")
= From 3:20 (30) to 3:30 (41)
note: Look, it's a alloc::string::String
┌─ input:4:13
│
4 │ "age": 42,
│ ^^^^^
│
= "age"
= From 4:13 (55) to 4:17 (60)
note: Look, it's a serde_json::spanned::value::SpannedValue
┌─ input:4:19
│
4 │ "age": 42,
│ ^^^
│
= Number(42)
= From 4:19 (61) to 4:21 (64)
note: Look, it's a alloc::string::String
┌─ input:5:13
│
5 │ "phones": [
│ ^^^^^^^^
│
= "phones"
= From 5:13 (78) to 5:20 (86)
note: Look, it's a serde_json::spanned::value::SpannedValue
┌─ input:5:22
│
5 │ "phones": [
│ ╭──────────────────────^
6 │ │ "+44 1234567",
7 │ │ "+44 2345678"
8 │ │ ]
│ ╰─────────────^
│
= Array [Spanned { span: Span { start: SpanPosition { line: 6, column: 17, byte_offset: 109 }, end: SpanPosition { line: 6, column: 29, byte_offset: 122 } }, value: String("+44 1234567") }, Spanned { span: Span { start: SpanPosition { line: 7, column: 17, byte_offset: 140 }, end: SpanPosition { line: 7, column: 29, byte_offset: 153 } }, value: String("+44 2345678") }]
= From 5:22 (87) to 8:13 (167)
note: Look, it's a serde_json::spanned::value::SpannedValue
┌─ input:6:17
│
6 │ "+44 1234567",
│ ^^^^^^^^^^^^^
│
= String("+44 1234567")
= From 6:17 (109) to 6:29 (122)
note: Look, it's a serde_json::spanned::value::SpannedValue
┌─ input:7:17
│
7 │ "+44 2345678"
│ ^^^^^^^^^^^^^
│
= String("+44 2345678")
= From 7:17 (140) to 7:29 (153)
Closes: #637
I can't comment as a maintainer of serde_json but as a maintainer of other crates, the best path of action is to dramatically cut down the size and number of concepts. For both review and your own good since there are many open TODO's in that commit and you're not going to want to address them all at once, but over time. Set yourself up for incremental small steps forward, I've seen smaller efforts just lose momentum.
For instance, SpannedValue and the Spanned<T>: Deserialize seem quite orthogonal to each other.
SpannedValue on its own seems very concrete. There is also clear prior reference to its implementation style in the crate. This already calls into question the need for a bool-flag in the deserializer (compare RawValue doesn't need to be turned on in addition to the feature flag). If it gets a feature flag then I don't see why SpannedValue and Spanned<T> should have overlapping ones.
It would also be helpful to expand on the motivating use cases in tests some more. For users that are are deriving Deserialize there are already options:
- is actually a minimal implementation satisfying the referenced issue
- for diagnostics, downstream crate want to issue help that refers to spans in the json input after semantically interpreting the json
- for rewriting, the spans allows us to programmatically replace the value
- the span allows getting a
&'lt RawValueafter deserialization
Each of these deserve asserts that demonstrate the additional information is valuable. If you can show that this can do multiple uses cases, it's very convincing. Needing lots of maintainance for niche gains will not be easy to add to any OSS crate. The show call is just there, a proper test is more convincing.
Also probably cut the impl Serializer and impl Deserializer for Spanned<T> initially unless you can link them to one of the above. Those adds hundreds of lines of their own, need separate test coverage, have some complex question with regards to future versions, and should at least be a separate commit.
Then in comparison, I'm not certain the motivation for SpannedValue is nearly as good. Sure, adding span information to all parts of a json Value interesting. But you're littering the overhead everywhere (performance, memory use and thousands of implementation lines) and the question is really for which use case this is the best tool. That doesn't seem nearly as clear to me. Is it even realistic to expect one to build up such a value by hand?
I only found this PR now, after submitting https://github.com/serde-rs/json/pull/1266 which addresses roughly the same problem but in a much more minimal way (does not attempt to track lines/columns and does not touch the Value type, only tracks byte offset for explicitly wrapped serde_spanned::Spanned<T> values).