json icon indicating copy to clipboard operation
json copied to clipboard

Span tracking

Open robsdedude opened this issue 9 months ago • 2 comments

Introducing optional span tracking (line, column, byte offset).

The feature is behind a new feature flag called spanned. It introduces

  • Spanned<T>, with wich you can get serde_json (when calling the appropriate deserializer) to inject span information into custom Deserialize types.
  • SpannedValue a copy of Value that uses Spanned<T> to provide span information.

This PR still needs a bit of fixing up:

  • [ ] clean out quick testing code
  • [ ] Fix relying on Read::position as its column isn't what we want and its line is slow to compute
    • [ ] While doing so, make sure to fix spans including leading white space

And then a good chunk of polishing:

  • [ ] DRY up the code (much of SpannedValue functionality is 90% copy-pasta from Value)
  • [ ] Write tests
  • [ ] Update/write docs
  • [ ] Maybe add the feature to the default features for the playground?

However, before I proceed with any of that, and since I've already put a good amount hours into this, I'd first like to get initial feedback on/review of the general design (API, outlined provided functionality).

Here's the gist of how it'd work

Sample code with output
#[cfg(test)]
mod test {
    use crate::spanned::*;
    use crate::*;

    fn show_spanned<T: core::fmt::Debug>(s: &Spanned<T>, source: &str) {
        use codespan_reporting::diagnostic::{Diagnostic, Label};
        use codespan_reporting::files::SimpleFiles;
        use codespan_reporting::term;
        use codespan_reporting::term::termcolor::{ColorChoice, StandardStream};

        let mut files = SimpleFiles::new();
        let file_id = files.add("input", source);
        let diagnostic = Diagnostic::note()
            .with_message(std::format!("Look, it's a {}", std::any::type_name::<T>()))
            .with_notes(std::vec![
                std::format!("{:?}", s.as_ref()),
                std::format!(
                    "From {}:{} ({}) to {}:{} ({})",
                    s.span().start.line,
                    s.span().start.column,
                    s.span().start.byte_offset,
                    s.span().end.line,
                    s.span().end.column,
                    s.span().end.byte_offset,
                )
            ])
            .with_labels(std::vec![Label::primary(file_id, s.byte_span())]);

        let writer = StandardStream::stderr(ColorChoice::Always);
        let config = codespan_reporting::term::Config::default();
        let mut writer_lock = writer.lock();

        term::emit(&mut writer_lock, &config, &files, &diagnostic).unwrap()
    }

    fn show_recursive(v: &Spanned<SpannedValue>, source: &str) {
        show_spanned(v, source);
        match v.get_ref() {
            SpannedValue::Array(values) => values.iter().for_each(|v| show_recursive(v, source)),
            SpannedValue::Object(map) => map.iter().for_each(|(k, v)| {
                show_spanned(k, source);
                show_recursive(v, source)
            }),
            _ => {}
        }
    }

    #[test]
    fn test_spanned_value() {
        let data = r#"
        {
            "name": "John Doe",
            "age": 42,
            "phones":    [
                "+44 1234567",
                "+44 2345678"
            ]
        }"#;

        let v: Spanned<SpannedValue> = from_str_spanned(data).unwrap();
        show_recursive(&v, data);
    }
}

Produces

note: Look, it's a serde_json::spanned::value::SpannedValue
  ┌─ input:1:1
  │  
1 │ ╭ 
2 │ │         {
3 │ │             "name": "John Doe",
4 │ │             "age": 42,
  · │
8 │ │             ]
9 │ │         }
  │ ╰─────────^
  │  
  = Object {Spanned { span: Span { start: SpanPosition { line: 3, column: 13, byte_offset: 23 }, end: SpanPosition { line: 3, column: 18, byte_offset: 29 } }, value: "name" }: Spanned { span: Span { start: SpanPosition { line: 3, column: 20, byte_offset: 30 }, end: SpanPosition { line: 3, column: 30, byte_offset: 41 } }, value: String("John Doe") }, Spanned { span: Span { start: SpanPosition { line: 4, column: 13, byte_offset: 55 }, end: SpanPosition { line: 4, column: 17, byte_offset: 60 } }, value: "age" }: Spanned { span: Span { start: SpanPosition { line: 4, column: 19, byte_offset: 61 }, end: SpanPosition { line: 4, column: 21, byte_offset: 64 } }, value: Number(42) }, Spanned { span: Span { start: SpanPosition { line: 5, column: 13, byte_offset: 78 }, end: SpanPosition { line: 5, column: 20, byte_offset: 86 } }, value: "phones" }: Spanned { span: Span { start: SpanPosition { line: 5, column: 22, byte_offset: 87 }, end: SpanPosition { line: 8, column: 13, byte_offset: 167 } }, value: Array [Spanned { span: Span { start: SpanPosition { line: 6, column: 17, byte_offset: 109 }, end: SpanPosition { line: 6, column: 29, byte_offset: 122 } }, value: String("+44 1234567") }, Spanned { span: Span { start: SpanPosition { line: 7, column: 17, byte_offset: 140 }, end: SpanPosition { line: 7, column: 29, byte_offset: 153 } }, value: String("+44 2345678") }] }}
  = From 1:1 (0) to 9:9 (177)

note: Look, it's a alloc::string::String
  ┌─ input:3:13
  │
3 │             "name": "John Doe",
  │             ^^^^^^
  │
  = "name"
  = From 3:13 (23) to 3:18 (29)

note: Look, it's a serde_json::spanned::value::SpannedValue
  ┌─ input:3:20
  │
3 │             "name": "John Doe",
  │                    ^^^^^^^^^^^
  │
  = String("John Doe")
  = From 3:20 (30) to 3:30 (41)

note: Look, it's a alloc::string::String
  ┌─ input:4:13
  │
4 │             "age": 42,
  │             ^^^^^
  │
  = "age"
  = From 4:13 (55) to 4:17 (60)

note: Look, it's a serde_json::spanned::value::SpannedValue
  ┌─ input:4:19
  │
4 │             "age": 42,
  │                   ^^^
  │
  = Number(42)
  = From 4:19 (61) to 4:21 (64)

note: Look, it's a alloc::string::String
  ┌─ input:5:13
  │
5 │             "phones":    [
  │             ^^^^^^^^
  │
  = "phones"
  = From 5:13 (78) to 5:20 (86)

note: Look, it's a serde_json::spanned::value::SpannedValue
  ┌─ input:5:22
  │  
5 │               "phones":    [
  │ ╭──────────────────────^
6 │ │                 "+44 1234567",
7 │ │                 "+44 2345678"
8 │ │             ]
  │ ╰─────────────^
  │  
  = Array [Spanned { span: Span { start: SpanPosition { line: 6, column: 17, byte_offset: 109 }, end: SpanPosition { line: 6, column: 29, byte_offset: 122 } }, value: String("+44 1234567") }, Spanned { span: Span { start: SpanPosition { line: 7, column: 17, byte_offset: 140 }, end: SpanPosition { line: 7, column: 29, byte_offset: 153 } }, value: String("+44 2345678") }]
  = From 5:22 (87) to 8:13 (167)

note: Look, it's a serde_json::spanned::value::SpannedValue
  ┌─ input:6:17
  │
6 │                 "+44 1234567",
  │                 ^^^^^^^^^^^^^
  │
  = String("+44 1234567")
  = From 6:17 (109) to 6:29 (122)

note: Look, it's a serde_json::spanned::value::SpannedValue
  ┌─ input:7:17
  │
7 │                 "+44 2345678"
  │                 ^^^^^^^^^^^^^
  │
  = String("+44 2345678")
  = From 7:17 (140) to 7:29 (153)

Closes: #637

robsdedude avatar Mar 11 '25 08:03 robsdedude

I can't comment as a maintainer of serde_json but as a maintainer of other crates, the best path of action is to dramatically cut down the size and number of concepts. For both review and your own good since there are many open TODO's in that commit and you're not going to want to address them all at once, but over time. Set yourself up for incremental small steps forward, I've seen smaller efforts just lose momentum.

For instance, SpannedValue and the Spanned<T>: Deserialize seem quite orthogonal to each other.


SpannedValue on its own seems very concrete. There is also clear prior reference to its implementation style in the crate. This already calls into question the need for a bool-flag in the deserializer (compare RawValue doesn't need to be turned on in addition to the feature flag). If it gets a feature flag then I don't see why SpannedValue and Spanned<T> should have overlapping ones.

It would also be helpful to expand on the motivating use cases in tests some more. For users that are are deriving Deserialize there are already options:

  • is actually a minimal implementation satisfying the referenced issue
    • for diagnostics, downstream crate want to issue help that refers to spans in the json input after semantically interpreting the json
  • for rewriting, the spans allows us to programmatically replace the value
  • the span allows getting a &'lt RawValue after deserialization

Each of these deserve asserts that demonstrate the additional information is valuable. If you can show that this can do multiple uses cases, it's very convincing. Needing lots of maintainance for niche gains will not be easy to add to any OSS crate. The show call is just there, a proper test is more convincing.

Also probably cut the impl Serializer and impl Deserializer for Spanned<T> initially unless you can link them to one of the above. Those adds hundreds of lines of their own, need separate test coverage, have some complex question with regards to future versions, and should at least be a separate commit.


Then in comparison, I'm not certain the motivation for SpannedValue is nearly as good. Sure, adding span information to all parts of a json Value interesting. But you're littering the overhead everywhere (performance, memory use and thousands of implementation lines) and the question is really for which use case this is the best tool. That doesn't seem nearly as clear to me. Is it even realistic to expect one to build up such a value by hand?

197g avatar Jun 01 '25 18:06 197g

I only found this PR now, after submitting https://github.com/serde-rs/json/pull/1266 which addresses roughly the same problem but in a much more minimal way (does not attempt to track lines/columns and does not touch the Value type, only tracks byte offset for explicitly wrapped serde_spanned::Spanned<T> values).

gnosek avatar Jun 23 '25 07:06 gnosek