rust-csv icon indicating copy to clipboard operation
rust-csv copied to clipboard

document usage of #[serde(flatten)] more thoroughly

Open LPGhatguy opened this issue 5 years ago • 13 comments

I got a cool bug report on a project of mine that uses the csv crate: https://github.com/LPGhatguy/rojo/issues/145

I deserialize files into a Vec of a struct with a shape like this:

#[derive(Debug, Deserialize)]
struct SomethingEntry {
    name: String,

    #[serde(flatten)]
    values: HashMap<String, String>,
}

The actual structure uses #[serde(flatten)] to capture extra columns since it's used in a localization format for a game engine.

What version of the csv crate are you using?

1.0.5

Briefly describe the question, bug or feature request.

Deserializing a field that uses #[serde(flatten)] on a HashMap<String, String> fails if the value in a record looks like a number.

Include a complete program demonstrating a problem.

Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=8d7a007589ba375a79a55b1d825ffdb5

extern crate csv; // 1.0.5
extern crate serde_derive; // 1.0.88
extern crate serde; // 1.0.88

use std::collections::HashMap;
use serde_derive::Deserialize;

#[derive(Debug, Deserialize)]
struct SomethingEntry {
    name: String,
    
    #[serde(flatten)]
    values: HashMap<String, String>,
}

fn main() {
    let source = r#"name,stat
name,Benjamin
maxHealth,300"#;

    let mut rdr = csv::Reader::from_reader(source.as_bytes());
    let records: Vec<SomethingEntry> = rdr.deserialize()
        .map(Result::unwrap)
        .collect();
    
    println!("{:?}", records);
}

What is the observed behavior of the code above?

The program panics, since csv returns an error:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(Deserialize { pos: Some(Position { byte: 24, line: 3, record: 2 }), err: DeserializeError { field: None, kind: Message("invalid type: integer `300`, expected a string") } })', src/libcore/result.rs:997:5
note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

What is the expected or desired behavior of the code above?

The first entry should be have a name of name and values set to {"stat": "Benjamin"}.

The second entry should have a name of maxHealth and values set to {"stat": "300"}.

Since the code works for most inputs, I don't expect it to start failing to parse just because the user put something into the spreadsheet that looks like a number.

LPGhatguy avatar Apr 04 '19 21:04 LPGhatguy

Aw darn, this also fails with strings like true too.

I made a failing test in a local branch, and after about 10 minutes of poking around this feels like it might be to blame:

https://github.com/BurntSushi/rust-csv/blob/216962892ec8ea9f8df2440eb58e6e6b94b17157/src/deserializer.rs#L205-L224

LPGhatguy avatar Apr 04 '19 21:04 LPGhatguy

Great bug report! Thanks. And thank you for the good reproduction.

As far as I can tell, the implementation of infer_deserialize (which is what's used to implement deserialize_any) is correct.

The csv crate doesn't really have anything to do with how serde(flatten) works, which seems to be invoking deserialize_any on the deserializer. It wasn't clear to me why the flatten feature uses deserialize_any, but reading some of the issues on the serde issue tracker helped me understand. For example, this comment in particular: https://github.com/serde-rs/serde/issues/1346#issuecomment-410531952 --- Basically, the idea is that flatten really has to be resolved dynamically, since any number of fields could be inserted at runtime, and their expected types can't be known up front. So, the flatten feature basically has to fall back on to the deserializer to tell Serde what the types of the things actually are.

For things like JSON, this works out great because the type of a value is determined by the JSON syntax itself. But CSV is completely untyped, so we have to effectively "guess" what the value is by trying various things. This is the infer_deserialize function that you pointed out.

So popping up a level at this point, it's now actually possible to see a solution. The problem is that the csv crate is detecting a number in the data, but you are asking for a String. Now normally this isn't a problem for normal struct deserialization. That is, if you specify that a field foo has type String, then Serde knows that it should statically parse that field as a string. csv has no problem with that. But in this case, we don't have static knowledge---at least in the Serde infrastructure---of what the caller wants. So at this point, all you need to do is use a type that can "catch" everything:

#[derive(Deserialize, Debug, PartialEq)]
#[serde(rename_all = "snake_case")]
#[serde(untagged)]
enum Value {
    Bool(bool),
    U64(u64),
    I64(i64),
    Float(f64),
    String(String),
}

Bringing it all together:

extern crate csv; // 1.0.5
extern crate serde_derive; // 1.0.88
extern crate serde; // 1.0.88

use std::collections::HashMap;
use serde_derive::Deserialize;

#[derive(Debug, Deserialize)]
struct SomethingEntry {
    name: String,

    #[serde(flatten)]
    values: HashMap<String, Value>,
}

#[derive(Deserialize, Debug, PartialEq)]
#[serde(rename_all = "snake_case")]
#[serde(untagged)]
enum Value {
    Bool(bool),
    U64(u64),
    I64(i64),
    Float(f64),
    String(String),
}

fn main() {
    let source = r#"name,stat
name,Benjamin
maxHealth,300"#;

    let mut rdr = csv::Reader::from_reader(source.as_bytes());
    let records: Vec<SomethingEntry> = rdr.deserialize()
        .map(Result::unwrap)
        .collect();

    println!("{:?}", records);
}

And the output is:

[SomethingEntry { name: "name", values: {"stat": String("Benjamin")} }, SomethingEntry { name: "maxHealth", values: {"stat": U64(300)} }]

Popping up a level, it does kind of feel like the flatten feature of Serde should be able to figure this out. With that said, I am a mere user of Serde and its constraints are complex and not something I am an expert in.

With respect to the csv crate, I'm going to leave this issue open, because I think this would make a great example to add to the cookbook. Namely, "parse some subset of known fields, but put everything else in a catch-all map" feels like a pretty useful tool to have.

BurntSushi avatar Apr 04 '19 22:04 BurntSushi

Thanks for the super quick response.

This gives me a way forward for my project, which makes me really happy. I guess I also have the option of implementing Deserialize by hand if I want to keep the struct stringy!

I agree that this seems like it should work intuitively with just flatten. I wonder if there's room for a trait to implement on keyed collections like HashMap that would allow them to avoid the deserialize_any path. Thinking on that, that might need specialization or other interesting tricks with the way my mental model of Serde is laid out. 😄

LPGhatguy avatar Apr 04 '19 23:04 LPGhatguy

Oh, this has some other interesting fallout -- things that parse as numbers will be rounded when they try to fit into those types.

Given the example above, we can plug in this CSV:

name,stat
maxHealth,123456789123456789123456789123456789123456789123456789123456789123456789123456789

The number value will turn into:

Float(123456789123456800000000000000000000000000000000000000000000000000000000000000000.0)

...and thus turns back into a string in the rounded form. Luckily, that's enough to represent the largest phone numbers I'm aware of, but I'm kind of afraid this'll cause subtle behavior changes in other places that people aren't expecting.

Similarly, though more unlikely, I'm kind of afraid of rounding affecting values that look like floating point numbers.

EDIT: 😱 strings with leading zeroes have their leading zeroes stripped, so some phone numbers get clobbered when being treated as numbers!!

LPGhatguy avatar Apr 04 '19 23:04 LPGhatguy

I don't really think there's anything I can do about this, honestly. This is really a matter for the flatten serde API. It fundamentally works by asking for the deserializer to infer the type of the next value, and the csv deserializer does the best it can. But it can never be perfect because csv is completely untyped. So as long as you're using Serde functionality that relies on inferring types from csv data, it will always be fundamentally flawed. So my suggestion is that this is either something you live with, or you file a bug against serde.

With that said...

Don't forget, there is always another choice, although an unsatisfactory one: don't use serde.

Another choice is to combine serde and raw record parsing. It's a bit circuitous, but you can:

  1. Parse CSV into StringRecords.
  2. Use the StringRecord::deserialize method to decode into your struct. In this case, you'd abandon the flatten usage completely and just deserialize known fields.
  3. Look for fields in the StringRecord that weren't deserialized, and collect them into a map (or whatever). You would need to maintain a known list of field names separate from your types so that you know which fields weren't decoded.

BurntSushi avatar Apr 05 '19 00:04 BurntSushi

I've run into a similar problem with infer_deserialize when deserializing Decimal from the rust-decimal crate. I opened an issue over there but it occurs to me that this could be fixed in rust-csv if it had a compile time feature to disable type inference.

I tried patching rust-csv to change the implementation of infer_deserialize to be simply:

    fn infer_deserialize<'de, V: Visitor<'de>>(
        &mut self,
        visitor: V,
    ) -> Result<V::Value, DeserializeError> {
        let x = self.next_field()?;
        visitor.visit_str(x)
    }

and I have verified that it fixes both my issue and the test case that @LPGhatguy provided above.

It seems legitimate to want to consider a CSV row as simply a HashMap<String, String>, without trying to inspect the contents of those strings. @BurntSushi, would you consider adding a Cargo feature to rust-csv that replaces the implementation of infer_deserialize as above?

bchallenor avatar Oct 18 '20 12:10 bchallenor

@bchallenor Could you please provide a complete Rust program that compiles, along with any relevant input, your expected output and the actual output? If possible, please briefly describe the problem you're trying to solve. (To be honest, I think it might be better to open a new issue, unless you're 100% confident that this one is the same.)

Your suggested work-around by adding a crate feature to disable inference is probably a non-starter because it likely breaks other functionality. This sort of subtle interaction between features via Cargo features is not the kind of complexity I'd like to add to the crate. Instead, I'd like to look at your problem from first principles, however, I do not see a complete Rust program from you here or in the linked issue.

BurntSushi avatar Oct 18 '20 14:10 BurntSushi

Sure, I'll put together an example in a new issue, thanks.

bchallenor avatar Oct 18 '20 14:10 bchallenor

I feel like this is a related issue. In fact I'd consider it more concrete, since HashMap accepts any key, including the key concretely used in the wrapper, while my example only accepts specific keys. flatten breaks Option:

const FILE: &[u8] = br#"root_num,root_string,num,string
1,a,1,a
,,1,
1,a,,a
"#;
#[derive(Debug, serde::Deserialize)]
struct Row {
    root_num: Option<u32>,
    root_string: Option<String>,
    #[serde(flatten)]
    nest: Nest,
}
#[derive(Debug, serde::Deserialize)]
struct Nest {
    num: Option<u32>,
    string: Option<String>,
}
fn main() {
    for x in csv::Reader::from_reader(std::io::Cursor::new(FILE)).into_deserialize() {
        match x {
            Ok(r @ Row { .. }) => println!("{:?}", r),
            Err(err) => println!("{}", err),
        }
    }
}

outputs

Row { root_num: Some(1), root_string: Some("a"), nest: Nest { num: Some(1), string: Some("a") } }
Row { root_num: None, root_string: None, nest: Nest { num: Some(1), string: Some("") } }
CSV deserialize error: record 3 (line: 4, byte: 45): invalid type: string "", expected u32

Look at it go! Also note the second row that parses nest.string: Some("") but root_string: None

qm3ster avatar Dec 02 '20 11:12 qm3ster

I am trying out csv writer with a flattened rust struct. But even a minimal example is throwing an error.

use csv::Writer;
use serde::Serialize;
use serde_with::with_prefix;

#[derive(Serialize, Default)]
struct B {
	field: usize,
}

#[derive(Serialize, Default)]
struct A {
	other_field: usize,
	#[serde(flatten, with = "b")]
	b: B,
}

with_prefix!(b "b.");

let data = A::default();
let mut wtr = Writer::from_writer(vec![]);
wtr.serialize(data).unwrap();
let csv_line = String::from_utf8(wtr.into_inner().unwrap()).unwrap();
println!("{}", csv_line);

Neither struct has any dynamic fields but I'm getting Error(Serialize("serializing maps is not supported... at wtr.serialize(data).unwrap(). Does this not work well when using flatten and with_prefix? Am I missing something.

twitu avatar Feb 14 '22 13:02 twitu

@twitu does it work without the with = "b"?

qm3ster avatar Feb 14 '22 22:02 qm3ster

use csv::Writer;
use serde::Serialize;

#[derive(Serialize, Default)]
struct B {
	field: usize,
}

#[derive(Serialize, Default)]
struct A {
	other_field: usize,
	#[serde(flatten)]
	b: B,
}

let data = A::default();
let mut wtr = Writer::from_writer(vec![]);
wtr.serialize(data).unwrap();
let csv_line = String::from_utf8(wtr.into_inner().unwrap()).unwrap();
println!("{}", csv_line);

Nope, this is failing as well. So using flatten with rust csv is not at all possible?

twitu avatar Feb 15 '22 13:02 twitu