
Support serializing of maps

Open birktj opened this issue 7 years ago • 30 comments

I want to use this library in order to import/export CSV files to/from a key-value database; I use HashMap<String, MyValueType> to store this data. This works fine for deserializing, but I just discovered that it is not supported for serializing, which feels like an inconsistency in the library. This is a feature that I would appreciate.

birktj avatar Nov 18 '17 22:11 birktj

I don't think this is a matter of will, but rather a missing specification. If someone wants this feature, then they need to propose how it should work, in detail. A principal question that must be answered is the ordering of columns (which ordering?) and whether or not it should be deterministic (and if so, how).

BurntSushi avatar Nov 18 '17 22:11 BurntSushi

(IMO, I think a valid answer to this question is "the library doesn't support it and callers should convert their data type to something else that is already supported." If you're creating a hash map for every record, then you've already given up on making it fast, so the cost of an extra conversion step is just a little more code in the form of a small helper function.)
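As a crate-free sketch of that "convert first" approach (the helper function and column list here are hypothetical, not part of the csv API):

```rust
use std::collections::HashMap;

// Hypothetical helper illustrating the "convert first" approach: pick a
// column order up front, then turn each HashMap record into a plain
// Vec<String> that csv::Writer already knows how to serialize.
// Missing keys become empty fields.
fn record_to_row(columns: &[&str], record: &HashMap<String, String>) -> Vec<String> {
    columns
        .iter()
        .map(|col| record.get(*col).cloned().unwrap_or_default())
        .collect()
}

fn main() {
    let mut record = HashMap::new();
    record.insert("name".to_string(), "alice".to_string());
    record.insert("age".to_string(), "30".to_string());

    let row = record_to_row(&["name", "age", "city"], &record);
    assert_eq!(row, vec!["alice", "30", ""]);
}
```

The caller decides both the ordering and the determinism question, which is exactly the choice the library is avoiding.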

BurntSushi avatar Nov 18 '17 22:11 BurntSushi

You are probably right; this was just the push I needed to actually implement a kind of sorted hashmap, so that I can get the CSV out in the same format that I inserted it in. However, it still seems inconsistent to support maps only for deserialization. I don't think serialization of hashmaps, at least, has to be deterministic: once you choose to use one, you are probably no longer interested in ordering.

birktj avatar Nov 18 '17 23:11 birktj

I don't really buy inconsistency arguments just for the sake of consistency, particularly when there are complications. Reading into a hash map and writing from a hash map have fundamentally different concerns when it comes to CSV data.

I also don't really agree about determinism. Non-determinism is often surprising when you hit it, and while it is appropriate for HashMap to have arbitrary order, I wouldn't want to propagate that out further.

The reason why I think the status quo is a reasonable option is because it allows the library to not make a choice about determinism and ordering, and forces the caller to do it.

With that said, here are other thoughts:

  1. It seems like BTreeMap should be allowed, since the ordering is made clear by the type.
  2. It may be worthwhile to attempt to pop up a level and focus on the problem you're trying to solve. If the problem is, "I want a simple way to read CSV data without caring much about its structure, but still retain the ability to inspect it dynamically via header field names," (which I'm assuming is what you're doing if you're using a HashMap), then there are other approaches here. For example, we might consider a new record type that retains ordering information and exposes a map-like API.
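Point 1 can be illustrated with plain std types: a BTreeMap fixes the column order via its sorted-key iteration, so a header derived from it is deterministic. A minimal sketch of just the ordering claim, no CSV involved:

```rust
use std::collections::BTreeMap;

fn main() {
    // Insertion order is scrambled on purpose; BTreeMap still iterates
    // in sorted key order, so a header derived from it is deterministic.
    let mut record = BTreeMap::new();
    record.insert("city", "Oslo");
    record.insert("age", "30");
    record.insert("name", "alice");

    let header: Vec<&str> = record.keys().copied().collect();
    assert_eq!(header, vec!["age", "city", "name"]);
}
```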

BurntSushi avatar Nov 18 '17 23:11 BurntSushi

I think that serializing maps may be required in order to work with the #[serde(flatten)] attribute.

synek317 avatar Jun 04 '18 13:06 synek317

@synek317 It is, which is what led me to this issue. I need to have flattened data, which should work completely fine in a CSV format.

Kampfkarren avatar Nov 18 '18 05:11 Kampfkarren

I don't see why serializing maps is required in order for serde(flatten) to work. Can you provide an example?

BurntSushi avatar Nov 18 '18 06:11 BurntSushi

https://play.rust-lang.org/?version=stable&mode=debug&edition=2015&gist=8312d6a3ee38bdaed677865367405830

Removing the baz: Baz bit serializes fine.

Kampfkarren avatar Nov 18 '18 06:11 Kampfkarren

Interesting. I didn't expect that. Thanks.


BurntSushi avatar Nov 18 '18 06:11 BurntSushi

Would you know how to solve it? It's a complete blocker of this right now for me 😕

Kampfkarren avatar Nov 18 '18 06:11 Kampfkarren

https://github.com/Kampfkarren/rust-csv/commit/95cea44329d97179ba998a2c30285eeda9c84ac2

This makes flattened maps work, but probably gives weird behavior for unflattened maps which is why I'm not PRing it.

Kampfkarren avatar Nov 18 '18 06:11 Kampfkarren

@Kampfkarren I don't know off the top of my head. I'm not familiar with how serde(flatten) works. I would have thought that it would just generate more code at compile time as if everything was a struct instead of using something dynamic like a map. Moreover, I'm fairly certain that serde(flatten) worked in the csv crate at one point in time, so I'd want to do a bit of code archaeology to figure out what went wrong.

It's a complete blocker of this right now for me

I don't understand this either. serde(flatten) is I'm sure very useful, but AIUI, it's still just a convenience.

This makes flattened maps work, but probably gives weird behavior for unflattened maps which is why I'm not PRing it.

Right. If making serde(flatten) work requires supporting arbitrary maps, then we need to nail down a specification for maps first.

BurntSushi avatar Nov 18 '18 13:11 BurntSushi

cc @dtolnay --- Do you have any insights into changes in how serde(flatten) works? I could have sworn it worked with the CSV crate at one point in time, but now it seems to require the serializer to support maps? Maybe I'm misremembering and only tested serde(flatten) with deserialization though.

BurntSushi avatar Nov 18 '18 14:11 BurntSushi

We haven't changed how serde(flatten) works.

As a workaround for https://github.com/BurntSushi/rust-csv/issues/98#issuecomment-439670448 you can provide the Serialize impl yourself.
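A crate-free sketch of the "flatten by hand" idea behind that workaround (the Record/Baz names echo the playground example earlier in the thread and are otherwise hypothetical): emit the nested struct's fields inline as one flat row, which is what a hand-written Serialize impl would effectively do.

```rust
// Sketch of flattening by hand: instead of relying on #[serde(flatten)],
// emit the nested struct's fields inline as one flat row.
struct Baz {
    x: u64,
    y: u64,
}

struct Record {
    name: String,
    baz: Baz,
}

fn to_flat_row(r: &Record) -> Vec<String> {
    vec![r.name.clone(), r.baz.x.to_string(), r.baz.y.to_string()]
}

fn main() {
    let r = Record { name: "a".into(), baz: Baz { x: 1, y: 2 } };
    assert_eq!(to_flat_row(&r), vec!["a", "1", "2"]);
}
```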

dtolnay avatar Nov 18 '18 22:11 dtolnay

I support this feature request, so here is my proposed specification for serialization of structs and maps.

In order for a struct or map to be serializable to CSV, all values should be either scalars or 1D arrays of equal length. If that condition holds, the CSV output should be modeled after the JSON output, with the option to include or omit a header line. The output order should follow the type's standard iteration order, so output for maps would be unordered unless an ordered map type is used (such as IndexMap).

Here is an example using JSON:

#[derive(Serialize)]
struct MyStruct {
    a: Vec<i32>,
    b: Vec<i32>,
    c: Vec<i32>,
}

let s = MyStruct {
    a: vec![1, 2, 3],
    b: vec![4, 5, 6],
    c: vec![7, 8, 9],
};

let mut m = IndexMap::with_capacity(3);
m.insert("a", vec![1, 2, 3]);
m.insert("b", vec![4, 5, 6]);
m.insert("c", vec![7, 8, 9]);

println!("{}", serde_json::to_string(&s).unwrap());
println!("{}", serde_json::to_string(&m).unwrap());

Note: enable feature "serde-1" of the indexmap crate for this to work.

This is the JSON output (ordered because I used IndexMap, otherwise the map output would be unordered):

{"a":[1,2,3],"b":[4,5,6],"c":[7,8,9]}
{"a":[1,2,3],"b":[4,5,6],"c":[7,8,9]}

According to this proposed specification, the corresponding CSV output would be:

a,b,c
1,4,7
2,5,8
3,6,9
a,b,c
1,4,7
2,5,8
3,6,9
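The column-to-row transpose this spec implies can be sketched with plain std types (columns_to_rows is a hypothetical name, and I'm using key/value pairs in place of a real map):

```rust
// Sketch of the transpose the proposal implies: each map entry is a
// column (name plus equal-length values), and the output is a header
// row followed by one row per index.
fn columns_to_rows(cols: &[(&str, Vec<i32>)]) -> Vec<Vec<String>> {
    let mut rows = vec![cols
        .iter()
        .map(|(name, _)| name.to_string())
        .collect::<Vec<_>>()];
    let len = cols.first().map_or(0, |(_, v)| v.len());
    for i in 0..len {
        rows.push(cols.iter().map(|(_, v)| v[i].to_string()).collect());
    }
    rows
}

fn main() {
    let cols = [("a", vec![1, 2, 3]), ("b", vec![4, 5, 6]), ("c", vec![7, 8, 9])];
    let rows = columns_to_rows(&cols);
    assert_eq!(rows[0], vec!["a", "b", "c"]);
    assert_eq!(rows[1], vec!["1", "4", "7"]);
    assert_eq!(rows[3], vec!["3", "6", "9"]);
}
```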

lucatrv avatar Jan 31 '19 22:01 lucatrv

Here is my use case description. I'm implementing a plotting library (which I plan to publish on crates.io) which relies on the following data model:

use serde::{Serialize, Serializer};
use serde_json::{Map, Value};

#[derive(Debug, PartialEq, Clone, Copy, Default)]
struct DataVar<'a, T>
where
    T: Serialize,
{
    name: &'a str,
    data: &'a [T],
}

impl<'a, T> DataVar<'a, T>
where
    T: Serialize,
{
    pub fn new(name: &'a str, data: &'a [T]) -> Self {
        Self { name, data }
    }
}

#[derive(Debug, PartialEq, Clone, Default)]
struct DataSet<'a> {
    name: &'a str,
    datavars: Map<String, Value>,
}

macro_rules! dataset {
    ($name:expr, $($datavar:expr),*) => {
        {
            let mut datavars = Map::new();
            $(
                datavars.insert($datavar.name.to_string(), serde_json::to_value($datavar.data).unwrap());
            )*
            DataSet{
                name: $name,
                datavars,
            }
        }
    };
}

I use the serde_json::Value type for convenience, in order to cast different &[T] where T: Serialize types to the same map. The serde_json crate is loaded with feature "preserve_order", so maps are ordered.

The user defines new DataVars and DataSets as follows:

let x = DataVar::new("X", &[1., 2., 3.]);
let y = DataVar::new("Y", &[true, false, true]);
let z = DataVar::new("Z", &["a", "b", "c"]);
let ds = dataset!("DS", x, y, z);

DataVars can easily be serialized to a format supported by serde and saved as follows (in this example they are serialized to JSON and CSV and printed to screen):

#[derive(Debug)]
pub enum Format {
    JsonCompact,
    JsonPretty,
    Csv,
}

impl<'a, T> Serialize for DataVar<'a, T>
where
    T: Serialize,
{
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        self.data.serialize(serializer)
    }
}

impl<'a, T> DataVar<'a, T>
where
    T: Serialize,
{
    pub fn save(&self, format: Format) {
        match format {
            Format::JsonCompact => {
                println!("{}", serde_json::to_string(&self).unwrap());
            }
            Format::JsonPretty => {
                println!("{}", serde_json::to_string_pretty(&self).unwrap());
            }
            Format::Csv => {
                let mut wtr = csv::Writer::from_writer(io::stdout());
                wtr.serialize(&self).unwrap();
                wtr.flush().unwrap();
            }
        }
    }
}

The 1D array is serialized to CSV as a row vector. Actually, IMHO it would be better to serialize it as a column vector, because 1D arrays are typically different observations of the same variable, and in a tidy data set each observation of a variable should be in its own row, but I can easily fix that myself. See:

https://en.wikipedia.org/wiki/Tidy_data
https://www.jstatsoft.org/article/view/v059i10
https://jules32.github.io/2016-07-12-Oxford/dplyr_tidyr/
http://garrettgman.github.io/tidying/

Similarly, it is very simple to serialize a DataSet to JSON, but not to CSV:

impl<'a> Serialize for DataSet<'a> {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        self.datavars.serialize(serializer)
    }
}

impl<'a> DataSet<'a> {
    pub fn save(&self, format: Format) {
        match format {
            Format::JsonCompact => {
                println!("{}", serde_json::to_string(&self).unwrap());
            }
            Format::JsonPretty => {
                println!("{}", serde_json::to_string_pretty(&self).unwrap());
            }
            Format::Csv => {
                let mut wtr = csv::Writer::from_writer(io::stdout());
                wtr.serialize(&self).unwrap();
                wtr.flush().unwrap();
            }
        }
    }
}

This code does not work because maps cannot be serialized to CSV. The proposed specification above would make it work, producing results equivalent to JSON and consistent with the tidy data conventions.

lucatrv avatar Feb 01 '19 21:02 lucatrv

For my use case, I came up with the following implementation:

let mut wtr = csv::Writer::from_writer(io::stdout());
let keys: Vec<_> = self.datavars.keys().collect();
let datavars: Vec<_> = self
    .datavars
    .values()
    .map(|v| v.as_array().unwrap())
    .collect();
wtr.write_record(&keys).unwrap();
for i in 0..datavars[0].len() {
    let row: Vec<_> = datavars.iter().map(|v| v[i].clone()).collect();
    wtr.serialize(&row).unwrap();
}
wtr.flush().unwrap();

Could you please advise if it could be made more efficient? Thanks
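Not an authoritative answer, but one likely win is avoiding the per-cell Value::clone() plus a full serialize() pass per row: format each borrowed cell once and reuse a single row buffer. A crate-free sketch of that shape (write_rows is hypothetical, using f64 columns for illustration):

```rust
// Sketch: instead of cloning every cell into a fresh Vec and running it
// through serialize() per row, format each borrowed cell once and reuse
// one row buffer across iterations.
fn write_rows(columns: &[(&str, Vec<f64>)], out: &mut String) {
    let header: Vec<&str> = columns.iter().map(|(k, _)| *k).collect();
    out.push_str(&header.join(","));
    out.push('\n');
    let len = columns.first().map_or(0, |(_, v)| v.len());
    let mut row = Vec::with_capacity(columns.len());
    for i in 0..len {
        row.clear(); // reuse the buffer; no per-row Vec allocation
        for (_, v) in columns {
            row.push(v[i].to_string());
        }
        out.push_str(&row.join(","));
        out.push('\n');
    }
}

fn main() {
    let cols = [("x", vec![1.0, 2.0]), ("y", vec![3.0, 4.0])];
    let mut out = String::new();
    write_rows(&cols, &mut out);
    assert_eq!(out, "x,y\n1,3\n2,4\n");
}
```

With the csv crate itself, the analogous move would be to hand pre-formatted fields to write_record instead of cloning Values and calling serialize per row.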

lucatrv avatar Feb 03 '19 13:02 lucatrv

I also got hit by this issue while attempting to use #[serde(flatten)] to reduce code duplication

fmorency avatar May 14 '19 20:05 fmorency

What do you think of just supporting a top-level map? This would be the same as how nested structs don't work but top-level structs do. This would address the #[serde(flatten)] issue that @fmorency pointed out. (I also ran into the flatten issue today.)

sunjay avatar May 24 '19 15:05 sunjay

Another option would be to always assume nested structs are flattened on output. As CSV is a flat format, there's no point in supporting any other representation of nested structs. The only downside is the inability to configure prefixing with standard serde attributes.

I have a PoC working. It would solve the issue I, @fmorency and @Kampfkarren are having.

P.S.: It has nothing to do with this exact issue, though. I'm only talking about enabling a frequent use case here.

ilya-epifanov avatar Apr 14 '20 22:04 ilya-epifanov

I support this feature request. Here is my use case:

I'm trying to parse binlog files and write row changes to csv using rust_mysql_common. Currently there is no customized serialization for BinlogRow. So I implemented the following on my own:

/// Customized serialized row for a vector of binlog values
pub struct SerializedRow(pub BinlogRow);

impl Serialize for SerializedRow {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: serde::Serializer,
    {
        let mut map = serializer.serialize_map(Some(self.0.len()))?;
        let header_iter = self.0.columns_ref().iter().map(|col| col.name_str());

        for (header, e) in header_iter.zip(self.0.to_owned().unwrap()) {
            if let BinlogValue::Value(v) = e {
                match v {
                    Value::NULL => map.serialize_entry(header.as_ref(), "NULL")?,
                    Value::Bytes(bytes) => {
                        let utf8_str = unsafe { str::from_utf8_unchecked(&bytes) };
                        map.serialize_entry(header.as_ref(), utf8_str)?
                    }
                    Value::Int(x) => map.serialize_entry(header.as_ref(), &x)?,
                    Value::UInt(x) => map.serialize_entry(header.as_ref(), &x)?,
                    Value::Float(x) => map.serialize_entry(header.as_ref(), &x)?,
                    Value::Double(x) => map.serialize_entry(header.as_ref(), &x)?,
                    Value::Date(year, month, day, hour, minute, second, micro_second) => {
                        let date_str = format!(
                            "{}-{}-{} {}:{}:{}:{}",
                            year, month, day, hour, minute, second, micro_second
                        );
                        map.serialize_entry(header.as_ref(), date_str.as_str())?
                    }
                    Value::Time(neg, day, hour, minutes, second, micro_second) => {
                        let time_str = if !neg {
                            format!("{}:{}:{}:{}:{},", day, hour, minutes, second, micro_second)
                        } else {
                            format!("-{}:{}:{}:{}:{},", day, hour, minutes, second, micro_second)
                        };
                        map.serialize_entry(header.as_ref(), time_str.as_str())?
                    }
                }
            }
        }
        map.end()
    }
}
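One small side note on the sketch above: the Date and Time arms format without zero padding, so e.g. June 3 comes out as 2021-6-3 2:5:9:0. If padded MySQL-style output matters, a hypothetical helper could look like:

```rust
// Hypothetical helper: zero-padded MySQL-style datetime formatting,
// e.g. "2021-06-03 02:05:09.000000" rather than "2021-6-3 2:5:9:0".
fn format_datetime(
    year: u16,
    month: u8,
    day: u8,
    hour: u8,
    minute: u8,
    second: u8,
    micro: u32,
) -> String {
    format!(
        "{:04}-{:02}-{:02} {:02}:{:02}:{:02}.{:06}",
        year, month, day, hour, minute, second, micro
    )
}

fn main() {
    assert_eq!(
        format_datetime(2021, 6, 3, 2, 5, 9, 0),
        "2021-06-03 02:05:09.000000"
    );
}
```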

Ideally, when I am writing the SerializedRow to a CSV file, I would just do this:

let mut writer = csv::Writer::from_path("dataset.csv")?;
/* parsing binlog */
....
for row in rows_event.rows(table_map_event) {
    let (_, mut after) = row.unwrap();
    if after.is_some() {
        let row = after.take().unwrap();
        let srow = serialize::SerializedRow(row);
        // print!("{}", srow);
        writer.serialize(srow)?;
    }
....
writer.flush()?;

OscarTHZhang avatar Jun 23 '21 02:06 OscarTHZhang