rust-csv
Support serializing of maps
I want to use this library to import/export CSV files to/from a key-value database; I use `HashMap<String, MyValueType>` to store the data. This works fine for deserializing, but I just discovered that serializing is not supported, which feels like an inconsistency in the library. This is a feature I would appreciate.
I don't think this is a matter of will, but rather a missing specification. If someone wants this feature, then they need to propose how it should work, in detail. A principal question that must be answered is the ordering of columns (which ordering?) and whether or not it should be deterministic (and if so, how).
(IMO, I think a valid answer to this question is "the library doesn't support it and callers should convert their data type to something else that is already supported." If you're creating a hash map for every record, then you've already given up on making it fast, so the cost of an extra conversion step is just a little more code in the form of a small helper function.)
You are probably right; this was just the push I needed to actually implement a kind of sorted hashmap so that I can get the CSV out in the same format that I inserted it in. However, it still seems like an inconsistency to support only deserializing of maps. I don't think serialization of hashmaps, at least, has to be deterministic; once you choose to use one, you are probably no longer interested in ordering.
I don't really buy inconsistency arguments just for the sake of consistency, particularly when there are complications. Reading into a hash map and writing from a hash map have fundamentally different concerns when it comes to CSV data.
I also don't really agree about determinism. Non-determinism is often surprising when you hit it, and while it is appropriate for `HashMap` to have arbitrary order, I wouldn't want to propagate that out further.
The reason why I think the status quo is a reasonable option is because it allows the library to not make a choice about determinism and ordering, and forces the caller to do it.
With that said, here are other thoughts:
- It seems like `BTreeMap` should be allowed, since the ordering is made clear by the type.
- It may be worthwhile to attempt to pop up a level and focus on the problem you're trying to solve. If the problem is, "I want a simple way to read CSV data without caring much about its structure, but still retain the ability to inspect it dynamically via header field names" (which I'm assuming is what you're doing if you're using a `HashMap`), then there are other approaches here. For example, we might consider a new record type that retains ordering information and exposes a map-like API.
I think that serializing maps may be required for the `#[serde(flatten)]` attribute to work.
@synek317 It is, which is what led me to this issue. I need flattened fields, which should work completely fine in a CSV format.
I don't see why serializing maps is required in order for serde(flatten) to work. Can you provide an example?
https://play.rust-lang.org/?version=stable&mode=debug&edition=2015&gist=8312d6a3ee38bdaed677865367405830
Removing the `baz: Baz` bit serializes fine.
Interesting. I didn't expect that. Thanks.
Would you know how to solve it? It's a complete blocker of this right now for me 😕
https://github.com/Kampfkarren/rust-csv/commit/95cea44329d97179ba998a2c30285eeda9c84ac2
This makes flattened maps work, but probably gives weird behavior for unflattened maps which is why I'm not PRing it.
@Kampfkarren I don't know off the top of my head. I'm not familiar with how `serde(flatten)` works. I would have thought that it would just generate more code at compile time, as if everything were a struct, instead of using something dynamic like a map. Moreover, I'm fairly certain that `serde(flatten)` worked in the csv crate at one point in time, so I'd want to do a bit of code archaeology to figure out what went wrong.
> It's a complete blocker of this right now for me

I don't understand this either. `serde(flatten)` is, I'm sure, very useful, but AIUI, it's still just a convenience.
> This makes flattened maps work, but probably gives weird behavior for unflattened maps which is why I'm not PRing it.

Right. If making `serde(flatten)` work requires supporting arbitrary maps, then we need to nail down a specification for maps first.
cc @dtolnay --- Do you have any insights on changes in how `serde(flatten)` works? I could have sworn it worked with the csv crate at one point in time, but now it seems to require the serializer to support maps? Maybe I'm misremembering and only tested `serde(flatten)` with deserialization, though.
We haven't changed how `serde(flatten)` works.
As a workaround for https://github.com/BurntSushi/rust-csv/issues/98#issuecomment-439670448 you can provide the Serialize impl yourself.
I support this feature request, so here is my proposed specification for serialization of structs and maps.
In order for a struct or map to be serializable to CSV, all value fields should either be scalars or equal-length 1D arrays. If this holds, the CSV output should be modeled after the JSON output, with the additional possibility of writing either with or without a header line. The output order should follow the standard iteration order, so output for maps would be unordered unless an ordered map type is used (such as `IndexMap`).
Here is an example using JSON:
```rust
use indexmap::IndexMap;
use serde::Serialize;

#[derive(Serialize)]
struct MyStruct {
    a: Vec<i32>,
    b: Vec<i32>,
    c: Vec<i32>,
}

let s = MyStruct {
    a: vec![1, 2, 3],
    b: vec![4, 5, 6],
    c: vec![7, 8, 9],
};
let mut m = IndexMap::with_capacity(3);
m.insert("a", vec![1, 2, 3]);
m.insert("b", vec![4, 5, 6]);
m.insert("c", vec![7, 8, 9]);
println!("{}", serde_json::to_string(&s).unwrap());
println!("{}", serde_json::to_string(&m).unwrap());
```
Note: enable the "serde-1" feature of the `indexmap` crate for this to work.

This is the JSON output (ordered because I used `IndexMap`; otherwise the map output would be unordered):
```
{"a":[1,2,3],"b":[4,5,6],"c":[7,8,9]}
{"a":[1,2,3],"b":[4,5,6],"c":[7,8,9]}
```
Under this proposed specification, the corresponding CSV output (again once for the struct and once for the map) would be:

```
a,b,c
1,4,7
2,5,8
3,6,9
a,b,c
1,4,7
2,5,8
3,6,9
```
Here is my use case. I'm implementing a plotting library (which I plan to publish on crates.io) that relies on the following data model:
```rust
use serde::{Serialize, Serializer};
use serde_json::{Map, Value};

#[derive(Debug, PartialEq, Clone, Copy, Default)]
struct DataVar<'a, T>
where
    T: Serialize,
{
    name: &'a str,
    data: &'a [T],
}

impl<'a, T> DataVar<'a, T>
where
    T: Serialize,
{
    pub fn new(name: &'a str, data: &'a [T]) -> Self {
        Self { name, data }
    }
}

#[derive(Debug, PartialEq, Clone, Default)]
struct DataSet<'a> {
    name: &'a str,
    datavars: Map<String, Value>,
}

macro_rules! dataset {
    ($name:expr, $($datavar:expr),*) => {
        {
            let mut datavars = Map::new();
            $(
                datavars.insert($datavar.name.to_string(), serde_json::to_value($datavar.data).unwrap());
            )*
            DataSet {
                name: $name,
                datavars,
            }
        }
    };
}
```
I use the `serde_json::Value` type for convenience, in order to cast different `&[T] where T: Serialize` types to the same map. The `serde_json` crate is loaded with the "preserve_order" feature, so maps are ordered.
The user defines new `DataVar`s and `DataSet`s as follows:

```rust
let x = DataVar::new("X", &[1., 2., 3.]);
let y = DataVar::new("Y", &[true, false, true]);
let z = DataVar::new("Z", &["a", "b", "c"]);
let ds = dataset!("DS", x, y, z);
```
`DataVar`s can easily be serialized to any format supported by serde and saved as follows (in this example they are serialized to JSON and CSV and printed to the screen):
```rust
use std::io;

#[derive(Debug)]
pub enum Format {
    JsonCompact,
    JsonPretty,
    Csv,
}

impl<'a, T> Serialize for DataVar<'a, T>
where
    T: Serialize,
{
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        self.data.serialize(serializer)
    }
}

impl<'a, T> DataVar<'a, T>
where
    T: Serialize,
{
    pub fn save(&self, format: Format) {
        match format {
            Format::JsonCompact => {
                println!("{}", serde_json::to_string(&self).unwrap());
            }
            Format::JsonPretty => {
                println!("{}", serde_json::to_string_pretty(&self).unwrap());
            }
            Format::Csv => {
                let mut wtr = csv::Writer::from_writer(io::stdout());
                wtr.serialize(&self).unwrap();
                wtr.flush().unwrap();
            }
        }
    }
}
```
The 1D array is serialized to CSV as a row vector. IMHO it would actually be better to serialize it as a column vector, because 1D arrays are typically different observations of the same variable, and in a tidy data set each observation of a variable should be in a different row, but I can easily fix that myself. See:
https://en.wikipedia.org/wiki/Tidy_data
https://www.jstatsoft.org/article/view/v059i10
https://jules32.github.io/2016-07-12-Oxford/dplyr_tidyr/
http://garrettgman.github.io/tidying/
Similarly, it is very simple to serialize a `DataSet` to JSON, but not to CSV:
```rust
impl<'a> Serialize for DataSet<'a> {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        self.datavars.serialize(serializer)
    }
}

impl<'a> DataSet<'a> {
    pub fn save(&self, format: Format) {
        match format {
            Format::JsonCompact => {
                println!("{}", serde_json::to_string(&self).unwrap());
            }
            Format::JsonPretty => {
                println!("{}", serde_json::to_string_pretty(&self).unwrap());
            }
            Format::Csv => {
                let mut wtr = csv::Writer::from_writer(io::stdout());
                wtr.serialize(&self).unwrap();
                wtr.flush().unwrap();
            }
        }
    }
}
```
This code does not work because maps cannot be serialized to CSV. The proposed specification above would make it work, producing results equivalent to JSON and conforming to the tidy data conventions.
For my use case, I came up with the following implementation:

```rust
let mut wtr = csv::Writer::from_writer(io::stdout());
let keys: Vec<_> = self.datavars.keys().collect();
let datavars: Vec<_> = self
    .datavars
    .values()
    .map(|v| v.as_array().unwrap())
    .collect();
wtr.write_record(&keys).unwrap();
for i in 0..datavars[0].len() {
    let row: Vec<_> = datavars.iter().map(|v| v[i].clone()).collect();
    wtr.serialize(&row).unwrap();
}
wtr.flush().unwrap();
```
Could you please advise if it could be made more efficient? Thanks
I also got hit by this issue while attempting to use `#[serde(flatten)]` to reduce code duplication.
What do you think of just supporting a top-level map? This would mirror how nested structs don't work but top-level structs do. It would address the `#[serde(flatten)]` issue that @fmorency pointed out. (I also ran into the `flatten` issue today.)
Another option would be to always flatten nested structs on output. As CSV is a flat format, there's no point in supporting any other representation of nested structs. The only downside is the inability to configure prefixing with standard serde attributes.
I have a PoC working. It would solve the issue I, @fmorency and @Kampfkarren are having.
P.S.: It has nothing to do with this exact issue, though. I'm only talking about enabling a frequent use case here.
I support this feature request. Here is my use case: I'm trying to parse binlog files and write row changes to CSV using rust_mysql_common. Currently there is no customized serialization for `BinlogRow`, so I implemented the following on my own:
```rust
/// Customized serialized row for a vector of binlog values
pub struct SerializedRow(pub BinlogRow);

impl Serialize for SerializedRow {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: serde::Serializer,
    {
        let mut map = serializer.serialize_map(Some(self.0.len()))?;
        let header_iter = self.0.columns_ref().iter().map(|col| col.name_str());
        for (header, e) in header_iter.zip(self.0.to_owned().unwrap()) {
            if let BinlogValue::Value(v) = e {
                match v {
                    Value::NULL => map.serialize_entry(header.as_ref(), "NULL")?,
                    Value::Bytes(bytes) => {
                        let utf8_str = unsafe { str::from_utf8_unchecked(&bytes) };
                        map.serialize_entry(header.as_ref(), utf8_str)?
                    }
                    Value::Int(x) => map.serialize_entry(header.as_ref(), &x)?,
                    Value::UInt(x) => map.serialize_entry(header.as_ref(), &x)?,
                    Value::Float(x) => map.serialize_entry(header.as_ref(), &x)?,
                    Value::Double(x) => map.serialize_entry(header.as_ref(), &x)?,
                    Value::Date(year, month, day, hour, minute, second, micro_second) => {
                        let date_str = format!(
                            "{}-{}-{} {}:{}:{}:{}",
                            year, month, day, hour, minute, second, micro_second
                        );
                        map.serialize_entry(header.as_ref(), date_str.as_str())?
                    }
                    Value::Time(neg, day, hour, minutes, second, micro_second) => {
                        let time_str = if !neg {
                            format!("{}:{}:{}:{}:{}", day, hour, minutes, second, micro_second)
                        } else {
                            format!("-{}:{}:{}:{}:{}", day, hour, minutes, second, micro_second)
                        };
                        map.serialize_entry(header.as_ref(), time_str.as_str())?
                    }
                }
            }
        }
        map.end()
    }
}
```
Ideally, when writing the `SerializedRow` to a CSV file, I would just do this:
```rust
let mut writer = csv::Writer::from_path("dataset.csv")?;
/* parsing binlog */
// ...
for row in rows_event.rows(table_map_event) {
    let (_, mut after) = row.unwrap();
    if let Some(row) = after.take() {
        writer.serialize(serialize::SerializedRow(row))?;
    }
}
// ...
writer.flush()?;
```