rust-csv icon indicating copy to clipboard operation
rust-csv copied to clipboard

Write headers with serde when the collection is empty

Open woshilapin opened this issue 5 years ago • 8 comments

On csv 1.1.1

When using serde to serialize a CSV file, the header is not written in the file if Writer::serialize() has not been called. This might seems logical but here is my problem: I'm using a standard format based on CSV which force us to write a file with the headers, even if there is no data associated.

use csv::*;
use serde::*;

#[derive(Serialize, Deserialize)]
struct Object {
    id: u32,
    name: &'static str,
}

fn main() {
    let object = Object {
        id: 1,
        name: "some name",
    };
    let objects: Vec<Object> = vec![object]; // XXXXX
    let mut writer = WriterBuilder::default()
        .has_headers(true)
        .from_path("foo.csv")
        .unwrap();
    for o in objects {
        writer.serialize(o).unwrap();
    }
}

The above program will output the expected file with the header.

id,name
1,some name

But, if you replace the line annotated with XXXXX with

    let objects: Vec<Object> = vec![];

The resulting file will end up completely empty but I'd like to be able to at least produce.

id,name

Obviously, if I never tell the Writer object to do anything, nothing will happen. But I was thinking maybe a Writer::serialize_header() could be provided? In the above piece of code, I could call writer.serialize_header() just above the for loop.

Note: I'm not a rockstar of Rust but I'd be fine to try to implement it if you could provide some guidance. But first, I need to be sure I didn't miss a piece of documentation that would do what I want! And also if you're ok about the possibility to add such a new function and the design of the API.

woshilapin avatar Jul 17 '19 12:07 woshilapin

Hmm, it does seem smart to write the headers here, even if no rows are written. I'd rather not introduce an additional method to do this though. I think it would be better to just make flush do this. So if you wanted to submit a PR, then I'd try this:

  1. Refactor this code into an unexported helper method.
  2. Modify the flush method to call that helper method. The call should be right before the let result ... line and after self.state.panicked = true;. This is a little tricky since you'll need to check both results for errors.
  3. Add a test for it.

That way, this should just happen automatically since flush is called on Drop.

BurntSushi avatar Jul 17 '19 12:07 BurntSushi

Thanks for the quick reply and the guidance. And indeed, your proposition makes a lot of sense and will be even better. I'll take a look at the code then.

woshilapin avatar Jul 17 '19 12:07 woshilapin

TL;DR;

Writer::flush needs a record to figure out the headers... and I'm not sure it makes a lot of sense to send a record to Writer::flush().

Long version

So I tried to implement the solution. First I did a dumb function extraction of this function as suggested into a Writer::serialize_header(). Something seems weird at first: the fact that I needed a record to do that (but I decided to not care for now... spoiler, it was a mistake).

impl<W: io::Write> Writer<W> {
    ...
    fn serialize_header<S: Serialize>(&mut self, record: S) -> Result<()> {
        if let HeaderState::Write = self.state.header {
            let wrote_header = serialize_header(self, &record)?;
            if wrote_header {
                self.write_terminator()?;
                self.state.header = HeaderState::DidWrite;
            } else {
                self.state.header = HeaderState::DidNotWrite;
            };
        }
        Ok(())
    }
    ...
}

And then replace the Writer::serialize() function with the following code.

impl<W: io::Write> Writer<W> {
    ...
    pub fn serialize<S: Serialize>(&mut self, record: S) -> Result<()> {
        self.serialize_header(&record)?;
        serialize(self, &record)?;
        self.write_terminator()?;
        Ok(())
    }
    ...
}

This simple refactor didn't break anything so far (all the tests still passed). But I was starting to wonder "how am I gonna get a record to call Writer::serialize_header() in Writer::flush() function?".

I tried to write some test for the use case in question (serialize only headers), I came up with the following code (based on examples/tutorial-write-serde-02.rs).

use std::error::Error;
use std::io;
use std::process;

use serde::Serialize;

// Note that structs can derive both Serialize and Deserialize!
#[derive(Debug, Serialize)]
#[serde(rename_all = "PascalCase")]
struct Record<'a> {
    city: &'a str,
    state: &'a str,
}

fn run() -> Result<(), Box<dyn Error>> {
    let mut wtr = csv::Writer::from_writer(io::stdout());
    wtr.flush()?;
    Ok(())
}

fn main() {
    if let Err(err) = run() {
        println!("{}", err);
        process::exit(1);
    }
}

But then my compiler start to give me some warning about Record not being used... then I realized why a record was needed to write the headers. The Writer doesn't know in advance which struct is going to be serialized. Therefore, I guess it's possible to serialize different types of struct with the same Writer. This poses the question, which headers should we write in the following case.

use csv::*;
use serde::*;

#[derive(Serialize)]
struct Township {
    township_name: &'static str,
}
#[derive(Serialize)]
struct Village {
    village_name: &'static str,
}

fn main() {
    let township = Township {
        township_name: "Northville Township",
    };
    let village = Village {
        village_name: "Milford",
    };
    let mut writer = WriterBuilder::default().from_path("foo.csv").unwrap();
    writer.serialize(village); // (1)
    writer.serialize(township); // (2)
    writer.flush();
}

The above program will output the following CSV

village_name
Milford
Northville Township

But if I swap the .serialize() lines (annotated with (1) and (2)), then the output CSV has different headers.

township_name
Northville Township
Milford

It seems somehow that the Writer should either know in advance which structure is going to be serialized (but that would constraint the flexibility above of having different struct to serialize in the same file), or that the Writer should be explicitly told about which headers to use.

I'm not even sure which direction to take at this point. Did I miss something obvious that was in front of my eyes?

woshilapin avatar Jul 18 '19 14:07 woshilapin

Ah right, sorry about leading you down a wild goose chase here.

I'm not sure exactly what to do here unfortunately. Tying a Writer to a single serialized type is not what I want to do, partially because a Writer isn't just for Serde serializing and partially because it could hinder ergonomics. i.e., You would almost certainly need a type hint somewhere in order to build such a type.

It seems like the simplest solution here is probably to do as you suggested originally: add a new serialize_header method. The one catch is that it needs to accept a record value in order to push the serialization infrastructure along. It will have to be documented that the record value won't be written, but instead, only the headers will.

(but that would constraint the flexibility above of having different struct to serialize in the same file)

Personally, I don't like having that flexibility, although I could imagine it being useful in some contexts. The flexibility is more of a result of me not wanting to complicate the API more by pinning Writer to a particular type. (Or even adding another Writer-like type just for Serde stuff.)

BurntSushi avatar Jul 20 '19 16:07 BurntSushi

No worry for the wild goose chase, it made me learn a bit about csv code so I don't feel like I lost my time :smile:

I'll see what I'll do about implementing that serialize_header function because indeed, the API would be really weird so not yet sure it's worth doing.

woshilapin avatar Jul 22 '19 12:07 woshilapin

@woshilapin Funnily enough, #164 is a good example of where having a Reader or a Writer not tied to a specific type is beneficial. It affords that extra bit of flexibility that lets you handle crazily formatted data.

BurntSushi avatar Jul 26 '19 11:07 BurntSushi

What a nice use case indeed (even if I don't think I understood all the technicalities 😛). Thank you for pointing it out.

woshilapin avatar Jul 26 '19 20:07 woshilapin

For anyone coming here and as long as no solution to this is merged over here, here's a quick write-down of a workaround I found:

https://stackoverflow.com/a/73375434/2352071

dlaehnemann avatar Aug 16 '22 14:08 dlaehnemann