
Request: Reader::clear_headers()

Open abonander opened this issue 6 years ago • 3 comments

Version: 1.1.1

At work I have a task to parse a CSV file containing two tables one after another with different field names and cardinality (I don't have any control over this format):

# date the file was generated, only ever one row
year,month,day,hour,minute,second
2019,07,19,07,58,17
# proprietary data, arbitrary number of rows
foo,bar,baz,quux
0,0,0,0
1,1,1,1
# ... etc

I decided to parse it into two different structs using Serde and combine them manually:

#[derive(Debug, Deserialize, PartialEq, Eq)]
pub struct FileDate {
    pub year: u16,
    pub month: u8,
    pub day: u8,
    pub hour: u8,
    pub minute: u8,
    pub second: u8,
}

#[derive(Debug, Deserialize, PartialEq, Eq)]
pub struct ProprietaryData {
    pub foo: u64,
    pub bar: u64,
    pub baz: u64,
    pub quux: u64,
}

pub struct DataWithDate {
    date: FileDate,
    data: ProprietaryData,
}

My first instinct was to use two separate readers:

pub fn parse<P: AsRef<Path>>(path: P) -> Result<DataWithDate, failure::Error> {
    let mut file = File::open(path)?; 

    let date = ReaderBuilder::new()
        .from_reader(&mut file)
        .into_deserialize::<FileDate>()
        .next()
        .ok_or_else(|| failure::err_msg("did not find FileDate in balance CSV"))??;

    // then read the rest of the data
    let data = ReaderBuilder::new()
        .from_reader(&mut file)
        .into_deserialize()
        .collect::<Result<Vec<ProprietaryData>, csv::Error>>()?;

    Ok(DataWithDate { date, data })
}

However, that didn't work: Reader wraps its input in a BufReader internally, so when the first reader is dropped it discards buffered data that the second reader still needs.

I then figured I could use the same reader and replace the headers, but that required setting .flexible(true), which weakens validation:

pub fn parse<P: AsRef<Path>>(path: P) -> Result<DataWithDate, failure::Error> {
    let mut reader = ReaderBuilder::new()
        .flexible(true)
        .from_path(path)?; 

    let date = reader
        .deserialize::<FileDate>()
        .next()
        .ok_or_else(|| failure::err_msg("did not find FileDate in balance CSV"))??;

    // read the headers for the data and set it
    let mut data_header = StringRecord::new();

    // this requires setting `.flexible(true)`, otherwise it returns an
    // error about the cardinality of the headers row
    if !reader.read_record(&mut data_header)? {
        return Err(failure::err_msg("failed to read data header"));
    }

    reader.set_headers(data_header);

    // then read the rest of the data
    let data = reader
        .into_deserialize()
        .collect::<Result<Vec<ProprietaryData>, csv::Error>>()?;

    Ok(DataWithDate { date, data })
}

I would like a Reader::clear_headers() method that makes the reader behave as if it were at the beginning of a new table:

pub fn parse<P: AsRef<Path>>(path: P) -> Result<DataWithDate, failure::Error> {
    let mut reader = ReaderBuilder::new()
        .flexible(true)
        .from_path(path)?; 

    let date = reader
        .deserialize::<FileDate>()
        .next()
        .ok_or_else(|| failure::err_msg("did not find FileDate in balance CSV"))??;

    // clear the headers so it reads the new ones
    reader.clear_headers();

    // then read the rest of the data
    let data = reader
        .into_deserialize()
        .collect::<Result<Vec<ProprietaryData>, csv::Error>>()?;

    Ok(DataWithDate { date, data })
}

I don't really like having to set .flexible(true) because it's only needed to read that second header row; I'd prefer to stay robust against format changes and mistakes (which is the same reason I don't want to set .has_headers(false)).

abonander avatar Jul 26 '19 02:07 abonander

I expect this will be a simple addition as it should just involve setting self.state.headers = None.

abonander avatar Jul 26 '19 02:07 abonander

This sounds somewhat reasonable to me, although, I think if we think about this more holistically, then we can solve your flexible(true) issue as well. In particular, if we think about this as "reset the reader" rather than "clear the headers," then the semantics of resetting the reader could be, "clear all state associated with the reader, and behave as if the next read is the first record of another CSV data set." That way, the flexible state would justifiably get reset as well. Its use case is also very clear: "this is useful when there are multiple CSV data sets concatenated in a single stream."

With that said, one caveat here is that using this routine seems to imply that you know when your first CSV data set ends. It's easy in your case, since it's only a single record. But you might not know this in general, and instead, might only know it when you read the header row of the next CSV data set. Getting that right seems trickier. Neither my idea nor your idea really covers that. I suppose that probably needs a "peek" operation, which could then be followed by a reset-like call, I think. Does that make sense?
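To make that concrete, a peek-then-reset flow might look like the sketch below; peek_record, reset, and looks_like_header are hypothetical names for illustration, not existing csv APIs:

```rust
// Hypothetical sketch -- `peek_record` and `reset` do not exist in csv today.
let mut reader = ReaderBuilder::new().from_path(path)?;

let date: FileDate = reader
    .deserialize()
    .next()
    .ok_or_else(|| failure::err_msg("did not find FileDate"))??;

// Peek at the next row without consuming it or validating its width.
// If it looks like the header of a new table, reset the reader so the
// next read behaves like the start of a fresh CSV data set.
if let Some(peeked) = reader.peek_record()? {
    if looks_like_header(&peeked) {
        reader.reset();
    }
}

let data: Vec<ProprietaryData> = reader
    .deserialize()
    .collect::<Result<_, _>>()?;
```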

BurntSushi avatar Jul 26 '19 11:07 BurntSushi

You wouldn't need to reset, just avoid consuming the data from the buffer. Perhaps a Peek wrapper that issues the consume when dropped or when explicitly told to (i.e. when converted to an owned record or used to deserialize). The "peek" operation would also need to skip validation, which would also support my workaround without requiring .flexible(true).

I don't know whether resetting the reader should reset the flexible flag, though; I just don't want to have to set it to true in the first place. As I understand it, if the reader simply forgot the headers while has_headers remained true, it would read the next row as the header of a new table.

abonander avatar Jul 26 '19 18:07 abonander