Extract a subset of fields, chosen at runtime, by name
**What version of the csv crate are you using?**
1.1.1
**Briefly describe your feature request.**
I have a program that parses a CSV file with many columns, only some of which are relevant to any given run of the program. Here is a cut-down example CSV file: the real thing has many more rows and also many more columns labeled with uppercase three-character codes.
```csv
addr,id,longitude,latitude,ABW,AFG,AGO,AIA,ALA
2.58.52.152,1716,8.6821267,50.1109221,8138590,4481900,6051510,7167090,1309850
3.17.51.141,1545,-82.9987942,39.9611755,3284140,10847630,10792390,3021880,6873370
3.121.238.252,1546,8.6821267,50.1109221,8138590,4481900,6051510,7167090,1309850
```
On each run of the program, it wants to deserialize addr, longitude, latitude, and one of the columns labeled with a three-character code, into this structure:
```rust
struct HostRecord {
    addr: IpAddr,
    longitude: f64,
    latitude: f64,
    distance: f64,
}
```
The catch is that which of the three-character-code columns should be deserialized into the distance field varies at runtime. There doesn't seem to be any way to make Serde do that. There also doesn't seem to be any way to extract a subset of the fields from a StringRecord or BytesRecord by name. The best I have managed to do is scan the headers record manually for each relevant field, make note of their indices, and then manually extract each field by index, as shown below. It's tedious to write and easy to screw up.
**Include a complete program demonstrating a problem.**
```rust
use std::fs::File;
use std::path::Path;

use failure::{Error, Fail}; // error handling via the `failure` crate

// (HostRecord as defined above)

struct CSVColumns {
    addr: usize,
    longitude: usize,
    latitude: usize,
    distance: usize,
}

/// Error type for select_columns.
#[derive(Debug, Fail)]
#[fail(display = "missing columns: {}", _0)]
struct MissingColumnsError(String);

fn select_columns(header: &csv::StringRecord, loc: &str)
    -> Result<CSVColumns, MissingColumnsError>
{
    let mut addr: Option<usize> = None;
    let mut longitude: Option<usize> = None;
    let mut latitude: Option<usize> = None;
    let mut distance: Option<usize> = None;
    let mut wanted = 4;
    for (index, field) in header.iter().enumerate() {
        match field {
            "addr" => { addr = Some(index); wanted -= 1; },
            "longitude" => { longitude = Some(index); wanted -= 1; },
            "latitude" => { latitude = Some(index); wanted -= 1; },
            f if f == loc => { distance = Some(index); wanted -= 1; },
            _ => {},
        }
        if wanted == 0 { break; }
    }
    if wanted == 0 {
        Ok(CSVColumns {
            addr: addr.unwrap(),
            longitude: longitude.unwrap(),
            latitude: latitude.unwrap(),
            distance: distance.unwrap(),
        })
    } else {
        let mut missing: Vec<&str> = Vec::with_capacity(4);
        if addr.is_none() { missing.push("addr"); }
        if longitude.is_none() { missing.push("longitude"); }
        if latitude.is_none() { missing.push("latitude"); }
        if distance.is_none() { missing.push(loc); }
        Err(MissingColumnsError(missing.join(", ")))
    }
}

fn load_hosts(fname: &Path, loc: &str) -> Result<Vec<HostRecord>, Error> {
    use std::str::from_utf8;
    let fp = File::open(fname)?;
    let mut rd = csv::ReaderBuilder::new()
        .has_headers(true)
        .trim(csv::Trim::All)
        .from_reader(fp);
    let cols = select_columns(rd.headers()?, loc)?;
    // Don't bother doing UTF-8 validation on the columns we're not
    // interested in.
    let mut row = csv::ByteRecord::new();
    let mut v: Vec<HostRecord> = Vec::new();
    while rd.read_byte_record(&mut row)? {
        v.push(HostRecord {
            addr: from_utf8(&row[cols.addr])?.parse()?,
            longitude: from_utf8(&row[cols.longitude])?.parse()?,
            latitude: from_utf8(&row[cols.latitude])?.parse()?,
            distance: from_utf8(&row[cols.distance])?.parse()?,
        });
    }
    Ok(v)
}
```
**What is the expected or desired behavior of the code above?**
The above code does work; it's just that there should be a better way to write it. Off the top of my head, a plausible "better way" would be a `Reader` method `find_columns` that tells you the indices for a set of column names, allowing me to dispense with `select_columns` and write something like this instead:
```rust
fn load_hosts(fname: &Path, loc: &str) -> Result<Vec<HostRecord>, Error> {
    use std::str::from_utf8;
    let fp = File::open(fname)?;
    let mut rd = csv::ReaderBuilder::new()
        .has_headers(true)
        .trim(csv::Trim::All)
        .from_reader(fp);
    // find_columns fails if any columns are missing, or produces a
    // HashMap<&'a str, usize> mapping column names to indices
    let cols = rd.find_columns(&["addr", "latitude", "longitude", loc])?;
    // Don't bother doing UTF-8 validation on the columns we're not
    // interested in.
    let mut row = csv::ByteRecord::new();
    let mut v: Vec<HostRecord> = Vec::new();
    while rd.read_byte_record(&mut row)? {
        v.push(HostRecord {
            addr: from_utf8(&row[cols["addr"]])?.parse()?,
            longitude: from_utf8(&row[cols["longitude"]])?.parse()?,
            latitude: from_utf8(&row[cols["latitude"]])?.parse()?,
            distance: from_utf8(&row[cols[loc]])?.parse()?,
        });
    }
    Ok(v)
}
```
A further improvement would be a way to ask the reader to return only the desired columns from each row, which would both speed up parsing (since only those columns would need to be copied and UTF-8 validated), and enable use of serde again:
```rust
fn load_hosts(fname: &Path, loc: &str) -> Result<Vec<HostRecord>, Error> {
    let fp = File::open(fname)?;
    // csv does its own buffering, no need for a BufReader
    let mut rd = csv::ReaderBuilder::new()
        .has_headers(true)
        .trim(csv::Trim::All)
        .from_reader(fp);
    // find_columns fails if any columns are missing, or produces a
    // HashMap<&'a str, usize> mapping column names to indices
    let cols = rd.find_columns(&["addr", "latitude", "longitude", loc])?;
    let mut row = csv::StringRecord::new();
    let mut v: Vec<HostRecord> = Vec::new();
    while rd.read_columns(&mut row, &cols)? {
        v.push(row.deserialize::<HostRecord>()?);
    }
    Ok(v)
}
```
Thanks for the detailed feature request!
So there is a key difference between the code you wrote manually and the "automatic" version supplied by the reader: your code knows how many fields you want, so you can just use a struct. But the library's version of `find_columns` wouldn't know that. The most straightforward way to implement it would be with a hash map, which you could also do yourself, and the code would be much simpler.
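The hash-map version described here might look like the following sketch (`column_map` is a hypothetical helper, not part of the `csv` API):

```rust
use std::collections::HashMap;

// Hypothetical helper: build a name -> index map from the header row
// once, then look up whichever columns this run of the program needs.
fn column_map<'a>(header: &[&'a str]) -> HashMap<&'a str, usize> {
    header.iter().enumerate().map(|(i, &h)| (h, i)).collect()
}

fn main() {
    let header = ["addr", "id", "longitude", "latitude", "AIA"];
    let cols = column_map(&header);
    assert_eq!(cols["addr"], 0);
    assert_eq!(cols["AIA"], 4);
    // A missing column is a `None` lookup rather than a panic:
    assert!(cols.get("ZZZ").is_none());
}
```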
I am more tempted by your last suggestion, where you tell the reader which columns you want, and it populates the record appropriately. It would still have to use a map internally though, which is going to be slower than your hand-rolled version I imagine. Not sure if it is a meaningful difference though.
I'm overall not completely convinced that this is a good idea, mostly because I think it's probably fairly niche, but also because it adds some complexity to the reader API. For example, how does the `flexible` option interact with this feature? It also adds another dimension of statefulness to the `Reader` (which is already very stateful), in that in order to use `read_columns` the caller must know to call `find_columns` first.
I don't think this (specifically, wanting to process a subset of the columns of a CSV file, whose names are not known at compile time) is that niche: it seems to me it might come up any time the input CSV contains what statistics people call "wide data".
I do appreciate your concerns about adding state to Reader. Would you feel more comfortable with these suggested methods if they used vectors of indices, like this, instead?
```rust
impl Reader {
    fn find_named_columns(&self, names: &Vec<&str>) -> Result<Vec<usize>>;
    fn read_columns(&mut self, record: &mut StringRecord, indices: &Vec<usize>) -> Result<bool>;
    // also presumably columns(), into_columns(), byte_columns(), etc.
}
```
Then you could use read_columns to extract columns 1, 3, and 6 of a headerless CSV, or you could use find_named_columns to discover that the "addr", "latitude", and "longitude" columns of a headerful CSV are in fact columns 1, 3, and 6, and make whatever use of that information you see fit. No state needs to be added to Reader.
(A tricky bit is that the ordering of the Vec produced by find_named_columns needs to match the ordering of the Vec passed in, not the order of the columns within the file, but read_columns wants the opposite. But it's going to be a short vector either way so I don't think it's a big problem.)
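The two orderings can be shown in a small std-only sketch (`find_named_columns` here is a hypothetical free function over a plain header slice, not an actual `csv` method):

```rust
// Hypothetical free-function version of find_named_columns,
// illustrating the two orderings involved.
fn find_named_columns(header: &[&str], names: &[&str]) -> Option<Vec<usize>> {
    // Indices come back in *request* order; None if any name is missing.
    names
        .iter()
        .map(|n| header.iter().position(|h| h == n))
        .collect()
}

fn main() {
    let header = ["addr", "id", "longitude", "latitude", "AIA"];
    let by_name =
        find_named_columns(&header, &["addr", "latitude", "longitude", "AIA"]).unwrap();
    assert_eq!(by_name, vec![0, 3, 2, 4]); // request order
    // read_columns would want left-to-right file order instead:
    let mut by_file = by_name.clone();
    by_file.sort_unstable();
    assert_eq!(by_file, vec![0, 2, 3, 4]);
}
```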
I'm not sure what to do about flexible mode, because what if the missing field(s) are supposed to go in the middle of the StringRecord according to the order of indices? I could live with this mechanism just not working in flexible mode, or with it producing empty strings for the missing fields.
> I don't think this (specifically, wanting to process a subset of the columns of a CSV file, whose names are not known at compile time) is that niche
No, that isn't, not alone. But combining that need with going as fast as possible is what seems niche to me.
I think I like your API better, since it doesn't rely on making Reader more stateful.
> I'm not sure what to do about flexible mode, because what if the missing field(s) are supposed to go in the middle of the StringRecord according to the order of indices? I could live with this mechanism just not working in flexible mode, or with it producing empty strings for the missing fields.
Yeah, so I guess our choices are either "return an error if indices are invalid" or "always return a record with a number of fields equal to the number of columns requested, where some fields may be empty."
I thought about this some more on the way home. The serde integration almost does what I want, already. It selects a subset of the columns to populate a struct with, and I can arbitrarily configure the mapping between CSV column names and struct field names, but that mapping has to be fixed at compile time. The missing piece is being able to say, at runtime, "on this run of the program, the distance field of HostRecord should be deserialized from the AIA column of the CSV."
I looked at the csv 1.1.1 source code and it's not apparent to me how the deserializer logic knows which fields of the CSV to stick into the struct, but it sure seems like a runtime-configurable mapping ought to be possible.
The deserializer is unfortunately a big pile of goop. I have to re-read it thoroughly every time I have a question about its behavior.
I think the trace here is:
1. `deserialize_struct` invokes a map visitor.
2. The map visitor picks off the next header field from an iterator over all header fields. This iterator is generated, for each record deserialized, from the header row.
So basically, the map visitor returns the header field back to Serde, and Serde takes care of wiring it up to the actual struct field. That seems pretty hard to override, and it makes a certain amount of sense: the logic for mapping field names to struct fields has to be wired in at compile time via `derive(Deserialize)`.
Interesting discussion! I came here because I found no example of transforming a deserialized struct into a slightly different struct (without manually serializing all the way). @zackw mentioned serde can do such a mapping at compile time (which would be sufficient for me) but I haven't figured it out yet (serde is complex... :sweat: ) - a pointer or example would be appreciated.
I'm thinking of a use-case like this:
```rust
struct input {
    addr: usize,
    longitude: usize,
    latitude: usize,
    distance: usize,
}

struct output {
    new_attribute: String,
    derived_addr: usize,
    // removed attribute:
    distance: usize,
    // unchanged attributes:
    longitude: usize,
    latitude: usize,
}
```
[...some code for serializing, mapping and de-serializing the above objects...]
EDIT: after searching some more, I found an example in the cookbook that uses a Writer with a struct. So sorry for the noise; I'm still a Rust rookie.
This is a great library with outstanding performance; thanks for building it!