rust-csv
Confusing relationship between `has_headers` and `set_headers`
What version of the csv crate are you using?
csv = "1.1.6"
Briefly describe the question, bug or feature request.
The header line is parsed like the rest of the lines, when it should not be parsed at all. This results in a failure when the column names themselves are deserialized and validated with serde as if they were data.
Include a complete program demonstrating a problem.
I need to use the following workaround to make it work:
use csv::{ReaderBuilder, Trim};
use serde::{Deserialize, Serialize};

// MyFieldType is our custom field type (definition omitted here).
pub fn my_deser<'de, D>(deserializer: D) -> Result<MyFieldType, D::Error>
where
    D: serde::Deserializer<'de>,
{
    let raw: String = serde::Deserialize::deserialize(deserializer)?;
    println!("{}", &raw);
    Ok(MyFieldType::create_instance())
}
#[derive(Serialize, Deserialize, Default, Debug)]
struct MyEntry {
    #[serde(deserialize_with = "my_deser", alias = "aaaa")]
    meaningful_name: MyFieldType,
}
fn parse_csv(bytes_reader: &[u8]) -> anyhow::Result<Vec<MyEntry>> {
    // Skip the first line containing the headers to work around the CSV reader
    // failing to ignore the first line despite has_headers(true).
    let first_newline = bytes_reader.iter().position(|&c| c == b'\n').unwrap();
    let bytes_reader: &[u8] = &bytes_reader[first_newline..];
    let mut reader = ReaderBuilder::new()
        .has_headers(true)
        .flexible(true)
        .delimiter(b',')
        .quoting(true)
        .quote(b'"')
        .trim(Trim::All)
        .from_reader(bytes_reader);
    reader.set_headers(csv::StringRecord::from(
        &[
            // Note we can't use a meaningful name here because this column name
            // is parsed as a value.
            "aaaa",
        ][..],
    ));
    reader
        .into_deserialize()
        .map(|e| match e {
            Ok(entry) => Ok(entry),
            Err(err) => Err(anyhow::anyhow!("Failed deserializing row: {:#?}", err)),
        })
        .collect::<Result<Vec<MyEntry>, _>>()
}
This is the CSV:
"name"
"000000"
It does not matter whether `has_headers` is set to `true` or `false`; it seems broken either way.
Also, note that if I'm not manually skipping the first line, `"name"` is also parsed despite using `set_headers()` and despite `has_headers(true)`. That seems like a different issue?
Please provide a complete program, along with the input, that I can compile and run to reproduce your problem.
The example below shows the issue. Notice that parsing generates two items instead of just one, despite `has_headers(true)`.
The question is what should happen when both `has_headers` and `set_headers` are called. Currently the first line of the CSV is parsed as content in this case, but I would argue that it should be ignored instead:
- `has_headers` should mean that the first line of the CSV contains headers and is NOT data.
- `set_headers` should set or overwrite the names of the columns, without any implication about the first line of the CSV.
Two examples of bogus headers we're dealing with, where this behavior would help:
000000 27.03.2022 03:05:01
480C43 T-057 A332 Airbus A330-200MRTT
"","","","","","","","","","","","","","",""
"480c43","T-057","Airbus","","A330-243 MRTT","A330","","","","Nato Multinational Mrtt Fleet","Multi","MMF","","",""
Cargo.toml:
[package]
name = "issue"
version = "0.1.0"
edition = "2021"
[dependencies]
csv = "1.1.6"
serde = { version = "1.0", features = ["derive"] }
anyhow = "1.0.42"
main.rs:
use csv::ReaderBuilder;
use serde::Deserialize;

#[derive(Clone, Debug, Deserialize, PartialEq)]
pub struct AircraftMetadata {
    pub icao: String,
    pub tail: String,
    pub type_designator: String,
    pub model: Vec<String>,
}
fn parse_mictronics_aircraft_database(bytes_reader: &[u8]) -> anyhow::Result<Vec<AircraftMetadata>> {
    let mut reader = ReaderBuilder::new()
        .has_headers(true)
        .flexible(true)
        .delimiter(b'\t')
        .from_reader(bytes_reader);
    reader.set_headers(csv::StringRecord::from(
        &["icao", "tail", "type_designator", "model"][..],
    ));
    reader
        .into_deserialize()
        .map(|e| match e {
            Ok(entry) => Ok(entry),
            Err(err) => Err(anyhow::anyhow!("Failed deserializing row: {:#?}", err)),
        })
        .collect::<Result<Vec<AircraftMetadata>, _>>()
}
fn main() {
    // The input is tab-delimited; tabs are written as \t escapes here.
    let csv = "000000\t27.03.2022\t03:05:01
480C43\tT-057\tA332\tAirbus\tA330-200MRTT";
    let actual = parse_mictronics_aircraft_database(csv.as_bytes()).unwrap();
    let expected = vec![
        AircraftMetadata {
            icao: "000000".to_string(),
            tail: "27.03.2022".to_string(),
            type_designator: "03:05:01".to_string(),
            model: vec![],
        },
        AircraftMetadata {
            icao: "480C43".to_string(),
            tail: "T-057".to_string(),
            type_designator: "A332".to_string(),
            model: vec!["Airbus".to_string(), "A330-200MRTT".to_string()],
        },
    ];
    assert_eq!(actual.len(), expected.len());
    actual.into_iter().zip(expected).for_each(|(a, b)| {
        assert_eq!(a, b);
    });
}
I can't make heads or tails of the problem you're facing, sorry. The program you've given me runs just fine. So what's the issue here exactly?
Yes, please remove the first element of `let expected = vec![...]` to see how it should work but currently fails.
Ah I see. Yeah, this is documented behavior. As the docs for `set_headers` state: "Any automatic detection of headers is disabled."

The issue here is that your CSV file does have headers. You just don't want them. You want to set your own. So you should set `has_headers(true)` and then read and throw away the headers found in the file. Like this:
let mut reader = ReaderBuilder::new()
    .has_headers(true)
    .flexible(true)
    .delimiter(b'\t')
    .from_reader(bytes_reader);
// Read and throw away the headers:
reader.headers()?;
reader.set_headers(csv::StringRecord::from(
    &["icao", "tail", "type_designator", "model"][..],
));
Then your program works; with the first record omitted from `expected`, the assertions pass.
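For concreteness, here is a sketch of how the `expected` initializer in `main` would look under that change (assuming the header-discarding snippet above): only the second input line is deserialized as data.

// Sketch only: with the first line consumed by reader.headers()?, only the
// second input line remains as data, so `expected` shrinks to one record.
let expected = vec![AircraftMetadata {
    icao: "480C43".to_string(),
    tail: "T-057".to_string(),
    type_designator: "A332".to_string(),
    model: vec!["Airbus".to_string(), "A330-200MRTT".to_string()],
}];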
Thinking on this a bit more, I think I understand the issue here. Namely, that if `has_headers(true)` is set, then the first record should always be parsed as headers even if `set_headers` has been called. Setting aside the fact that the existing behavior is documented and changing it would be a breaking change, the main issue with it is that `set_headers` is meant to override header detection. So if `set_headers` didn't prevent the reader from reading headers, then what is it supposed to do with the headers once it reads them? I think the implication of your request here is that it reads them and then throws them away. I think it's better if that's explicit, which is what the current API requires you to do. And it's what the code I gave above does.
Maybe this deserves to be an example in the docs.
Thanks for taking the time to look into this!
Yes, the implication of my request is that it skips the first row when both `has_headers(true)` and `set_headers()` are called.

In the example above, when `has_headers(true)` and `set_headers(...)` are called, both rows are parsed as data. But one gets an error if `has_headers(false)` is called instead of `has_headers(true)`. The behavior is inconsistent, since in both cases the first row is read as a data row, but in one case it succeeds and in the other it fails?
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Failed deserializing row: Error(
    Deserialize {
        pos: Some(
            Position {
                byte: 0,
                line: 1,
                record: 0,
            },
        ),
        err: DeserializeError {
            field: None,
            kind: Message(
                "invalid length 3, expected struct AircraftMetadata with 4 elements",
            ),
        },
    },
)', src/main.rs:34:69
Because your header row doesn't fit into your serde type. There is no inconsistency here. In one case, you're trying to deserialize it into your type. In the other case, you're just reading it as a plain string record.
Why doesn't the code I gave you work for you?
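If it helps, here is a minimal, self-contained sketch of the distinction described above, using made-up data rather than the input from this issue: the same first row is either fed to serde as data, or merely read back as a plain StringRecord that cannot "fail to fit" a struct.

use csv::ReaderBuilder;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Row {
    icao: String,
    tail: String,
}

fn main() -> Result<(), csv::Error> {
    let data = "icao\n4,T";

    // Without headers, the one-field first row is handed to serde and fails,
    // because `Row` expects two fields.
    let mut rdr = ReaderBuilder::new()
        .has_headers(false)
        .flexible(true)
        .from_reader(data.as_bytes());
    assert!(rdr.deserialize::<Row>().next().unwrap().is_err());

    // With headers, the same first row is only read as a plain StringRecord;
    // serde never sees it, so there is nothing to "fit".
    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        .flexible(true)
        .from_reader(data.as_bytes());
    assert_eq!(rdr.headers()?.len(), 1);
    Ok(())
}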
The issue I see boils down to the weirdness between `has_headers` and `set_headers`.

For example:
use csv::ReaderBuilder;
use serde::Deserialize;

#[derive(Clone, Debug, Deserialize, PartialEq)]
pub struct AircraftMetadata {
    pub icao: String,
    pub tail: String,
}

fn parse_mictronics_aircraft_database(has: bool, set: bool) -> Result<Vec<AircraftMetadata>, csv::Error> {
    println!();
    print!("has_header({})", has);
    let csv = "4,T";
    let mut reader = ReaderBuilder::new()
        .has_headers(has)
        .flexible(true)
        .delimiter(b',')
        .from_reader(csv.as_bytes());
    if set {
        reader.set_headers(csv::StringRecord::from(
            &["icao", "tail"][..],
        ));
        print!(" set_headers(...)");
    }
    println!();
    reader
        .into_deserialize()
        .collect::<Result<Vec<AircraftMetadata>, _>>()
}

fn main() {
    println!("{:#?}", parse_mictronics_aircraft_database(false, false)); // 1 entry
    println!("{:#?}", parse_mictronics_aircraft_database(true, false)); // 0 entries
    println!("{:#?}", parse_mictronics_aircraft_database(false, true)); // 2 entries
    println!("{:#?}", parse_mictronics_aircraft_database(true, true)); // 1 entry
}
The output makes sense for the first two results (without `set_headers`), but it can be confusing once `set_headers(...)` is involved.

Notice the third result, `has_header(false) set_headers(...)`, which has two entries! It seems the csv library shoves the forced headers at the start of the data, and thus when `has_headers(false)` they are parsed as values.
$ cargo run 2>/dev/null
has_header(false)
Ok(
    [
        AircraftMetadata {
            icao: "4",
            tail: "T",
        },
    ],
)

has_header(true)
Ok(
    [],
)

has_header(false) set_headers(...)
Ok(
    [
        AircraftMetadata {
            icao: "icao",
            tail: "tail",
        },
        AircraftMetadata {
            icao: "4",
            tail: "T",
        },
    ],
)

has_header(true) set_headers(...)
Ok(
    [
        AircraftMetadata {
            icao: "4",
            tail: "T",
        },
    ],
)
Mentioning the `_ = reader.headers()` trick in the `set_headers` documentation would be good!

Maybe assert in `set_headers` that `has_headers` is `true`?
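For illustration only, here is a toy, self-contained sketch of what such an assertion could look like; the wrapper type and its fields below are made up and are not rust-csv's actual implementation.

// Purely illustrative toy type -- not rust-csv's internals.
use csv::StringRecord;

struct ToyReader {
    has_headers: bool,
    headers: Option<StringRecord>,
}

impl ToyReader {
    fn set_headers(&mut self, headers: StringRecord) {
        // The suggestion above: fail loudly when the combination cannot mean
        // anything sensible.
        assert!(
            self.has_headers,
            "set_headers() only makes sense together with has_headers(true)"
        );
        self.headers = Some(headers);
    }
}

fn main() {
    let mut toy = ToyReader { has_headers: true, headers: None };
    toy.set_headers(StringRecord::from(&["icao", "tail"][..]));
    assert_eq!(toy.headers.unwrap().len(), 2);
    // With has_headers == false, the same call would panic, making the
    // misuse explicit instead of silently confusing.
}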
The output when `let csv = "4";` (one column instead of two):
has_header(false)
Err(
    Error(
        Deserialize {
            pos: Some(
                Position {
                    byte: 0,
                    line: 1,
                    record: 0,
                },
            ),
            err: DeserializeError {
                field: None,
                kind: Message(
                    "invalid length 1, expected struct AircraftMetadata with 2 elements",
                ),
            },
        },
    ),
)

has_header(true)
Ok(
    [],
)

has_header(false) set_headers()
Err(
    Error(
        Deserialize {
            pos: Some(
                Position {
                    byte: 0,
                    line: 1,
                    record: 0,
                },
            ),
            err: DeserializeError {
                field: None,
                kind: Message(
                    "invalid length 1, expected struct AircraftMetadata with 2 elements",
                ),
            },
        },
    ),
)

has_header(true) set_headers()
Err(
    Error(
        Deserialize {
            pos: Some(
                Position {
                    byte: 0,
                    line: 1,
                    record: 0,
                },
            ),
            err: DeserializeError {
                field: None,
                kind: UnexpectedEndOfRow,
            },
        },
    ),
)
The errors make sense except the last one, "UnexpectedEndOfRow". I mention it as a potential source of confusion.
Maybe assert in `set_headers` that `has_headers` is `true`?
This is a breaking change.
I agree that it is weird to call `set_headers` when `has_headers` is set to `false`. The behavior is consistent though. The reader is yielding any detected header row because the caller might call the `headers` method, which works regardless of whether `has_headers` is enabled or not.

I think the fundamental confusion here is that you're trying to form a connection between `has_headers` and `set_headers`. But there isn't one. The connection is between `set_headers` and `headers`.

But yeah, if you're setting `has_headers` to `false`, then it is exceptionally weird to be calling either `set_headers` or `headers`. The code I gave above does call `headers`, but this is equivalent to just calling `read_record` once and throwing away the result. In effect, you're skipping over the first record.
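To make that `set_headers`/`headers` connection concrete, here is a minimal sketch with made-up data (not tied to this issue's input): `set_headers` only controls what `headers` returns and says nothing about whether the first input row is data.

use csv::{ReaderBuilder, StringRecord};

fn main() -> Result<(), csv::Error> {
    let data = "1,2\n3,4";
    let mut rdr = ReaderBuilder::new()
        .has_headers(false)
        .from_reader(data.as_bytes());
    rdr.set_headers(StringRecord::from(&["a", "b"][..]));

    // headers() now yields the record we set, without consuming any input.
    assert_eq!(rdr.headers()?, vec!["a", "b"]);

    // With has_headers(false), every input row -- including the first --
    // still comes back as data.
    let mut rec = StringRecord::new();
    let mut rows = 0;
    while rdr.read_record(&mut rec)? {
        rows += 1;
    }
    assert_eq!(rows, 2);
    Ok(())
}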
I'll mark this as a doc bug.
Thanks for considering improving the documentation.

When using custom deserialization for custom field types, it can be super confusing to see the headers set with `set_headers` being deserialized as if they were a row.

How bad would it be to introduce a breaking change? This usage does not make any sense at all. :) Users can select any release they want if they prefer the past functionality.
This confusion does not come anywhere remotely close to being bad enough to justify a breaking change or a csv 2.0 release. Sorry.

You've made it clear this behavior is confusing to you and that you don't like it. I understand. Let's improve the docs and move on. If and when there's ever a csv 2, we can try to revisit this behavior, perhaps with something as simple as an assert.
This usage does not make any sense at all.
It makes sense just fine if you consider that using `set_headers` when `has_headers` is false is "just" an unchecked illegal use of the API. Don't use `set_headers` when `has_headers` is false from now on, and it looks like you should be in the clear.
I'm just curious how you think.
Ok, good luck!