rust-csv icon indicating copy to clipboard operation
rust-csv copied to clipboard

Deserializing bool fields fails if the field is not exactly "true" or "false".

Open Frodo45127 opened this issue 7 years ago • 12 comments
trafficstars

What version of the csv crate are you using?

1.0.2

Briefly describe the question, bug or feature request.

When deserializing a TSV file, if a boolean column's values are not exactly true or false, the deserialization fails. Some editors, like Google Sheets use "TRUE" and "FALSE", or "True" and "False" instead, making files edited with them throw an error when deserializing.

Include a complete program demonstrating a problem.

Here is a working example of the problem. The works.tsv file works, but the fails.tsv throws an error in each row. csv-bug.tar.gz

What is the expected or desired behavior of the code above?

To parse correctly TRUE, True, FALSE and False as booleans.

Frodo45127 avatar Oct 02 '18 15:10 Frodo45127

Can you provide a specification for which values should be interpreted as booleans? What about TrUe? Or falsE? Or 1? Or 2?

I might be able to get on board with a case insensitive comparison, but the actual scope of what is a boolean is interesting. It kind of seems like 1 and 0 should also be interpretable as booleans.

Note that you can work around this pretty easily with a custom deserialize_with function.

BurntSushi avatar Oct 02 '18 16:10 BurntSushi

I haven't been programming for too long, so this list may be a bit incomplete, but I've seen the following values interpreted as booleans:

  • true/false, with any combination of lowercase/uppercase, like trUe or fAlSe.
  • 0/1, being 0 = false and 1 = true.
  • yes/no, with any combination of lowercase/uppercase.
  • t/f, being t = true and f = false.
  • y/n, being y = yes/true and n = no/false.

Those are more or less all the interpretations I've seen in the wild but the first two were by far the most common. I've barely seen the other ones being used. So an specification of What can be considered an interpretable boolean would be:

  • true/false: or any combination of lowercase/uppercase, like trUe or fAlSe.
  • 0/1: being 0 = false and 1 = true.

Frodo45127 avatar Oct 02 '18 16:10 Frodo45127

I have this very same issue but in the other direction: I need to write CSV files that I can load into a MySQL database and boolean fields need to be represented as 0/1 for this to work. But false/true are hard-coded in the writer. Is there a convenient way to customize this?

lunikon avatar Mar 31 '20 11:03 lunikon

Yes. You can use serde's attributes to specify a different serialization function. Or use a custom type.

BurntSushi avatar Mar 31 '20 11:03 BurntSushi

Neither adding attributes to every boolean field nor using a custom type for something as basic as a boolean sounds too appealing. Customizing the writer with the desired string representation would be ideal for me. But I'll have another go at the Serde docs. Thanks a lot!

lunikon avatar Mar 31 '20 11:03 lunikon

Was kind of hoping this was built in... I think this probably solves the original question, though?


use serde::{Deserialize, Deserializer, Serialize};
use serde::de::{self, Unexpected};

/// Converts a string to a boolean based on truthy and falsy values
fn bool_from_str<'de, D>(deserializer: D) -> Result<bool, D::Error>
where
    D: Deserializer<'de>,
{
    match String::deserialize(deserializer)?.to_lowercase().as_str() {
        "t" | "true" | "1" | "on" | "y" | "yes" => Ok(true),
        "f" | "false" | "0" | "off" | "n" | "no" => Ok(false),
        other => Err(de::Error::invalid_value(
            Unexpected::Str(other),
            &"Must be truthy (t, true, 1, on, y, yes) or falsey (f, false, 0, off, n, no)",
        )),
    }
}

#[derive(Clone, Debug, Deserialize, Serialize)]
struct Foo {
    /// Some field
    #[serde(deserialize_with = "bool_from_str")]
    bar: bool
}

Original version: Playground

brianbruggeman avatar Dec 30 '20 22:12 brianbruggeman

@brianbruggeman As a minor refinement, I'd consider changing String::deserialize() to Cow::<str>::deserialize() to avoid allocations when possible (which should be in most cases). Playground.

Edit: Cow<str> doesn't help when used this way because its deserialize() always allocates. Serde does have code that special-cases Cow, but it is part of the proc-macro, so Cow must be used as a field type.

hniksic avatar Mar 03 '22 20:03 hniksic

@brianbruggeman As a minor refinement, I'd consider changing String::deserialize() to Cow::<str>::deserialize() to avoid allocations when possible (which should be in most cases). Playground.

@hniksic Unfortunately, that doesn't change anything. A Cow is always deserialized into its Owned variant (https://docs.rs/serde/latest/src/serde/de/impls.rs.html#1823-1835), which for Cow<'_, str> means a String. What you likely want is a transient str, but for this you must use a Visitor. The visit_str function can then return the bool.

jonasbb avatar Mar 03 '22 21:03 jonasbb

@jonasbb Thanks for the correction. Interestingly, it appears to be possible to force borrowing using #[serde(borrow)], but that doesn't look like it generalizes to this case.

hniksic avatar Mar 03 '22 22:03 hniksic

It may not really matter. The to_lowercase is problematic as it will allocate a String. I suppose it is possible to move around that with more optional truthy/falsey values, but I suspect that may make maintenance more unwieldy (true, True, tRue, tRUE, ...).

brianbruggeman avatar Mar 03 '22 22:03 brianbruggeman

To redeem myself, here is the visitor-based version. It calls to_lowercase() only in the error path which is (IMHO) a reasonable compromise between efficiency and maintainability.

use serde::de::{self, Unexpected, Visitor};
use serde::{Deserialize, Deserializer, Serialize};

/// Converts a string to a boolean based on truthy and falsy values.
///
/// Designed to be used as #[serde(deserialize_with = "bool_from_str")]
fn bool_from_str<'de, D>(deserializer: D) -> Result<bool, D::Error>
where
    D: Deserializer<'de>,
{
    struct BoolVisitor;
    impl Visitor<'_> for BoolVisitor {
        type Value = bool;
        fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
            write!(
                formatter,
                "truthy (t, true, 1, on, y, yes) or falsey (f, false, 0, off, n, no) string"
            )
        }
        fn visit_str<E: de::Error>(self, s: &str) -> Result<bool, E> {
            match s {
                "t" | "T" | "true" | "True" | "1" | "on" | "On" | "y" | "Y" | "yes" | "Yes" => {
                    Ok(true)
                }
                "f" | "F" | "false" | "False" | "0" | "off" | "Off" | "n" | "N" | "no" | "No" => {
                    Ok(false)
                }
                other => {
                    // handle weird mixed-case spellings like tRue or nO
                    match other.to_lowercase().as_str() {
                        "true" | "on" | "yes" => Ok(true),
                        "false" | "off" | "no" => Ok(false),
                        other => Err(de::Error::invalid_value(Unexpected::Str(other), &self)),
                    }
                }
            }
        }
    }

    deserializer.deserialize_str(BoolVisitor)
}

Playground

hniksic avatar Mar 03 '22 22:03 hniksic

If anyone is curious, I've also recently added a tiny crate, serde-this-or-that which hopefully should handle converting numbers and strings (including the edge cases above) to a bool type. It is available on crates.io as well.

rnag avatar Apr 19 '22 17:04 rnag