CSVFiles.jl

Problem with nastring for non-numeric columns

Open bkamins opened this issue 7 years ago • 4 comments

Because the default nastring is NA, there is the following problem:

  1. take a data structure that has e.g. a String column with missing data in it;
  2. save it to disk using default parameters; missings get converted to NA on disk;
  3. load it back and you have the "NA" string where you earlier had missings.

The same problem occurs with e.g. Char data.
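A minimal sketch of that round trip, written in Python with only the standard library rather than with CSVFiles.jl itself. `None` stands in for `missing`, and `NA_STRING`, `write_rows` and `read_rows` are hypothetical names used only for this illustration:

```python
import csv
import io

# Assumed marker, mirroring the default nastring discussed in this issue.
NA_STRING = "NA"


def write_rows(rows, na=NA_STRING):
    """Serialize rows to CSV text, writing None (stand-in for missing) as `na`."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow([na if value is None else value for value in row])
    return buf.getvalue()


def read_rows(text):
    """Read CSV text back without any missing-value detection."""
    return list(csv.reader(io.StringIO(text)))


# One string column holding a real value, a missing value, and a literal "NA".
original = [["name"], ["Anna"], [None], ["NA"]]
text = write_rows(original)
restored = read_rows(text)
```

After the round trip, the row that held a missing value and the row that held the literal string are byte-for-byte identical on disk, so no reader can tell them apart: it either returns "NA" for both or treats both as missing.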

While NA is a sensible default for numeric columns, it is a bit confusing for non-numeric columns (and can actually lead to wrong results, as it is entirely possible to have an "NA" string in the data).

I think that it would be best to have an empty string for missings in non-numeric data.
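As a rough sketch of what such a writer could look like (again in standard-library Python, not the CSVFiles.jl implementation), the marker can be chosen per column. `missing_marker` and `write_table` are hypothetical helpers, and `None` stands in for `missing`:

```python
import csv
import io


def missing_marker(column):
    """Pick the on-disk marker: "NA" for numeric columns, "" for everything else."""
    present = [v for v in column if v is not None]
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return "NA" if numeric else ""


def write_table(columns):
    """Write column-oriented data as CSV text, one missing marker per column."""
    markers = [missing_marker(col) for col in columns]
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in zip(*columns):
        writer.writerow([m if v is None else v for v, m in zip(row, markers)])
    return buf.getvalue()


ages = [31, None, 47]            # numeric column: missing is written as NA
names = ["Anna", None, "NA"]     # string column: missing is written as an empty field
out = write_table([ages, names])
```

With this choice, the literal "NA" in the string column no longer collides with the missing marker, although an empty string and a missing value in that column still look the same on disk.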

bkamins, Sep 10 '18 07:09

> I think that it would be best to have an empty string for missings in non-numeric data.

That would be for writing files, right? Do you think we need to also change something about reading?

davidanthoff, Sep 10 '18 20:09

Frankly, for reading I would never create a missing when reading a String; I would leave the value as is and let the user decide what to do. It is perfectly possible that the "NA" string means something and is present in a CSV file. E.g. in Polish this is a valid word.

A second-best solution would be to treat an empty string as missing (although I can imagine situations where "" might mean something, e.g. it is perfectly valid to have the vector ["", missing] in Julia, but at least it is not that problematic).

However, I realize that all of this is breaking, so please decide what you think is best in the context of the whole Queryverse.

bkamins, Sep 10 '18 20:09

Well, now is the time to break things! I haven't officially released the Julia 1.0 version, and I'm willing to break things with that transition, and then hopefully not again for a long time (until we see Julia 2.0).

My own instinct would be to return a missing value (NA) only in the following situation: a column is a string column, uses quotation marks throughout, and has some rows where NA appears without quotes. For the other cases, I agree with you: if NA appears inside quotes, there can be no question that it should just be read as "NA", and if a column generally doesn't use quotes, then it is probably also better to return it as the "NA" string...
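To make that rule concrete, here is a small sketch in standard-library Python. It is not how TextParse.jl works; `split_fields` and `parse_column` are hypothetical helpers, the field splitter is deliberately simplified (no embedded newlines), and `None` stands in for `missing`:

```python
def split_fields(line, delim=",", quote='"'):
    """Split one CSV line into (text, was_quoted) pairs.

    Simplified on purpose: doubled quotes inside quoted fields are
    unescaped, but newlines embedded in fields are not supported.
    """
    fields = []
    i, n = 0, len(line)
    while True:
        if i < n and line[i] == quote:
            # Quoted field: read up to the closing quote, turning "" into ".
            i += 1
            chars = []
            while i < n:
                if line[i] == quote:
                    if i + 1 < n and line[i + 1] == quote:
                        chars.append(quote)
                        i += 2
                        continue
                    i += 1  # step over the closing quote
                    break
                chars.append(line[i])
                i += 1
            fields.append(("".join(chars), True))
            # Ignore anything between the closing quote and the next delimiter.
            while i < n and line[i] != delim:
                i += 1
        else:
            # Unquoted field: everything up to the next delimiter.
            j = line.find(delim, i)
            if j == -1:
                j = n
            fields.append((line[i:j], False))
            i = j
        if i >= n:
            break
        i += 1  # step over the delimiter
    return fields


def parse_column(lines, na="NA"):
    """Apply the proposed rule to the body of a single-column file (no header)."""
    cells = [split_fields(line)[0] for line in lines]
    # "Uses quotation marks throughout": every cell other than a bare NA is quoted.
    # A column made only of bare NAs is treated as plain strings (an assumption).
    others = [quoted for text, quoted in cells if not (text == na and not quoted)]
    column_is_quoted = bool(others) and all(others)
    return [
        None if (column_is_quoted and not quoted and text == na) else text
        for text, quoted in cells
    ]


quoted_column = parse_column(['"Anna"', 'NA', '"NA"'])
bare_column = parse_column(['Anna', 'NA', 'na'])
```

In the fully quoted column only the bare NA becomes missing and the quoted "NA" stays a string; in the unquoted column every value, including NA, is returned as a string.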

All of the reading logic is actually handled in TextParse.jl, so I'll have to figure out what the defaults there are...

davidanthoff, Sep 10 '18 21:09

Good point: if everything is quoted and only NA is unquoted, this is a clear way to distinguish it. This is what write.csv in R does (although then read.csv reads back both of them as missing 😄).

bkamins, Sep 10 '18 21:09