CSV.jl
optional argument specifying initializers for missing values
The idea is that something like

    types = [String, Float64, Int]
    missing_values = ["", -999.0, 12]

could be specified, such that the following would be returned:

    "abc",1.0,1
    ,1.0,1    => "",1.0,1
    "def",,1  => "def", -999.0, 1
    "ghi",,   => "ghi", -999.0, 12
This also has the side benefit of making each column's element type T instead of Union{Missing, T}.
Sorry, I'm not following the request here. I don't understand the syntax you're using in the data example. Can you explain a little more what exactly you would pass to CSV.read, what the raw data looks like, and what the final parsed result would be?
The missing values vector would have to correspond to the column types. It may be that missing_values could "force" the types of the columns, or it could be that types must be specified in order to use the missing_values option. I'm not sure missing_values is the best name for the option, but it's reasonable.
CSV.read(arguments ... ;types=[String, Float64, Int], missing_values = [ "MISS", -999.0, 12])
So now we have specified the values to use whenever a missing value is encountered while reading a file.
Here's the example data:

    "string1", 1.0, 1
    "string2", 2.0, 2
    ,3.0,3
    "string4",,4
    "string5",,

And the results:

    "string1", 1.0, 1
    "string2", 2.0, 2
    "MISS", 3.0, 3
    "string4", -999.0, 4
    "string5", -999.0, 12
The other important part of this result is that the types of the columns, if the missing values are completely specified, would never be a union of missing and some other type, but strictly the associated datatype.
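To make the intended result concrete, here is a minimal sketch in plain Julia of the same substitution done after parsing. It assumes the columns have already been read as vectors that allow missing, and it uses only `coalesce` from Base. The helper name `fill_missing` is hypothetical and is not part of CSV.jl; the proposed `missing_values` keyword does not exist either.

```julia
# Hypothetical helper, not part of CSV.jl: replace each column's missing
# entries with that column's default value.
function fill_missing(columns::Tuple, defaults::Tuple)
    length(columns) == length(defaults) ||
        throw(ArgumentError("need exactly one default per column"))
    # coalesce returns its first non-missing argument, so broadcasting it
    # over a column swaps every missing entry for the default.
    return map((col, d) -> coalesce.(col, d), columns, defaults)
end

# The example data as it might look once parsed with missing values.
strings = Union{Missing, String}["string1", "string2", missing, "string4", "string5"]
floats  = Union{Missing, Float64}[1.0, 2.0, 3.0, missing, missing]
ints    = Union{Missing, Int}[1, 2, 3, 4, missing]

filled = fill_missing((strings, floats, ints), ("MISS", -999.0, 12))

# Each filled column now has a concrete element type.
println(eltype.(filled))
```

This allocates a second copy of every column, which is the cost the requested option would avoid by substituting the defaults during parsing.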
As a note to myself, I think this should be possible if we allow a way to have the user pass what sentinels/values should be used for SentinelVector. Might be a little tricky to have a consistent story here for all types, including small integers, where we currently don't use SentinelVectors, but could be possible. Going to mark as an eventual enhancement, but not blocking for the next 0.9 release.
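A rough illustration of the sentinel idea in plain Julia, under the assumption that a column is stored as raw values with one reserved value standing in for "missing". This is only a sketch of the concept and not how SentinelVector is implemented; the names `raw`, `sentinel`, and `user_default` are hypothetical.

```julia
# Conceptual sketch only; this does not use or mirror the SentinelArrays.jl API.
sentinel = typemin(Int)            # reserved value that marks a missing entry
raw = [1, 2, 3, 4, sentinel]       # storage stays a plain Vector{Int}

user_default = 12                  # the value the user asked for in place of missing
# replace compares with isequal and returns a new vector of the same element type.
filled_ints = replace(raw, sentinel => user_default)

println(filled_ints)
println(eltype(filled_ints))
```

The same approach is harder for small integer types: an Int8 column has only 256 possible values, so any reserved sentinel could collide with real data.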