Missings.jl
Missings.jl copied to clipboard
parse with Missing
I would like to put under consideration if we want to add methods to parse
that accept Union{T, Missing}
argument (or create a similar method). This would be something like tryparse
but returning missing
on failure.
I would think of something like:
-
parse(Union{Int, Missing}, "123") == 123
; -
parse(Union{Int, Missing}, "123a") == missing
.
Actually with https://github.com/JuliaLang/julia/pull/23642 tryparse
will have return type Union{Some{T}, Void}
(or maybe just Union{T, Void}
depending on the outcome of discussions). I'm not sure we want to allow parse(Union{T, Missing}, ...)
in addition to that. There have been lots of discussions about this kind of syntax around the introduction of tryparse
in Base. One issue is that parse(Union{T, Missing}, x)
could be syntax to parse "missing"
as missing
, numbers as Int
, and throw and error in other cases.
Close this issue then or keep for a reference?
I'd say it depends on whether you think this syntax is needed or not.
I believe we need a function that returns missing
if parsing fails. It does not have to be prase
, actually tryparse
is more natural, but now it returns Nullable
(which behavior I earlier thought would be retained as computer scientist missingness).
So let me comment on a use case and you can decide which way to go (I have much less understanding of the whole ecosystem context). A typical scenario with DataFrames
is to read some data which is messy but is supposed to contain integers. Assume that we already have it as ["1", "2", "?", "", "3", "unknown"]
.
A pure approach would be to handle the missing data carefully, but a practical workflow (and this is how R works) is to parse it as Vector{Int}
and make all that is not an integer a missing
.
Computer scientist now can do:
julia> tryparse.(Int, ["1", "2", "?", "", "3", "unknown"])
6-element Array{Nullable{Int64},1}:
1
2
#NULL
#NULL
3
#NULL
But data scientist has to do (or there is some simpler way?):
julia> [isnull(x) ? missing : get(x) for x in tryparse.(Int, ["1", "2", "?", "", "3", "unknown"])]
6-element Array{Any,1}:
1
2
missing
missing
3
missing
which not only is messy but also has wrong type (Any
) of the produced array.
I would like to have this fixed (I understand that solving Any
problem is an additional topic discussed in https://github.com/JuliaLang/julia/pull/24332).
I see your point. I'm not sure we want to follow R in all aspects, in particular we have the ability to distinguish an invalid value from a missing value, so it wouldn't be absurd to be stricter, and require an explicit operation to convert an unparsable value to a missing value. That could be done quite easily using replace
(see https://github.com/JuliaLang/julia/pull/22324). Adding a function as you suggest also makes sense, but I'd rather wait for the design of tryparse
to settle first (JuliaLang/julia#23642), we can easily extend it later.
Thanks.
I also don't think parse()
should return missing
for invalid inputs.
But somewhat more conservative thing would be to allow
parse(Int, missing) = missing
#= or =#
parse(Union{Int, Missing}, missing) = missing
I prefer the first alternative as the 2nd might imply: parse(Union{Int, Missing}, "missing") == missing
.
Then it would be possible to parse.(Int, ::AbstractArray{Union{String, Missing}})
The approach taken in CSV.jl
, which @shashi and I have discussed generalizing to parsing functions in general is to have parse(Union{Int, Missing}, str) => Union{Int, Missing}
and parse(Int, "missing") => throw(ParsingException())
.
@quinnj Do you also throw ParsingException()
on parse(Int, ::Missing)
?
I don't currently have a use-case for that in CSV as there's always guaranteed to be an input, though the input may be an empty string.
So you are using missing
to indicate that parsing failed? Basically what tryparse
does with Nullable()
(and will soon do with nothing
)? I think it'd prefer using parse(Union{Int, Void}, ...)
for that, to distinguish clearly between "invalid" and "missing".
@nalimilan was that in response to my comment or @alyst ?
To yours.
No, the user has to provide, in the case of CSV, a "null" string and if the input matches this "null" string, then missing
is returned. In the case of failure, e.g. non-numeric digits when parsing a number, an error is always thrown. That leaves try-catch behavior up to the caller, who would know what to do in case of failure anyway. I supposed you could keep a tryparse
function that returned nothing
instead of throwing.
A user ran into this with Dates
recently, where Dates
throws an error on parsing missings.
https://github.com/JuliaLang/julia/issues/28570
I would recommend a keyword argument, but if a constructor relies on parsing, we have to trust that they forward keyword arguments as well.
Since Missing
now lives in Base, this feature proposal should be discussed there.
@nalimilan Right. Has a proposal / discussion started over there? Let's link to it if so.
Actually I would close it (although I know I have raised the issue in the first place).
The current meta-understanding of types in Base is the following AFAICT:
-
nothing
: the value does not exist -
missing
: the value exists, but we do not know it
I would argue that tryparse
follows this rule and returns nothing
on error - as the proper value does not exist. And I would keep this approach for consistency. Of course this is a bit debatable as the distinction here is thin.
But maybe there are cases when would be convenient to return missing
- if yes we could keep discussing it.
Yes, the case for parsing to Union{T,Missing}
seems to be that you want a sentinel value (e.g. "missing"
, ""
or "NA"
) to be converted to missing
. But invalid values should not be converted to missing
. I imagine we could have something like parse(Union{T, Missing}, x, missing="missing")
.
I'm not aware of any issue in Base about this currently.
Yes - the sentinel approach could make sense. It could be also more general, e.g. allowing a Regex or a predicate.
Having a parse
(or tryparse
) function which can output missing
will greatly improve workflow with DataFrame. See https://github.com/JuliaData/DataFrames.jl/issues/1854
@scls19fr - as you can see from the discussion above the issue is not that simple. What syntax would you think would be the most convenient while consistent (the consistency issue was a major concern in the past).
I think one should output nothing when using parse/tryparse on missing/nothing data. Since it's tryparse is already outputting nothing on invalid strings. Especially since nothing is suppose to represent "does not exist", so parsing missing/nothing should give a result that "does not exist".