Missings.jl icon indicating copy to clipboard operation
Missings.jl copied to clipboard

parse with Missing

Open bkamins opened this issue 7 years ago • 23 comments

I would like to put under consideration if we want to add methods to parse that accept Union{T, Missing} argument (or create a similar method). This would be something like tryparse but returning missing on failure.

I would think of something like:

  • parse(Union{Int, Missing}, "123") == 123;
  • parse(Union{Int, Missing}, "123a") == missing.

bkamins avatar Dec 01 '17 22:12 bkamins

Actually with https://github.com/JuliaLang/julia/pull/23642 tryparse will have return type Union{Some{T}, Void} (or maybe just Union{T, Void} depending on the outcome of discussions). I'm not sure we want to allow parse(Union{T, Missing}, ...) in addition to that. There have been lots of discussions about this kind of syntax around the introduction of tryparse in Base. One issue is that parse(Union{T, Missing}, x) could be syntax to parse "missing" as missing, numbers as Int, and throw and error in other cases.

nalimilan avatar Dec 02 '17 13:12 nalimilan

Close this issue then or keep for a reference?

bkamins avatar Dec 02 '17 13:12 bkamins

I'd say it depends on whether you think this syntax is needed or not.

nalimilan avatar Dec 02 '17 13:12 nalimilan

I believe we need a function that returns missing if parsing fails. It does not have to be prase, actually tryparse is more natural, but now it returns Nullable (which behavior I earlier thought would be retained as computer scientist missingness).

So let me comment on a use case and you can decide which way to go (I have much less understanding of the whole ecosystem context). A typical scenario with DataFrames is to read some data which is messy but is supposed to contain integers. Assume that we already have it as ["1", "2", "?", "", "3", "unknown"].

A pure approach would be to handle the missing data carefully, but a practical workflow (and this is how R works) is to parse it as Vector{Int} and make all that is not an integer a missing.

Computer scientist now can do:

julia> tryparse.(Int, ["1", "2", "?", "", "3", "unknown"])
6-element Array{Nullable{Int64},1}:
 1
 2
 #NULL
 #NULL
 3
 #NULL

But data scientist has to do (or there is some simpler way?):

julia> [isnull(x) ? missing : get(x) for x in tryparse.(Int, ["1", "2", "?", "", "3", "unknown"])]
6-element Array{Any,1}:
 1
 2
  missing
  missing
 3
  missing

which not only is messy but also has wrong type (Any) of the produced array.

I would like to have this fixed (I understand that solving Any problem is an additional topic discussed in https://github.com/JuliaLang/julia/pull/24332).

bkamins avatar Dec 02 '17 14:12 bkamins

I see your point. I'm not sure we want to follow R in all aspects, in particular we have the ability to distinguish an invalid value from a missing value, so it wouldn't be absurd to be stricter, and require an explicit operation to convert an unparsable value to a missing value. That could be done quite easily using replace (see https://github.com/JuliaLang/julia/pull/22324). Adding a function as you suggest also makes sense, but I'd rather wait for the design of tryparse to settle first (JuliaLang/julia#23642), we can easily extend it later.

nalimilan avatar Dec 02 '17 14:12 nalimilan

Thanks.

bkamins avatar Dec 02 '17 14:12 bkamins

I also don't think parse() should return missing for invalid inputs.

But somewhat more conservative thing would be to allow

parse(Int, missing) = missing
#= or =# 
parse(Union{Int, Missing}, missing) = missing

I prefer the first alternative as the 2nd might imply: parse(Union{Int, Missing}, "missing") == missing.

Then it would be possible to parse.(Int, ::AbstractArray{Union{String, Missing}})

alyst avatar Dec 13 '17 14:12 alyst

The approach taken in CSV.jl, which @shashi and I have discussed generalizing to parsing functions in general is to have parse(Union{Int, Missing}, str) => Union{Int, Missing} and parse(Int, "missing") => throw(ParsingException()).

quinnj avatar Dec 13 '17 20:12 quinnj

@quinnj Do you also throw ParsingException() on parse(Int, ::Missing)?

alyst avatar Dec 13 '17 21:12 alyst

I don't currently have a use-case for that in CSV as there's always guaranteed to be an input, though the input may be an empty string.

quinnj avatar Dec 13 '17 22:12 quinnj

So you are using missing to indicate that parsing failed? Basically what tryparse does with Nullable() (and will soon do with nothing)? I think it'd prefer using parse(Union{Int, Void}, ...) for that, to distinguish clearly between "invalid" and "missing".

nalimilan avatar Dec 14 '17 14:12 nalimilan

@nalimilan was that in response to my comment or @alyst ?

quinnj avatar Dec 14 '17 15:12 quinnj

To yours.

nalimilan avatar Dec 14 '17 15:12 nalimilan

No, the user has to provide, in the case of CSV, a "null" string and if the input matches this "null" string, then missing is returned. In the case of failure, e.g. non-numeric digits when parsing a number, an error is always thrown. That leaves try-catch behavior up to the caller, who would know what to do in case of failure anyway. I supposed you could keep a tryparse function that returned nothing instead of throwing.

quinnj avatar Dec 14 '17 16:12 quinnj

A user ran into this with Dates recently, where Dates throws an error on parsing missings.

https://github.com/JuliaLang/julia/issues/28570

I would recommend a keyword argument, but if a constructor relies on parsing, we have to trust that they forward keyword arguments as well.

pdeffebach avatar Aug 12 '18 15:08 pdeffebach

Since Missing now lives in Base, this feature proposal should be discussed there.

nalimilan avatar Aug 15 '18 17:08 nalimilan

@nalimilan Right. Has a proposal / discussion started over there? Let's link to it if so.

xpe avatar Feb 09 '19 18:02 xpe

Actually I would close it (although I know I have raised the issue in the first place).

The current meta-understanding of types in Base is the following AFAICT:

  • nothing: the value does not exist
  • missing: the value exists, but we do not know it

I would argue that tryparse follows this rule and returns nothing on error - as the proper value does not exist. And I would keep this approach for consistency. Of course this is a bit debatable as the distinction here is thin.

But maybe there are cases when would be convenient to return missing - if yes we could keep discussing it.

bkamins avatar Feb 09 '19 19:02 bkamins

Yes, the case for parsing to Union{T,Missing} seems to be that you want a sentinel value (e.g. "missing", "" or "NA") to be converted to missing. But invalid values should not be converted to missing. I imagine we could have something like parse(Union{T, Missing}, x, missing="missing").

I'm not aware of any issue in Base about this currently.

nalimilan avatar Feb 10 '19 12:02 nalimilan

Yes - the sentinel approach could make sense. It could be also more general, e.g. allowing a Regex or a predicate.

bkamins avatar Feb 10 '19 12:02 bkamins

Having a parse (or tryparse) function which can output missing will greatly improve workflow with DataFrame. See https://github.com/JuliaData/DataFrames.jl/issues/1854

s-celles avatar Jun 20 '19 19:06 s-celles

@scls19fr - as you can see from the discussion above the issue is not that simple. What syntax would you think would be the most convenient while consistent (the consistency issue was a major concern in the past).

bkamins avatar Jun 20 '19 19:06 bkamins

I think one should output nothing when using parse/tryparse on missing/nothing data. Since it's tryparse is already outputting nothing on invalid strings. Especially since nothing is suppose to represent "does not exist", so parsing missing/nothing should give a result that "does not exist".

pepelovesvim avatar Sep 18 '20 07:09 pepelovesvim