ReadStat.jl icon indicating copy to clipboard operation
ReadStat.jl copied to clipboard

Allow either DataValues or Missings for representing missing data

Open dmbates opened this issue 8 years ago • 6 comments

@davidanthoff I wanted to run this by you before preparing a PR.

I think it should be possible to represent missing data values in the result using either the DataValues package or the Missings package. Suppose that all the read_* functions had an optional Bool argument to control this choice with its value being stored in the ReadStatDataFrame. The handle_variable! function then ends with

    push!(ds.data, ds.useMissings ? fill(Union{jtype, Missing}(missing), ds.rows) : DataValueVector{ds.rows))

I think the only additional changes would be to use

dest::Union{DataValueVector{T}, Vector{Union{Missing,T}}}

in the signature of the readfield! methods and to add the Missings package.

Shall I prepare a PR for your consideration?

dmbates avatar Jan 13 '18 17:01 dmbates

What is the use-case for this? https://github.com/davidanthoff/StatFiles.jl should support getting a DataFrame that uses Missing, and my idea here was that ReadStat.jl really is just a low-level library that most folks don't really interact with.

davidanthoff avatar Jan 13 '18 19:01 davidanthoff

I believe that as of version 0.7 of Julia, the Missing type, the missing value and the ismissing function will be part of Base.

dmbates avatar Jan 18 '18 19:01 dmbates

Yes, that is so. Should we revisit once 0.7 is out (or at least a beta or something like that)? There are some really complicated integration questions on how all of this interacts with Query.jl that just will require more time.

But I do hope that in the meantime the approach with StatFiles.jl should work for folks that want a DataFrame that has Vector{Union{T,Missing}} columns, that should all just work already.

davidanthoff avatar Jan 18 '18 19:01 davidanthoff

I agree that waiting until 0.7 is at least in beta is a good idea.

dmbates avatar Jan 18 '18 19:01 dmbates

I don't think 0.7 will change a lot of things (except performance) here. You could as well use Missings.jl for now, it should provide the same features on 0.6.

nalimilan avatar Jan 19 '18 17:01 nalimilan

I have a version using Missing in the julia7 branch in https://github.com/dmbates/ReadStat.jl

I haven't created a pull request yet. I wanted to check first with @davidanthoff to see if he would be open to this approach. This version works on Julia 0.6.2 and Julia 0.7.0-dev.

Tests don't work because this returns Vector{Union{Missing,T}} in the data field. I can add/change tests if this approach is acceptable.

dmbates avatar Jan 20 '18 21:01 dmbates