Handle undefined values in arrays

Open MikeInnes opened this issue 7 years ago • 12 comments

For now this just writes undefined elements as nothing (which works because we don't preserve non-isbits eltypes anyway). It would probably be better to use a different sentinel and preserve undefined-ness through a round trip.

See #6, cc @DilumAluthge
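
A minimal sketch of what the description above means, written for Julia 1.x and independent of BSON.jl's internals. `to_nothing` is a hypothetical helper, not a function from the package. It shows that once undefined slots are written as `nothing`, a reader can no longer tell them apart from slots that really held `nothing`.

```julia
# Hypothetical helper (not part of BSON.jl): copy `xs` into a Vector{Any},
# writing `nothing` wherever a slot is undefined.
function to_nothing(xs)
    ys = Vector{Any}(undef, length(xs))
    for i in 1:length(xs)
        ys[i] = isassigned(xs, i) ? xs[i] : nothing
    end
    return ys
end

a = Vector{Any}(undef, 3)   # three undefined slots
a[1] = "foo"
a[2] = nothing              # a slot that really holds `nothing`

out = to_nothing(a)
# out[2] and out[3] are both `nothing`, so the undefined-ness of slot 3 is lost.
```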

MikeInnes avatar Jun 25 '18 13:06 MikeInnes

Does this mess up the Array type? I.e. instead of being of type Array{T} it would turn into Array{Union{T, Void}}?
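
A small sketch of why the element type matters here, using Julia 1.x names (`Nothing` replaced `Void` in Julia 0.7). It is an illustration only and makes no claim about what BSON.jl does internally: a `Vector{String}` cannot hold `nothing`, so storing the placeholder in the array itself would need a widened element type.

```julia
v = Vector{String}(undef, 2)
v[1] = "a"

# Storing `nothing` in a Vector{String} fails, because `nothing` cannot be
# converted to a String.
failed = try
    v[2] = nothing
    false
catch err
    err isa MethodError
end

# A widened element type accepts it, but the array is no longer a Vector{String}.
w = Vector{Union{String, Nothing}}(undef, 2)
w[1] = "a"
w[2] = nothing
```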

DilumAluthge avatar Jun 29 '18 01:06 DilumAluthge

Yeah, so right now this doesn't work for Dicts: saving succeeds, but loading fails.

import BSON
x = Dict("a" => :a)
BSON.@save "test.bson" x

Now quit Julia and re-open it.

import BSON
BSON.@load "test.bson" x

You get the following error:

ERROR: MethodError: Cannot `convert` an object of type Void to an object of type String
This may have arisen from a call to the constructor String(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] copy!(::IndexLinear, ::Array{String,1}, ::IndexLinear, ::Array{Any,1}) at ./abstractarray.jl:656
 [2] convert(::Type{Array{String,1}}, ::Array{Any,1}) at ./array.jl:368
 [3] newstruct!(::Dict{String,Symbol}, ::Array{UInt8,1}, ::Array{Any,1}, ::Array{Any,1}, ::Int64, ::Int64, ::UInt64, ::Int64, ::Int64) at /Users/dilum/.julia/v0.6/BSON/src/extensions.jl:92
 [4] newstruct_raw(::ObjectIdDict, ::Type{T} where T, ::Dict{Symbol,Any}) at /Users/dilum/.julia/v0.6/BSON/src/extensions.jl:110
 [5] (::BSON.##53#54)(::Dict{Symbol,Any}, ::ObjectIdDict) at /Users/dilum/.julia/v0.6/BSON/src/extensions.jl:125
 [6] raise_recursive(::Dict{Symbol,Any}, ::ObjectIdDict) at /Users/dilum/.julia/v0.6/BSON/src/read.jl:80
 [7] (::BSON.##24#26{ObjectIdDict})(::Dict{Symbol,Any}) at /Users/dilum/.julia/v0.6/BSON/src/read.jl:74
 [8] applychildren!(::BSON.##24#26{ObjectIdDict}, ::Dict{Symbol,Any}) at /Users/dilum/.julia/v0.6/BSON/src/BSON.jl:18
 [9] _raise_recursive(::Dict{Symbol,Any}, ::ObjectIdDict) at /Users/dilum/.julia/v0.6/BSON/src/read.jl:74
 [10] raise_recursive(::Dict{Symbol,Any}, ::ObjectIdDict) at /Users/dilum/.julia/v0.6/BSON/src/read.jl:81
 [11] load(::String) at /Users/dilum/.julia/v0.6/BSON/src/read.jl:96
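
Frame [2] of the stack trace is a `convert` from `Array{Any,1}` to `Array{String,1}`. A sketch of the same failure on Julia 1.x, without BSON.jl: it assumes the loaded field data arrives as a `Vector{Any}` in which unused slots were written as `nothing` (on Julia 1.x the type is `Nothing` rather than `Void`). The variable names are illustrative.

```julia
# Assumed shape of the data after loading: one real key and one slot that
# was stored as `nothing`.
loaded = Any["a", nothing]

# Converting to Vector{String} converts each element, and `nothing` has no
# conversion to String, so a MethodError is thrown.
msg = try
    convert(Vector{String}, loaded)
    "no error"
catch err
    err isa MethodError ? "MethodError" : "other error"
end
```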

DilumAluthge avatar Jul 01 '18 20:07 DilumAluthge

I think that when we parse a doc, we should skip the nothings.

DilumAluthge avatar Jul 01 '18 20:07 DilumAluthge

I think we should just use BSON's undefined as a sentinel. It's technically deprecated but I'm not sure we can really expect this case to have good interop with other readers anyway.
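
A rough sketch of the sentinel idea on Julia 1.x. The names `UndefinedSentinel`, `lower_undef`, and `raise_undef` are hypothetical and do not come from BSON.jl. The sketch only illustrates keeping undefined slots distinguishable through a round trip, not how the package encodes them on disk.

```julia
# Hypothetical marker for an undefined slot (stands in for BSON's `undefined` type).
struct UndefinedSentinel end

# Replace undefined slots with the sentinel so every position can be written out.
function lower_undef(xs::AbstractVector)
    ys = Vector{Any}(undef, length(xs))
    for i in 1:length(xs)
        ys[i] = isassigned(xs, i) ? xs[i] : UndefinedSentinel()
    end
    return ys
end

# Rebuild a Vector{T}, leaving sentinel positions undefined.
function raise_undef(::Type{T}, ys::AbstractVector) where {T}
    xs = Vector{T}(undef, length(ys))
    for i in 1:length(ys)
        ys[i] isa UndefinedSentinel || (xs[i] = ys[i])
    end
    return xs
end

b = Vector{String}(undef, 4)
b[4] = "foo"

restored = raise_undef(String, lower_undef(b))
# `restored` is a Vector{String} with "foo" still in position 4 and
# positions 1 to 3 still undefined.
```

Because the sentinel is a distinct type, a slot that really holds `nothing` would not be confused with an undefined one, and element positions are preserved.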

MikeInnes avatar Jul 04 '18 13:07 MikeInnes

Would it also handle missing? The test case is:

df = DataFrame(x=1:3, b = [:a,missing,:c], c = rand(3), d = categorical(["a","b",missing]))

Trying to save it currently crashes Julia.

bkamins avatar Sep 22 '18 02:09 bkamins

@MikeInnes Should we revisit this? It would be really nice to be able to write arrays with undefined values.

I agree that we should probably just use the BSON undefined as our sentinel.

DilumAluthge avatar Jun 06 '19 07:06 DilumAluthge

@bkamins Saving/loading a DataFrame with missing values works now.

Here is example code for saving, using the same test case as in your comment:

using Pkg
Pkg.add("BSON")
Pkg.add("CategoricalArrays")
Pkg.add("DataFrames")
using BSON, CategoricalArrays, DataFrames
df = DataFrame(x=1:3, b = [:a,missing,:c], c = rand(3), d = categorical(["a","b",missing]))
bson("test.bson", Dict(:df => df))

Now quit and restart Julia, and then run the example code for loading:

using BSON, CategoricalArrays, DataFrames
data = BSON.load("test.bson")
data[:df]

And the DataFrame has been loaded correctly, missing values and all.

The output of Pkg.status():

julia> Pkg.status()
    Status `~/.julia/environments/v1.1/Project.toml`
  [fbb218c0] BSON v0.2.3
  [324d7699] CategoricalArrays v0.5.2
  [a93c6f00] DataFrames v0.18.3

And the output of versioninfo(verbose=true):

julia> versioninfo(verbose=true)
Julia Version 1.1.1
Commit 55e36cc (2019-05-16 04:10 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin15.6.0)
  uname: Darwin 18.6.0 Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64 x86_64 i386
  CPU: Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz:
              speed         user         nice          sys         idle          irq
       #1  2600 MHz      83985 s          0 s      33784 s     256810 s          0 s
       #2  2600 MHz      39330 s          0 s      13825 s     321413 s          0 s
       #3  2600 MHz      95457 s          0 s      29412 s     249699 s          0 s
       #4  2600 MHz      38662 s          0 s      12898 s     323007 s          0 s

  Memory: 8.0 GB (571.66796875 MB free)
  Uptime: 97398.0 sec
  Load Avg:  2.6552734375  2.51123046875  2.33056640625
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
Environment:
  JULIA_DEBUG = all
  GEM_HOME = /Users/dilum/.rvm/gems/ruby-2.4.1
  TERM = xterm-256color
  MY_RUBY_HOME = /Users/dilum/.rvm/rubies/ruby-2.4.1
  PATH = /Users/dilum/.rvm/gems/ruby-2.4.1/bin:/Users/dilum/.rvm/gems/ruby-2.4.1@global/bin:/Users/dilum/.rvm/rubies/ruby-2.4.1/bin:/Users/dilum/.rvm/bin:/Users/dilum/.cargo/bin:/Applications/Julia-1.1.app/Contents/Resources/julia/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/TeX/texbin:/usr/local/go/bin:/usr/local/MacGPG2/bin:/opt/X11/bin
  XPC_FLAGS = 0x0
  HOME = /Users/dilum
  GEM_PATH = /Users/dilum/.rvm/gems/ruby-2.4.1:/Users/dilum/.rvm/gems/ruby-2.4.1@global

DilumAluthge avatar Jun 06 '19 07:06 DilumAluthge

@MikeInnes

Also, I couldn't find an issue tracking this bug specifically (undefined values in arrays), so I opened the following issue: https://github.com/MikeInnes/BSON.jl/issues/43

DilumAluthge avatar Jun 06 '19 08:06 DilumAluthge

In collect_any, Vector{Any}(length(xs)) isn't a valid constructor anymore? I'm not sure if it used to be or if I am missing something, but I get an error when I run this. I'm not sure there is a way around it.

[Screenshot: Screen Shot 2019-06-11 at 12 39 32 PM]

Also, what happens if the position of the data matters in an array? It seems like collect_any([#undef, #undef, #undef, "foo"]) -> ["foo"], which is bad if I want "foo" to remain in the 4th spot of the array. Am I correct in assessing this?

All this to say: reworking the splat operator for #undef might be needed, and working with the undef sentinel is probably the best way to go.

Codyk12 avatar Jun 11 '19 16:06 Codyk12

There is a small change to how we call the constructor, but apart from that the collect_any function works fine. Simply replace Vector{Any}(length(xs)) with Vector{Any}(undef, length(xs)).

function collect_any(xs)
    ys = Vector{Any}(undef, length(xs))
    for i = 1:length(xs)
        isassigned(xs, i) && (ys[i] = xs[i])
    end
    return ys
end

And then these examples work exactly as expected:

julia> function collect_any(xs)
           ys = Vector{Any}(undef, length(xs))
           for i = 1:length(xs)
               isassigned(xs, i) && (ys[i] = xs[i])
           end
           return ys
       end
collect_any (generic function with 1 method)

julia> a = [1,2,3,4,5]
5-element Array{Int64,1}:
 1
 2
 3
 4
 5

julia> collect_any(a)
5-element Array{Any,1}:
 1
 2
 3
 4
 5

julia> b = Array{String}(undef, 5)
5-element Array{String,1}:
 #undef
 #undef
 #undef
 #undef
 #undef

julia> b[2] = "hello"
"hello"

julia> b[4] = "world"
"world"

julia> b
5-element Array{String,1}:
 #undef
    "hello"
 #undef
    "world"
 #undef 

julia> collect_any(b)
5-element Array{Any,1}:
 #undef
    "hello"
 #undef
    "world"
 #undef

DilumAluthge avatar Jun 11 '19 16:06 DilumAluthge

I just ran this code on Julia 1.1.1 and it worked as expected.

DilumAluthge avatar Jun 11 '19 16:06 DilumAluthge

K great, that's what I was getting at, the undef part was missing. Thanks

Codyk12 avatar Jun 11 '19 16:06 Codyk12