Handle undefined values in arrays
For now this just writes undefineds as nothing (which works because we don't preserve non-isbits-eltypes anyway). Probably better to use a different sentinel and preserve undefined-ness through a roundtrip.
See #6, cc @DilumAluthge
Does this mess up the Array type? I.e. instead of being of type Array{T} it would turn into Array{Union{T, Void}}?
Yeah so right now this doesn't work for saving Dicts.
import BSON
x = Dict("a" => :a)
BSON.@save "test.bson" x
Now quit Julia and re-open it.
import BSON
BSON.@load "test.bson" x
You get the following error:
ERROR: MethodError: Cannot `convert` an object of type Void to an object of type String
This may have arisen from a call to the constructor String(...),
since type constructors fall back to convert methods.
Stacktrace:
[1] copy!(::IndexLinear, ::Array{String,1}, ::IndexLinear, ::Array{Any,1}) at ./abstractarray.jl:656
[2] convert(::Type{Array{String,1}}, ::Array{Any,1}) at ./array.jl:368
[3] newstruct!(::Dict{String,Symbol}, ::Array{UInt8,1}, ::Array{Any,1}, ::Array{Any,1}, ::Int64, ::Int64, ::UInt64, ::Int64, ::Int64) at /Users/dilum/.julia/v0.6/BSON/src/extensions.jl:92
[4] newstruct_raw(::ObjectIdDict, ::Type{T} where T, ::Dict{Symbol,Any}) at /Users/dilum/.julia/v0.6/BSON/src/extensions.jl:110
[5] (::BSON.##53#54)(::Dict{Symbol,Any}, ::ObjectIdDict) at /Users/dilum/.julia/v0.6/BSON/src/extensions.jl:125
[6] raise_recursive(::Dict{Symbol,Any}, ::ObjectIdDict) at /Users/dilum/.julia/v0.6/BSON/src/read.jl:80
[7] (::BSON.##24#26{ObjectIdDict})(::Dict{Symbol,Any}) at /Users/dilum/.julia/v0.6/BSON/src/read.jl:74
[8] applychildren!(::BSON.##24#26{ObjectIdDict}, ::Dict{Symbol,Any}) at /Users/dilum/.julia/v0.6/BSON/src/BSON.jl:18
[9] _raise_recursive(::Dict{Symbol,Any}, ::ObjectIdDict) at /Users/dilum/.julia/v0.6/BSON/src/read.jl:74
[10] raise_recursive(::Dict{Symbol,Any}, ::ObjectIdDict) at /Users/dilum/.julia/v0.6/BSON/src/read.jl:81
[11] load(::String) at /Users/dilum/.julia/v0.6/BSON/src/read.jl:96
I think that when we parse a doc, we should skip the nothings.
I think we should just use BSON's undefined as a sentinel. It's technically deprecated but I'm not sure we can really expect this case to have good interop with other readers anyway.
Would it also handle missing? The test case is:
df = DataFrame(x=1:3, b = [:a,missing,:c], c = rand(3), d = categorical(["a","b",missing]))
Trying to save it currently crashes Julia.
@MikeInnes Should we revisit this? It would be really nice to be able to write arrays with undefined values.
I agree that we should probably just use the BSON undefined as our sentinel.
Would it also handle
missing? The test case is:df = DataFrame(x=1:3, b = [:a,missing,:c], c = rand(3), d = categorical(["a","b",missing]))Trying to save it currently crashes Julia.
@bkamins Saving/loading a DataFrame with missing values works now.
Here is the example code for saving: (using the same test case as in your comment)
using Pkg
Pkg.add("BSON")
Pkg.add("CategoricalArrays")
Pkg.add("DataFrames")
using BSON, CategoricalArrays, DataFrames
df = DataFrame(x=1:3, b = [:a,missing,:c], c = rand(3), d = categorical(["a","b",missing]))
bson("test.bson", Dict(:df => df))
Now quit and restart Julia, and then run the example code for loading:
using BSON, CategoricalArrays, DataFrames
data = BSON.load("test.bson")
data[:df]
And the DataFrame has been loaded correctly, missing values and all.
The output of Pkg.status():
julia> Pkg.status()
Status `~/.julia/environments/v1.1/Project.toml`
[fbb218c0] BSON v0.2.3
[324d7699] CategoricalArrays v0.5.2
[a93c6f00] DataFrames v0.18.3
And the output of versioninfo(verbose=true):
julia> versioninfo(verbose=true)
Julia Version 1.1.1
Commit 55e36cc (2019-05-16 04:10 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin15.6.0)
uname: Darwin 18.6.0 Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64 x86_64 i386
CPU: Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz:
speed user nice sys idle irq
#1 2600 MHz 83985 s 0 s 33784 s 256810 s 0 s
#2 2600 MHz 39330 s 0 s 13825 s 321413 s 0 s
#3 2600 MHz 95457 s 0 s 29412 s 249699 s 0 s
#4 2600 MHz 38662 s 0 s 12898 s 323007 s 0 s
Memory: 8.0 GB (571.66796875 MB free)
Uptime: 97398.0 sec
Load Avg: 2.6552734375 2.51123046875 2.33056640625
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
Environment:
JULIA_DEBUG = all
GEM_HOME = /Users/dilum/.rvm/gems/ruby-2.4.1
TERM = xterm-256color
MY_RUBY_HOME = /Users/dilum/.rvm/rubies/ruby-2.4.1
PATH = /Users/dilum/.rvm/gems/ruby-2.4.1/bin:/Users/dilum/.rvm/gems/ruby-2.4.1@global/bin:/Users/dilum/.rvm/rubies/ruby-2.4.1/bin:/Users/dilum/.rvm/bin:/Users/dilum/.cargo/bin:/Applications/Julia-1.1.app/Contents/Resources/julia/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/TeX/texbin:/usr/local/go/bin:/usr/local/MacGPG2/bin:/opt/X11/bin
XPC_FLAGS = 0x0
HOME = /Users/dilum
GEM_PATH = /Users/dilum/.rvm/gems/ruby-2.4.1:/Users/dilum/.rvm/gems/ruby-2.4.1@global
@MikeInnes
Also, I couldn't find an issue tracking this bug specifically (undefined values in arrays), so I opened the following issue: https://github.com/MikeInnes/BSON.jl/issues/43
In collect_any, Vector{Any}(length(xs)) isn't a valid constructor..., anymore? Im not sure if it used to be or if I am missing something, but I get an error when I run this. I'm not sure there is a way around it.

Also what happens if the position of the data matters in an array. It seems like
collect_all([#undef, #undef, #undef, "foo"]) -> ["foo"]
Which is bad if I want foo to remain in the 4th spot of the array.
Am I correct in assessing this?
I think all this to say that reworking the splat operator for #undef might be needed and working with the undef sentinel is probably the best way to go.
There is a small change to how we call the constructor, but apart from that the collect_any function works fine. Simply replace Vector{Any}(length(xs)) with Vector{Any}(undef, length(xs)).
function collect_any(xs)
ys = Vector{Any}(undef, length(xs))
for i = 1:length(xs)
isassigned(xs, i) && (ys[i] = xs[i])
end
return ys
end
And then these examples work exactly as expected:
julia> function collect_any(xs)
ys = Vector{Any}(undef, length(xs))
for i = 1:length(xs)
isassigned(xs, i) && (ys[i] = xs[i])
end
return ys
end
collect_any (generic function with 1 method)
julia> a = [1,2,3,4,5]
5-element Array{Int64,1}:
1
2
3
4
5
julia> collect_any(a)
5-element Array{Any,1}:
1
2
3
4
5
julia> b = Array{String}(undef, 5)
5-element Array{String,1}:
#undef
#undef
#undef
#undef
#undef
julia> b[2] = "hello"
"hello"
julia> b[4] = "world"
"world"
julia> b
5-element Array{String,1}:
#undef
"hello"
#undef
"world"
#undef
julia> collect_any(b)
5-element Array{Any,1}:
#undef
"hello"
#undef
"world"
#undef
I just ran this code on Julia 1.1.1 and it worked as expected.
I just ran this code on Julia 1.1.1 and it worked as expected.
K great, that's what I was getting at, the undef part was missing. Thanks