BSON.jl
Error when saving Flux models/CuArrays from GPU
Flux models or CuArrays saved while they are in GPU memory can only be loaded again in the same Julia session. Once this session is terminated and a new session is started, loading this data will either result in random values or in a CUDA error. This can render trained and saved models useless, since most of the time they will be loaded in a new Julia session... MWE (for CuArrays):
using BSON: @save, @load
using CUDAdrv
using CuArrays
using Flux
data = [1 2 3; 4 5 6]
data = data |> gpu
@show data
@save "data.bson" data
@load "data.bson" data
@show data
In the same session, this gives me the correct output:
data = Float32[1.0 2.0 3.0; 4.0 5.0 6.0]
data = Float32[1.0 2.0 3.0; 4.0 5.0 6.0]
2×3 CuArray{Float32,2}:
1.0 2.0 3.0
4.0 5.0 6.0
Loading the data in a new session
using BSON: @load
using CUDAdrv
using CuArrays
using Flux
@load "data.bson" data
@show data
will result in an error:
ERROR: CUDA error: invalid argument (code #1, ERROR_INVALID_VALUE)
Stacktrace:
[1] macro expansion at /home/user/.julia/packages/CUDAdrv/WVU1H/src/base.jl:147 [inlined]
[2] #copy!#10(::Nothing, ::Bool, ::Function, ::Ptr{Float32}, ::CUDAdrv.Mem.DeviceBuffer, ::Int64) at /home/user/.julia/packages/CUDAdrv/WVU1H/src/memory.jl:344
[3] copy! at /home/user/.julia/packages/CUDAdrv/WVU1H/src/memory.jl:335 [inlined]
[4] copyto!(::Array{Float32,2}, ::Int64, ::CuArray{Float32,2}, ::Int64, ::Int64) at /home/user/.julia/packages/CuArrays/PwSdF/src/array.jl:194
[5] show(::Base.GenericIOBuffer{Array{UInt8,1}}, ::CuArray{Float32,2}) at /home/user/.julia/packages/GPUArrays/fAX0Q/src/abstractarray.jl:101
[6] #sprint#340(::Nothing, ::Int64, ::Function, ::Function, ::CuArray{Float32,2}) at ./strings/io.jl:101
[7] #sprint at ./none:0 [inlined]
[8] #repr#341 at ./strings/io.jl:208 [inlined]
[9] repr(::CuArray{Float32,2}) at ./strings/io.jl:208
[10] top-level scope at show.jl:555
The error occurs during the show command and not during loading!
I experienced the same issue when I tried to save Flux models. Saving and loading worked without errors, but the loaded model did not have the trained weights, only random values.
The Flux documentation only says that GPU support needs to be available when loading models which were in GPU memory when saved.
Right, you can't save CuArrays with BSON.jl. Doing data = data |> Flux.cpu before saving your model should fix this (of course, when you load it again it will just be a regular Array, not a CuArray).
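For reference, a minimal sketch of that round trip (the file name is illustrative, and model stands for whatever trained network you have):

using BSON: @save, @load
using Flux

model_cpu = cpu(model)   # move all parameters off the GPU
@save "model.bson" model_cpu

# later, possibly in a fresh Julia session:
@load "model.bson" model_cpu
model = gpu(model_cpu)   # move back to the GPU if needed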
"Right, you can't save CuArrays with BSON.jl." Okay, but then doing this should at least produce some error. As in many other applications, no output (error/warning message) means everything went as expected! You can get into big trouble if you are not aware of this issue and save your model after a time-consuming learning phase...
"Doing data = data |> Flux.cpu before saving your model should fix this." This is what I am doing now too, but I think the Flux documentation should be clearer about this.
Agreed. If you have time, it would be great if you can submit a Flux PR to make this very clear in the docs.
Agreed, I just had the same problem and it's written nowhere in the Flux documentation.
Sounds like this issue has been resolved?
Out of curiosity, is it possible to overload some method so that when someone tries to save a CuArray to BSON, it copies the data into an Array and saves that? And maybe it's also possible to save a tiny bit of metadata so that when loading a "CuArray" from disk, it creates an Array and copies the data over into a newly created CuArray?
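One way to picture that round trip is with explicit helpers instead of method overloads (all names here are hypothetical, not part of BSON):

using BSON, CuArrays

# Save: copy device memory to the host and record where the data lived.
save_with_meta(path, x::CuArray) = BSON.bson(path, data = collect(x), was_gpu = true)
save_with_meta(path, x::AbstractArray) = BSON.bson(path, data = x, was_gpu = false)

# Load: restore the data to the device it came from.
function load_with_meta(path)
    d = BSON.load(path)
    d[:was_gpu] ? CuArray(d[:data]) : d[:data]
end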
I would say model = model |> gpu is not a solution because it corrupts the correspondence in stateful optimisers. For example, the Adam optimiser uses an IdDict to keep track of the momentum for different params. After |> gpu the object ids change and the optimiser state has to start from scratch. This means in the end we are not able to resume training: BSON saving and loading can only be used for training => saving and then loading => inference. Restarting training with a blank optimiser state would break reproducibility.
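The identity problem is easy to demonstrate in plain Julia, without Flux:

w = rand(Float32, 3)
st = IdDict(w => "optimiser state for w")   # keyed by object identity (===)

w2 = copy(w)     # equal values, new object -- like params after a cpu/gpu round trip
haskey(st, w)    # true
haskey(st, w2)   # false: the stored state no longer matches the reloaded parameters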
"If you have time, it would be great if you can submit a Flux PR to make this very clear in the docs."
@jpsamaroo a fellow student hit this bug last week. I'm thinking we throw an info message and automatically move the data to the CPU for them.
Where should this live? Perhaps at https://github.com/JuliaGPU/CUDA.jl/blob/603edb87891da8fd5b2623f17544aebe9706069a/src/array.jl#L68? Unfortunately there's no interface package defining this type, so I'm thinking of adding a Requires.jl hook here in BSON?
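A rough sketch of what such a hook could look like (hypothetical code, not in BSON; it assumes BSON.lower is the serialisation extension point and that the UUID below is CuArrays.jl's):

using Requires

function __init__()
    @require CuArrays="3a865a2d-5b23-5a0f-bc46-62713ec82fae" begin
        function BSON.lower(x::CuArrays.CuArray)
            @info "CuArray found while saving; copying it to a host Array."
            BSON.lower(collect(x))
        end
    end
end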