JLD2.jl

Explicit Type Remapping & Anonymous Functions

Open JonasIsensee opened this issue 3 years ago • 5 comments

This PR finally implements what is needed to store anonymous functions using JLD2. Most of the Julia side of things is borrowed from BSON, but additional trickery was needed to integrate all this with JLD2.

AFAICT the memory layout of functions / typenames / methods has changed from Julia 1.5 to 1.6, and this PR only supports 1.6.

As a side effect, this PR also implements explicit type remapping to allow renaming types on load. This can be useful when working with multiple versions of the same struct (e.g. when the file still contains the old one).

Explicit Type Remapping

Sometimes you store data using structs that you defined yourself or that ship with some package, and weeks later, when you want to load the data, the struct definitions have changed.

using JLD2
struct A
    x::Int
end

jldsave("example.jld2"; a = A(42))

This results in warnings and sometimes even errors when trying to load the file as demonstrated here.

julia> using JLD2

julia> struct A{T}
            x::T
       end

julia> load("example.jld2")
┌ Warning: read type A is not a leaf type in workspace; reconstructing
└ @ JLD2 ~/.julia/dev/JLD2/src/data/reconstructing_datatypes.jl:273
Dict{String, Any} with 1 entry:
  "a" => var"##A#257"(42)

As of JLD2 v0.4.5 there is a fix. The JLDFile struct contains a type_map dictionary that allows for explicit type remapping. You can now define a struct that matches the old definition, map the stored type name to it, and load your data.

julia> struct A_old
            x::Int
       end

julia> f = jldopen("example.jld2","r")
JLDFile /home/jonas/.julia/dev/JLD2/example.jld2 (read-only)
 └─🔢 a

julia> f.type_map["Main.A"] = A_old
A_old

julia> f["a"]
A_old(42)

closes #208
closes #191
closes #175
closes #288
todo: storing typeof(anonfun) #37

JonasIsensee, May 16 '21 16:05

Codecov Report

Merging #316 (e3f52a7) into master (a9c62a6) will increase coverage by 0.29%. The diff coverage is 98.03%.

@@            Coverage Diff             @@
##           master     #316      +/-   ##
==========================================
+ Coverage   89.88%   90.18%   +0.29%     
==========================================
  Files          27       28       +1     
  Lines        2720     2813      +93     
==========================================
+ Hits         2445     2537      +92     
- Misses        275      276       +1     
Impacted Files                         Coverage Δ
src/file_header.jl                     78.57% <ø> (ø)
src/data/anonymous_functions.jl        96.66% <96.66%> (ø)
src/JLD2.jl                            90.85% <100.00%> (+1.14%)
src/data/reconstructing_datatypes.jl   76.36% <100.00%> (+2.36%)
src/data/writing_datatypes.jl          96.96% <100.00%> (+0.38%)
src/backwards_compatibility.jl         62.50% <0.00%> (-12.50%)
src/dataio.jl                          98.44% <0.00%> (-0.02%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update a9c62a6...e3f52a7.

codecov[bot], May 16 '21 16:05

What is missing from this PR? AFAIK, once this is merged, JLD2 would be a better candidate than BSON for serializing Flux models in basically any situation.

CarloLucibello, Jun 28 '21 13:06

There are two things that are missing:

  • Proper review. Currently, I appear to be the only one familiar enough with JLD2 internals and willing to implement stuff like this. Since JLD2 is used by a lot of people, I was hesitant to just merge this without outside opinions.
  • I'd really like to resolve #37, but this problem is quite deeply embedded in JLD2 and not fixable without "breaking" changes. If I merge this PR before fixing #37, I will have to implement even more legacy handling to avoid breaking anyone's files.

The issue with #37 is this: For every dataset, JLD2 stores essentially the

  • content
  • description of content (e.g. memory layout on disk)
  • name of datatype

This works well for data, but for datatypes JLD2 is hardcoded to use the datatype signature as content. Thus, if the signature of a stored datatype is not known in a new Julia session, the datatype is impossible to reconstruct.

The fix:

  1. Change serialization of datatypes to contain a description of their (instance) layout.
  2. Change deserialization to create a new datatype from that description when the loaded datatype is not known.
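
The two steps above can be sketched roughly as follows. This is only an illustration of the idea in plain Julia, not JLD2's implementation: the names `Point`, `describe_layout` and `reconstruct_type` are hypothetical, and a real on-disk format would need to handle parametric types, nested structs and type names that do not resolve in the target module.

```julia
# Hypothetical sketch, not JLD2's actual implementation.
struct Point
    x::Int
    y::Float64
end

# Step 1: record the instance layout as plain data
# (type name, field names, and field type names as strings).
describe_layout(T::DataType) = (
    name = string(nameof(T)),
    fields = [string(n) => string(t) for (n, t) in zip(fieldnames(T), fieldtypes(T))],
)

# Step 2: build a stand-in struct from such a description.
# A gensym'd name avoids clashing with a type of the same name in `mod`.
function reconstruct_type(desc; mod::Module = Main)
    tname = gensym(desc.name)
    fields = [Expr(:(::), Symbol(n), Meta.parse(t)) for (n, t) in desc.fields]
    # Expr(:struct, false, name, body) is the AST of an immutable struct definition.
    Core.eval(mod, Expr(:struct, false, tname, Expr(:block, fields...)))
    return Core.eval(mod, tname)
end

desc = describe_layout(Point)
Reconstructed = reconstruct_type(desc)
# invokelatest sidesteps world-age issues when the constructor is called
# from code that was compiled before the type existed.
obj = Base.invokelatest(Reconstructed, 1, 2.0)
```

The point of the sketch is that the description carries enough information to rebuild the field layout without the original definition of `Point` being available.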

JonasIsensee, Jun 30 '21 13:06

Is this branch workable for anonymous functions now? I tried the current release version. It correctly saves and loads a dataset containing anonymous functions within a single Julia session, but when I start a new Julia session and load the same packages, it loads everything except the anonymous functions.

I also tried BSON, JLD, and JLSO. BSON failed while saving, probably because my dataset contains NamedTuples of different types. JLD could save but failed to load. JLSO behaves like JLD2: it can save and load within a single session but cannot load in a new session.

babaq, Apr 17 '22 01:04

Hi @babaq, I'm afraid it is not. I built this at some point and got it working partially. However, there have been significant changes to how this works between e.g. Julia 1.6 and 1.7, so it is very difficult to get working reliably. Something else you could try is #377. This is a pathfinder PR that would, in principle, allow us to write objects (e.g. anonymous functions) as binary blobs using the Julia Serialization stdlib. It still needs work, but I hope that this is more doable.
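
For context, the basic idea of turning an anonymous function into a binary blob with the Serialization stdlib looks roughly like this. It is a minimal sketch of the stdlib round trip only and does not show how #377 integrates it with JLD2. The variable names are illustrative, and the round trip here happens within a single session; the Serialization format is not guaranteed to be stable across Julia versions.

```julia
using Serialization

# An anonymous function we would like to store.
f = x -> 2x + 1

# Serialize into an in-memory buffer and extract the raw bytes.
io = IOBuffer()
serialize(io, f)
blob = take!(io)          # Vector{UInt8} that could be written to a file

# Read the function back from the bytes.
g = deserialize(IOBuffer(blob))

# invokelatest guards against world-age issues if this runs inside a function.
result = Base.invokelatest(g, 3)
```

A file format could store `blob` as an opaque byte array next to the regular datasets, at the cost of tying the file to the Julia version that wrote it.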

JonasIsensee, Apr 17 '22 09:04