
RFC: Improve some type definitions and decrease specialization

Open timholy opened this issue 4 years ago • 7 comments

This PR is an alternative to #340, although it would be possible to add precompiles to this in a later PR. In general, I've found that the first and most impactful step is to analyze specialization costs using SnoopCompile's pgdsgui tool. Another reason to do this first is that it affects the MethodInstances and hence the specific precompiles that get generated.

Using the following benchmark,

@time begin
    include("test/rw.jl")
    include("test/loadsave.jl")
end

I get the following timings:

  • master prior to merge of #310: 65.2s
  • master: 61.6s (so yes, it helped)
  • this PR: 47.4s

That said, there are some concerns. Some benchmarks (notably, the "one_element_arrays" benchmarks) are substantially slower. This can be fixed with the following diff:

diff --git a/src/datasets.jl b/src/datasets.jl
index 803c7b5..6678cd6 100644
--- a/src/datasets.jl
+++ b/src/datasets.jl
@@ -160,7 +160,7 @@ end
 
 # Most types can only be scalars or arrays
 function read_data(f::JLDFile,
-     @nospecialize(rr),
+     rr,
      read_dataspace::Tuple{ReadDataspace,RelOffset,Int,UInt16},
      attributes::Union{Vector{ReadAttribute},Nothing}=nothing)
 
@@ -215,7 +215,6 @@ function read_data(f::JLDFile,
                              ReadRepresentation{T,CustomSerialization{S,nothing}} where {S,T}},
                    read_dataspace::Tuple{ReadDataspace,RelOffset,Int,UInt16},
                    attributes::Vector{ReadAttribute})
-    @nospecialize rr
     dataspace, header_offset, data_length, filter_id = read_dataspace
     filter_id != 0 && throw(UnsupportedFeatureException())
     dataspace.dataspace_type == DS_NULL || throw(UnsupportedFeatureException())
@@ -253,7 +252,7 @@ function read_empty(rr::ReadRepresentation{T}, f::JLDFile,
     jlread(io, FixedPointDatatype) == h5fieldtype(f, Int64, Int64, Val{true}) || throw(UnsupportedFeatureException())
 
     seek(io, dimensions_attr.data_offset)
-    v = construct_array(io, T, ndims)::Array{T}
+    v = construct_array(io, T, ndims)
     if isconcretetype(T)
     # Performance-wise, it would be better to move this to a separate function (if performance here matters) since inference can't
         # know the dimensionality of `v`
@@ -291,7 +290,7 @@ end
 Construct array by reading `ndims` dimensions from `io`. Assumes `io` has already been
 seeked to the correct position.
 """
-function construct_array(io::IO, @nospecialize(T::Type), ndims::Int)
+function construct_array(io::IO, ::Type{T}, ndims::Int)::Array{T} where T
     if ndims == 1
         n = jlread(io, Int64)
         Vector{T}(undef, n)
@@ -312,7 +311,7 @@ end
 
 
 function read_array(f::JLDFile, dataspace::ReadDataspace,
-                    @nospecialize(rr::ReadRepresentation), data_length::Int,
+                    rr::ReadRepresentation, data_length::Int,
                     filter_id::UInt16, header_offset::RelOffset,
                     attributes::Union{Vector{ReadAttribute},Nothing})
     T = eltype(rr)
@@ -321,7 +320,7 @@ function read_array(f::JLDFile, dataspace::ReadDataspace,
     ndims, offset = get_ndims_offset(f, dataspace, attributes)
 
     seek(io, offset)
-    v = construct_array(io, T, Int(ndims))::Array{T}
+    v = construct_array(io, T, Int(ndims))
     n = length(v)
     seek(io, data_offset)
     if filter_id !=0
@@ -345,10 +344,10 @@ end
 
 function write_dataset(
         f::JLDFile,
-        @nospecialize(dataspace::WriteDataspace),
+        dataspace::WriteDataspace,
         datatype::H5Datatype,
-        @nospecialize(odr),
-        @nospecialize(data::Array),
+        odr,
+        data::Array,
         wsession::JLDWriteSession,
         compress = f.compress,
         )

but unfortunately it gives up most of the advantage.

There does appear to be a way to get the best of both worlds: adjust the dispatch chains so that rr is constructed at the last possible moment. Quite a few methods do some "generic" work while just passing rr down. If the call to jltype can be deferred (in the dispatch chain) until it's really needed, you'll compile fewer specializations and substantially reduce latency without compromising runtime performance.
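To illustrate the idea with a toy sketch (all names here are invented for illustration and are not JLD2's actual API): in the eager version, a type-parameterized object is built at the top of the chain, so every intermediate method specializes on it; in the deferred version, the upper layers stay unspecialized and the object is constructed only in the leaf that needs it.

```julia
struct Repr{T} end                     # stand-in for ReadRepresentation

# Eager: rr is built at the top, so step_a and step_b each compile one
# specialization per element type, even though only step_b uses T.
read_eager(T::Type) = step_a(Repr{T}())
step_a(rr::Repr) = step_b(rr)          # "generic" work, rr just passed along
step_b(::Repr{T}) where {T} = T[]      # the leaf that actually needs T

# Deferred: the upper layers carry the raw type unspecialized, and the
# Repr is constructed only in the leaf, so the dynamic dispatch (and the
# per-type compilation) is confined to a single call site.
read_deferred(T::Type) = step_a2(T)
step_a2(@nospecialize(T)) = step_b2(T) # compiles once for all T
step_b2(@nospecialize(T)) = step_b(Repr{T}())
```

Both `read_eager(Int)` and `read_deferred(Int)` return an empty `Vector{Int}`, but the deferred chain compiles far fewer MethodInstances when called with many different types.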

It should also be said that this PR improves some benchmarks, for example "int_vector serialize" drops from 3.6s to 1.4s. I did not test whether it was due to the improvements in the type definitions or the fact that with fewer specializations you often obtain less runtime dispatch (if there is only one MethodInstance to call, even if Julia can't infer the types it knows which one to call).

Overall, I'd say it might be best to take this PR as a starting point rather than something that should be merged immediately, unless the one-element-array case isn't really very important.

timholy avatar Aug 07 '21 12:08 timholy

CC @thchr

timholy avatar Aug 07 '21 12:08 timholy

Codecov Report

Merging #344 (239b296) into master (cf18c24) will increase coverage by 4.02%. The diff coverage is 98.14%.


@@            Coverage Diff             @@
##           master     #344      +/-   ##
==========================================
+ Coverage   86.01%   90.04%   +4.02%     
==========================================
  Files          28       27       -1     
  Lines        2810     2731      -79     
==========================================
+ Hits         2417     2459      +42     
+ Misses        393      272     -121     
Impacted Files                         Coverage Δ
src/datasets.jl                        92.28% <93.33%> (+0.89%)
src/JLD2.jl                            90.80% <100.00%> (+1.66%)
src/compression.jl                     96.82% <100.00%> (ø)
src/data/reconstructing_datatypes.jl   74.25% <100.00%> (+0.59%)
src/data/specialcased_types.jl         95.69% <100.00%> (+0.09%)
src/data/writing_datatypes.jl          96.90% <100.00%> (+0.32%)
src/dataio.jl                          98.52% <100.00%> (+0.06%)
src/dataspaces.jl                      92.59% <100.00%> (ø)
src/datatypes.jl                       82.51% <100.00%> (+2.09%)
src/loadsave.jl                        95.23% <100.00%> (ø)
... and 10 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update cf18c24...239b296.

codecov[bot] avatar Aug 07 '21 12:08 codecov[bot]

One could add a local version of inferencebarrier to support Julia 1.0. https://github.com/JuliaLang/julia/blob/4931faa34a8a1c98b39fb52ed4eb277729120128/base/essentials.jl#L758
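Following the linked Base definition, such a local fallback could look roughly like this sketch (a `Ref{Any}` round-trip hides the value's type from inference without changing it at runtime):

```julia
# Local fallback for Base.inferencebarrier on Julia versions that
# predate it; on newer versions, just reuse the Base function.
if isdefined(Base, :inferencebarrier)
    const inferencebarrier = Base.inferencebarrier
else
    # The Ref{Any} round-trip erases the inferred type of x, acting as
    # an inference barrier, while returning x unchanged at runtime.
    inferencebarrier(@nospecialize(x)) = Ref{Any}(x)[]
end
```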

timholy avatar Aug 07 '21 12:08 timholy

Hi @timholy, thanks so much for looking into this. Your suggestions look promising. Right now I don't have the time to pursue this in depth, but I'll come back to it in a few weeks. If anyone else wants to play with it in the meantime, I'm available for discussion.

JonasIsensee avatar Aug 09 '21 12:08 JonasIsensee

Many thanks from here as well, @timholy - there's a bunch to learn from studying your techniques here! I had been playing around briefly with a simpler "workload", namely just testing what took time in saving/loading some simple data structures:

using SnoopCompile, Profile, JLD2

function workload()
    jldsave("example.jld2";
                a=1, b=2.0,  c=[1.0,2.0], d=rand(ComplexF64,2000,2000),
                e=Dict(s => 2 for s in ["a", "b", "c"]))
    x = Vector{Any}(undef, 6)
    jldopen("example.jld2") do file
        x[1], x[2], x[3], x[4], x[5] = file["a"], file["b"], file["c"], file["d"], file["e"]
    end
    return x
end

tinf = @snoopi_deep @profile workload()

Followed by:

using PyPlot
mref, ax = pgdsgui(tinf; by=exclusive)

I revisited this now with your commit as comparison. Playing around with the slightly simpler workload highlighted many of the same methods as I can see you have touched here (write_data, write_dataset, read_scalar, eltype, commit, etc.) - but it also highlighted getproperty and setproperty, which surprised me a little. I'll need to study your tricks in the commit here to understand the actions you take to remedy the problems; but it's a very nice and complete case study - thanks again!

Thanks also a bunch for taking the time to explain further in #340, and in particular for pointing out that:

  • I should remove the precompile statements before trying the despecialization; makes a bunch of sense...
  • The by=exclusive kwarg to pgdsgui

One quick question (if you have the time): in #340, you had a line Profile.init(delay=0.01, n=10^7), i.e. you increase the time delay between backtraces - what's the motivation for this? Should one generally be reducing this for SnoopCompile, or not?

As a practical matter, I think this seems very worthwhile to do - and also to remove the precompile statements I had added (I saw your "prioritized steps list"; makes sense!).

thchr avatar Aug 09 '21 20:08 thchr

but also highlighted getproperty and setproperty which surprised me a little.

These you need to ignore. Every struct that has an obj.x access calls getproperty and constant-specializes them, but they show up as inference workloads. No good way around it short of making inference faster (which may be difficult/impossible), I'm afraid.
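For context, this is because field access is not primitive syntax in Julia: every `obj.x` lowers to a `getproperty` call with a literal `Symbol`, which inference then specializes on. A minimal illustration:

```julia
# Field access lowers to a getproperty call on a literal Symbol, and
# inference constant-propagates that Symbol for each distinct field.
struct Point
    x::Float64
    y::Float64
end

p = Point(1.0, 2.0)
p.x                  # lowers to getproperty(p, :x)
getproperty(p, :x)   # the same call, written explicitly
```

Each struct and each field name produces its own constant-specialized `getproperty` inference workload, which is why these entries accumulate in the pgdsgui output even though no user method is at fault.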

you increase the time delay between backtraces - what's the motivation for this?

Any time you want to collect more snapshots than there is room for in the profile buffer, you might consider this. I typically leave it where it is until a workload takes ~10s, and then I start dropping it. When running the DataFrames test suite I only take a snapshot every 100ms.
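As a concrete sketch of that last setting (the exact defaults vary by Julia version and platform, but the delay defaults to roughly 0.001 s on Unix):

```julia
using Profile

# Take a backtrace snapshot only every 100 ms, with room for 10^7
# samples, so a long-running test suite doesn't overflow the buffer.
Profile.init(delay = 0.1, n = 10^7)
```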

timholy avatar Aug 09 '21 20:08 timholy

Oh, and we should definitely restore precompilation eventually, but this is going to change a lot of MethodInstances, so it's best to re-run that analysis later. And I do find that the cross-Julia-version issues seem to go away if you do precompilation by executing a workload. It also neatly handles 32-bit vs 64-bit CPU architectures, something that's a really hard problem from the standpoint of precompile directives.

timholy avatar Aug 09 '21 21:08 timholy