JLD2.jl
RFC: Improve some type definitions and decrease specialization
This PR is an alternative to #340, although it would be possible to
add precompiles to this in a later PR. In general, I've found that the
first and most impactful step is to analyze specialization costs using
SnoopCompile's `pgdsgui` tool. Another reason to do this first is
that it affects the MethodInstances and hence the specific
precompiles that get generated.
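For concreteness, a minimal sketch of that analysis step looks something like this (the calls mirror the pgdsgui workflow shown later in this thread; PyPlot is just one possible plotting backend), run over the same test workload used for the timings below:
using SnoopCompile, Profile, JLD2

# Collect inference timings (plus runtime profile data) for a representative workload
tinf = @snoopi_deep @profile begin
    include("test/rw.jl")
    include("test/loadsave.jl")
end

using PyPlot                            # pgdsgui needs a plotting backend
mref, ax = pgdsgui(tinf; by=exclusive)  # inference cost vs. number of specializations, per method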
Using the following benchmark,
@time begin
    include("test/rw.jl")
    include("test/loadsave.jl")
end
I get the following timings:
- master prior to merge of #310: 65.2s
- master: 61.6s (so yes, it helped)
- this PR: 47.4s
That said, there are some concerns. Some benchmarks---notably, the "one_element_arrays" benchmarks---are substantially slower. This can be fixed with the following diff:
diff --git a/src/datasets.jl b/src/datasets.jl
index 803c7b5..6678cd6 100644
--- a/src/datasets.jl
+++ b/src/datasets.jl
@@ -160,7 +160,7 @@ end
# Most types can only be scalars or arrays
function read_data(f::JLDFile,
- @nospecialize(rr),
+ rr,
read_dataspace::Tuple{ReadDataspace,RelOffset,Int,UInt16},
attributes::Union{Vector{ReadAttribute},Nothing}=nothing)
@@ -215,7 +215,6 @@ function read_data(f::JLDFile,
ReadRepresentation{T,CustomSerialization{S,nothing}} where {S,T}},
read_dataspace::Tuple{ReadDataspace,RelOffset,Int,UInt16},
attributes::Vector{ReadAttribute})
- @nospecialize rr
dataspace, header_offset, data_length, filter_id = read_dataspace
filter_id != 0 && throw(UnsupportedFeatureException())
dataspace.dataspace_type == DS_NULL || throw(UnsupportedFeatureException())
@@ -253,7 +252,7 @@ function read_empty(rr::ReadRepresentation{T}, f::JLDFile,
jlread(io, FixedPointDatatype) == h5fieldtype(f, Int64, Int64, Val{true}) || throw(UnsupportedFeatureException())
seek(io, dimensions_attr.data_offset)
- v = construct_array(io, T, ndims)::Array{T}
+ v = construct_array(io, T, ndims)
if isconcretetype(T)
# Performance-wise, it would be better to move this to a separate function (if performance here matters) since inference can't
# know the dimensionality of `v`
@@ -291,7 +290,7 @@ end
Construct array by reading `ndims` dimensions from `io`. Assumes `io` has already been
seeked to the correct position.
"""
-function construct_array(io::IO, @nospecialize(T::Type), ndims::Int)
+function construct_array(io::IO, ::Type{T}, ndims::Int)::Array{T} where T
if ndims == 1
n = jlread(io, Int64)
Vector{T}(undef, n)
@@ -312,7 +311,7 @@ end
function read_array(f::JLDFile, dataspace::ReadDataspace,
- @nospecialize(rr::ReadRepresentation), data_length::Int,
+ rr::ReadRepresentation, data_length::Int,
filter_id::UInt16, header_offset::RelOffset,
attributes::Union{Vector{ReadAttribute},Nothing})
T = eltype(rr)
@@ -321,7 +320,7 @@ function read_array(f::JLDFile, dataspace::ReadDataspace,
ndims, offset = get_ndims_offset(f, dataspace, attributes)
seek(io, offset)
- v = construct_array(io, T, Int(ndims))::Array{T}
+ v = construct_array(io, T, Int(ndims))
n = length(v)
seek(io, data_offset)
if filter_id !=0
@@ -345,10 +344,10 @@ end
function write_dataset(
f::JLDFile,
- @nospecialize(dataspace::WriteDataspace),
+ dataspace::WriteDataspace,
datatype::H5Datatype,
- @nospecialize(odr),
- @nospecialize(data::Array),
+ odr,
+ data::Array,
wsession::JLDWriteSession,
compress = f.compress,
)
but unfortunately that diff gives up most of the latency advantage.
There does appear to be a way to get the best of both worlds: adjust the dispatch chains so that `rr` is constructed at the last possible moment. Quite a few methods do some "generic" work while just passing `rr` down. If the call to `jltype` can be deferred (in the dispatch chain) until it's really needed, you'll compile fewer specializations and substantially reduce latency without compromising runtime performance.
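To sketch the shape of that refactor with a small self-contained toy (the names here - `Repr`, `outer_*`, `middle_*`, `inner` - are hypothetical stand-ins, not JLD2's actual internals): the generic layers pass the plain type along, and the type-parameterized object is only constructed right before the specialized kernel runs.
struct Repr{T} end   # stand-in for something like ReadRepresentation{T,...}

# Early construction: every layer in the chain specializes on Repr{T}
outer_early(io, ::Type{T}) where {T} = middle_early(io, Repr{T}())
middle_early(io, rr::Repr) = inner(io, rr)

# Late construction: the outer layers stay unspecialized and only pass the type down;
# Repr{T} is built at the last possible moment, at the boundary to the kernel
outer_late(io, @nospecialize(T::Type)) = middle_late(io, T)
middle_late(io, @nospecialize(T::Type)) = inner(io, Repr{T}())

inner(io, ::Repr{T}) where {T} = read(io, T)   # the only method that must specialize on T

# e.g. outer_late(IOBuffer(reinterpret(UInt8, [1.0, 2.0])), Float64) == 1.0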
It should also be said that this PR improves some benchmarks; for example, "int_vector serialize" drops from 3.6s to 1.4s. I did not test whether that is due to the improvements in the type definitions or to the fact that fewer specializations often means less runtime dispatch (if there is only one MethodInstance to call, Julia knows which one to call even when it can't infer the types).
Overall, I'd say it might be best to take this PR as a starting point rather than something that should be merged immediately, unless the one-element-array case isn't really very important.
CC @thchr
Codecov Report
Merging #344 (239b296) into master (cf18c24) will increase coverage by 4.02%. The diff coverage is 98.14%.
@@ Coverage Diff @@
## master #344 +/- ##
==========================================
+ Coverage 86.01% 90.04% +4.02%
==========================================
Files 28 27 -1
Lines 2810 2731 -79
==========================================
+ Hits 2417 2459 +42
+ Misses 393 272 -121
| Impacted Files | Coverage Δ |
|---|---|
| src/datasets.jl | 92.28% <93.33%> (+0.89%) :arrow_up: |
| src/JLD2.jl | 90.80% <100.00%> (+1.66%) :arrow_up: |
| src/compression.jl | 96.82% <100.00%> (ø) |
| src/data/reconstructing_datatypes.jl | 74.25% <100.00%> (+0.59%) :arrow_up: |
| src/data/specialcased_types.jl | 95.69% <100.00%> (+0.09%) :arrow_up: |
| src/data/writing_datatypes.jl | 96.90% <100.00%> (+0.32%) :arrow_up: |
| src/dataio.jl | 98.52% <100.00%> (+0.06%) :arrow_up: |
| src/dataspaces.jl | 92.59% <100.00%> (ø) |
| src/datatypes.jl | 82.51% <100.00%> (+2.09%) :arrow_up: |
| src/loadsave.jl | 95.23% <100.00%> (ø) |
| ... and 10 more | |
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
One could add a local version of `inferencebarrier` to support Julia 1.0: https://github.com/JuliaLang/julia/blob/4931faa34a8a1c98b39fb52ed4eb277729120128/base/essentials.jl#L758
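A minimal sketch of such a fallback, mirroring the Ref-based trick in the linked Base definition (exact placement in the package is up for grabs):
# Use Base's inferencebarrier when available; otherwise hide the value behind a
# Ref{Any} so inference only sees `Any` for the result
if isdefined(Base, :inferencebarrier)
    const inferencebarrier = Base.inferencebarrier
else
    inferencebarrier(@nospecialize(x)) = Ref{Any}(x)[]
end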
Hi @timholy, thanks so much for looking into this. Your suggestions look promising. Right now, I don't have the time to pursue this in depth, but I'll come back to it in a few weeks. If anyone else wants to play with it in the meantime, I'm available for discussion.
Many thanks from here as well, @timholy - there's a bunch to learn from studying your techniques here! I had been playing around briefly with a simpler "workload", namely just testing what took time in saving/loading some simple data structures:
using SnoopCompile, Profile, JLD2
function workload()
    jldsave("example.jld2";
            a=1, b=2.0, c=[1.0,2.0], d=rand(ComplexF64,2000,2000),
            e=Dict(s => 2 for s in ["a", "b", "c"]))
    x = Vector{Any}(undef, 5)
    jldopen("example.jld2") do file
        x[1], x[2], x[3], x[4], x[5] = file["a"], file["b"], file["c"], file["d"], file["e"]
    end
    return x
end
tinf = @snoopi_deep @profile workload()
Followed by:
using PyPlot
mref, ax = pgdsgui(tinf; by=exclusive)
I revisited this now with your commit as a comparison. Playing around with the slightly simpler workload highlighted many of the same methods you have touched here (`write_data`, `write_dataset`, `read_scalar`, `eltype`, `commit`, etc.) - but it also highlighted `getproperty` and `setproperty!`, which surprised me a little. I'll need to study your tricks in the commit here to understand the actions you take to remedy the problems; but it's a very nice and complete case study - thanks again!
Thanks also a bunch for taking the time to explain further in #340, and in particular for pointing out that:
- I should remove the precompile statements before trying the despecialization; makes a bunch of sense...
- The `by=exclusive` kwarg to `pgdsgui`
One quick question (if you have the time): in #340, you had a line `Profile.init(delay=0.01, n=10^7)`, i.e. you increase the time delay between backtraces - what's the motivation for this? Should one generally be reducing this for SnoopCompile, or not?
As a practical matter, I think this seems very worthwhile to do - and also to remove the precompile statements I had added (I saw your "prioritized steps list"; makes sense!).
> but it also highlighted `getproperty` and `setproperty!`, which surprised me a little.
These you need to ignore. Every struct that has an `obj.x` access calls `getproperty` and constant-specializes those calls, so they show up as inference workloads. There's no good way around it short of making inference faster (which may be difficult/impossible), I'm afraid.
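(Purely illustrative - this is just the field-access lowering being described:)
struct Point
    x::Float64
    y::Float64
end

p = Point(1.0, 2.0)
p.x                  # lowers to getproperty(p, :x)
getproperty(p, :x)   # the Symbol argument is constant-propagated, so each distinct
                     # field access shows up as its own small inference workload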
> you increase the time delay between backtraces - what's the motivation for this?
Any time you want to collect more snapshots than there is room for in the profile buffer, you might consider this. I typically leave it where it is until a workload takes ~10s, and then I start dropping the sampling rate. When running the DataFrames test suite I only take a snapshot every 100ms.
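Concretely, that kind of adjustment looks like the following (the numbers are illustrative):
using Profile
# The default delay is roughly 1 ms between samples. For a long-running workload,
# sample less often (and/or enlarge the buffer) so the buffer doesn't fill up:
Profile.init(delay=0.1, n=10^7)   # one snapshot every 100 ms, 10^7 buffer slots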
Oh, and we should definitely restore precompilation eventually, but this is going to change a lot of MethodInstances, so it's best to re-run that analysis later. And I do find that the cross-Julia-version issues seem to go away if you do precompilation by executing a workload. You also neatly handle 32-bit vs. 64-bit CPU architectures, which is a really hard problem from the standpoint of precompile directives.
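As a hypothetical sketch, workload-based precompilation can be as simple as running a tiny save/load round-trip at the end of the package module, so `jldsave`/`jldopen` are in scope (the helper name and the data chosen here are made up):
# Hypothetical: run a small representative workload at precompile time so the
# MethodInstances it needs get compiled and cached for the running Julia
# version and word size
function _precompile_workload()
    mktempdir() do dir
        path = joinpath(dir, "precompile.jld2")
        jldsave(path; a=1, b=[1.0, 2.0], c=Dict("x" => 2))
        jldopen(path) do f
            f["a"]; f["b"]; f["c"]
        end
    end
end
_precompile_workload()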