WIP: Add `.bcif` support
This is discussed somewhat in #45.
Still very much a WIP, but has initial support for parsing a .bcif file into a MolecularStructure. Based on the BCIF Spec and the biotite implementation.
This is my first sizeable Julia project, so I would like feedback on ways things could be better approached in a more "Julia-esque" way. Currently the .bcif file is unpacked using MsgPack and just left as a dictionary that is then processed and decoded at different steps. Should the different sub-dictionaries instead be turned into custom structs for multiple dispatch?
Read function is bare-bones but works. I am still getting some errors for the construction of the MolecularStructure for some files which I am not sure how we approach them:
TODOS:
- [ ] Refactor multiple dispatch for decoding into single larger function
- [x] DSSP / Stride support
- [ ] extract SS information when opening
- [x] use
MsgPack.unpack(io, BCIFDict) - [x] cleanup superfluous function definitions and dictionary lookups
Benchmarks
Some initial bencharks that I'll make some plots for & be more comprehensive when closer to being finished. Benchmarks run with julia --threads 8 as the .bcif columns are returned all together so can be decoded in parallel.
Unsurprisingly it is slightly slower on the small structures, with minimal difference in file sizes on disk. For the larger structures it ends up being about ~2x faster to return the MolecularStructure and with 1/2 the memory required for reading. Decoding the file into a dictionary for the largest structure tested (8J07) only takes ~400 ms with 1.5 GB memory (vs 6.5 s & 4 GB for returning full structure) with the remaining ~6 s being the construction of the MolecularStructure.
1AKE
CIF Size: 427 KB
julia> @benchmark read("1AKE.cif", MMCIFFormat)
BenchmarkTools.Trial: 483 samples with 1 evaluation per sample.
Range (min … max): 9.590 ms … 18.404 ms ┊ GC (min … max): 0.00% … 46.39%
Time (median): 9.969 ms ┊ GC (median): 0.00%
Time (mean ± σ): 10.352 ms ± 899.304 μs ┊ GC (mean ± σ): 3.96% ± 6.87%
▃█▇▅▅▂▂▂
▃▄▇████████▄▄▃▄▃▂▂▂▁▁▃▁▁▁▁▁▁▁▁▁▃▃▂▃▃▃▄▃▄▄▃▄▄▂▄▃▃▄▃▃▂▃▁▃▂▁▁▂▃ ▃
9.59 ms Histogram: frequency by time 12.6 ms <
BCIF Size: 224 KB
julia> @benchmark read("1AKE.bcif", BCIFFormat)
BenchmarkTools.Trial: 214 samples with 1 evaluation per sample.
Range (min … max): 22.036 ms … 33.331 ms ┊ GC (min … max): 0.00% … 28.67%
Time (median): 22.674 ms ┊ GC (median): 0.00%
Time (mean ± σ): 23.419 ms ± 1.502 ms ┊ GC (mean ± σ): 3.42% ± 5.25%
▂▂▁█▁
▄▆█████▆▄▃▃▂▂▃▃▃▂▃▁▁▁▃▁▁▃▃▃▆▄▄▄▃▅▃▃▁▁▃▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▃
22 ms Histogram: frequency by time 28.3 ms <
Memory estimate: 15.89 MiB, allocs estimate: 277943.
1CD3
CIF Size: 1 MB
julia> @benchmark read("1CD3.cif", MMCIFFormat)
BenchmarkTools.Trial: 194 samples with 1 evaluation per sample.
Range (min … max): 23.517 ms … 35.699 ms ┊ GC (min … max): 0.00% … 29.85%
Time (median): 24.963 ms ┊ GC (median): 0.00%
Time (mean ± σ): 25.872 ms ± 2.146 ms ┊ GC (mean ± σ): 5.94% ± 7.00%
▁█▅▇▅ ▄▁ ▁
█████▆▇███▇▇▅▅▃▅▄▃▁▁▄▇▃▅▆▆▆▅▄█▄█▃▆▄▄▄▆▅▄▃▄▃▁▃▃▄▁▁▁▁▃▁▃▁▃▃▁▃ ▃
23.5 ms Histogram: frequency by time 31.6 ms <
Memory estimate: 26.06 MiB, allocs estimate: 336460.
BCIF Size: 316 KB
julia> @benchmark read("1CD3.bcif", BCIFFormat)
BenchmarkTools.Trial: 187 samples with 1 evaluation per sample.
Range (min … max): 22.042 ms … 39.896 ms ┊ GC (min … max): 0.00% … 7.66%
Time (median): 26.303 ms ┊ GC (median): 6.11%
Time (mean ± σ): 26.707 ms ± 2.701 ms ┊ GC (mean ± σ): 5.17% ± 4.21%
█▁▄▅ ▃▄ ▆▄▂ ▂▁
▄▁▇▇▁▅▅▅▇▆▆████▆██▄███▆▅██▆▅▃▁▅▁▄▆▄▇▁▅▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃ ▃
22 ms Histogram: frequency by time 36.5 ms <
Memory estimate: 24.05 MiB, allocs estimate: 431307.
7CGO
CIF Size: 36.3 MB
julia> @benchmark read("7CGO.cif", MMCIFFormat)
BenchmarkTools.Trial: 5 samples with 1 evaluation per sample.
Range (min … max): 1.067 s … 1.254 s ┊ GC (min … max): 8.48% … 22.51%
Time (median): 1.159 s ┊ GC (median): 12.82%
Time (mean ± σ): 1.172 s ± 78.064 ms ┊ GC (mean ± σ): 15.69% ± 5.87%
█ █ █ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█ ▁
1.07 s Histogram: frequency by time 1.25 s <
Memory estimate: 841.24 MiB, allocs estimate: 11335780.
BCIF Size: 4.2 MB
julia> @benchmark read("7CGO.bcif", BCIFFormat)
BenchmarkTools.Trial: 9 samples with 1 evaluation per sample.
Range (min … max): 530.286 ms … 604.980 ms ┊ GC (min … max): 7.25% … 16.83%
Time (median): 572.595 ms ┊ GC (median): 12.51%
Time (mean ± σ): 573.714 ms ± 20.675 ms ┊ GC (mean ± σ): 12.92% ± 2.96%
█ █ █ ██ █ █ █ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁██▁▁▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█ ▁
530 ms Histogram: frequency by time 605 ms <
Memory estimate: 455.95 MiB, allocs estimate: 8453142.
8J07
CIF Size: 353.3 MB
julia> @benchmark read("8J07.cif", MMCIFFormat)
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
Single result which took 12.902 s (14.72% GC) to evaluate,
with a memory estimate of 8.26 GiB, over 106536150 allocations.
BCIF Size 37.5 MB
julia> @benchmark read("8J07.bcif", BCIFFormat)
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
Single result which took 6.537 s (8.73% GC) to evaluate,
with a memory estimate of 3.94 GiB, over 74444193 allocations.
Just decoding to Dict()
julia> @benchmark datablock_to_dict(MsgPack.unpack(read("8J07.bcif"))["dataBlocks"][1])
BenchmarkTools.Trial: 13 samples with 1 evaluation per sample.
Range (min … max): 366.401 ms … 424.174 ms ┊ GC (min … max): 23.95% … 26.69%
Time (median): 395.279 ms ┊ GC (median): 31.13%
Time (mean ± σ): 396.415 ms ± 22.352 ms ┊ GC (mean ± σ): 30.22% ± 3.46%
█ █ ▁ ▁ ▁ ▁ ▁▁ ▁ █
█▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁█▁▁█ ▁
366 ms Histogram: frequency by time 424 ms <
Memory estimate: 1.74 GiB, allocs estimate: 19754142.
Codecov Report
Attention: Patch coverage is 81.56425% with 33 lines in your changes missing coverage. Please review.
Project coverage is 94.24%. Comparing base (
a507527) to head (080ad27).
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/bcif.jl | 80.47% | 33 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #71 +/- ##
==========================================
- Coverage 95.32% 94.24% -1.09%
==========================================
Files 14 15 +1
Lines 2033 2205 +172
==========================================
+ Hits 1938 2078 +140
- Misses 95 127 +32
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
Ah sorry for the auto-format spam. Will go through comments and now I know you're happy with overall I'll put together some tests / benchmarks as well.
Thanks for the feedback @timholy and @jgreener64 . I've gone through with some refactoring and incorporated most of it and it's much cleaner. I've got very minor test currently, but over the coming days I'll continue to add more tests for BCIF-specific things. I've currently got the encode() functions commented out as I haven't worked on them at all since the initial write up.
@timholy are you able to maybe explain if I should still be trying to attempt this? There has been a bit of a restructure, but with the decoding process there can be a different encoding type, but same final type (mostly with integers), so specifying the type I don't think would work. I might be misunderstanding this though.
function decode(typecode::TypeCode, data) if typecode == INT8 return ... elseif typecode == INT16 return ... ... end
Looking good! What I meant was the following:
function reencode1!(out, types::Vector{DataType}, vals)
for (T, v) in zip(types, vals)
push!(out, T(v))
end
return out
end
function reencode2!(out, typecodes::Vector{Int}, vals)
for (nb, v) in zip(typecodes, vals)
if nb == 8
push!(out, Int8(v))
elseif nb == 16
push!(out, Int16(v))
elseif nb == 32
push!(out, Int32(v))
elseif nb == 64
push!(out, Int64(v))
end
end
return out
end
vals = 1:16
types = repeat([Int8, Int16, Int32, Int64], 4)
typecodes = repeat([8, 16, 32, 64], 4)
out = sizehint!(Integer[], length(vals))
using BenchmarkTools
@btime reencode1!($out, $types, $vals) setup=(empty!(out));
@btime reencode2!($out, $typecodes, $vals) setup=(empty!(out));
Result:
julia> include("bench.jl")
1.300 μs (0 allocations: 0 bytes)
74.794 ns (0 allocations: 0 bytes)
As you can see, the second mostly-type-stable version is much faster.
Okay thanks for the additional info on that. I'll have a play around with some benchmarking and see what I can come up with.