BioStructures.jl icon indicating copy to clipboard operation
BioStructures.jl copied to clipboard

WIP: Add `.bcif` support

Open BradyAJohnston opened this issue 8 months ago • 6 comments

This is discussed somewhat in #45.

Still very much a WIP, but has initial support for parsing a .bcif file into a MolecularStructure. Based on the BCIF Spec and the biotite implementation.

This is my first sizeable Julia project, so I would like feedback on ways things could be better approached in a more "Julia-esque" way. Currently the .bcif file is unpacked using MsgPack and just left as a dictionary that is then processed and decoded at different steps. Should the different sub-dictionaries instead be turned into custom structs for multiple dispatch?

Read function is bare-bones but works. I am still getting some errors for the construction of the MolecularStructure for some files which I am not sure how we approach them:

TODOS:

  • [ ] Refactor multiple dispatch for decoding into single larger function
  • [x] DSSP / Stride support
  • [ ] extract SS information when opening
  • [x] use MsgPack.unpack(io, BCIFDict)
  • [x] cleanup superfluous function definitions and dictionary lookups

Benchmarks

Some initial bencharks that I'll make some plots for & be more comprehensive when closer to being finished. Benchmarks run with julia --threads 8 as the .bcif columns are returned all together so can be decoded in parallel.

Unsurprisingly it is slightly slower on the small structures, with minimal difference in file sizes on disk. For the larger structures it ends up being about ~2x faster to return the MolecularStructure and with 1/2 the memory required for reading. Decoding the file into a dictionary for the largest structure tested (8J07) only takes ~400 ms with 1.5 GB memory (vs 6.5 s & 4 GB for returning full structure) with the remaining ~6 s being the construction of the MolecularStructure.

1AKE

CIF Size: 427 KB

julia> @benchmark read("1AKE.cif", MMCIFFormat)
BenchmarkTools.Trial: 483 samples with 1 evaluation per sample.
 Range (min … max):   9.590 ms …  18.404 ms  ┊ GC (min … max): 0.00% … 46.39%
 Time  (median):      9.969 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.352 ms ± 899.304 μs  ┊ GC (mean ± σ):  3.96% ±  6.87%

     ▃█▇▅▅▂▂▂                                                   
  ▃▄▇████████▄▄▃▄▃▂▂▂▁▁▃▁▁▁▁▁▁▁▁▁▃▃▂▃▃▃▄▃▄▄▃▄▄▂▄▃▃▄▃▃▂▃▁▃▂▁▁▂▃ ▃
  9.59 ms         Histogram: frequency by time         12.6 ms <

BCIF Size: 224 KB

julia> @benchmark read("1AKE.bcif", BCIFFormat)
BenchmarkTools.Trial: 214 samples with 1 evaluation per sample.
 Range (min … max):  22.036 ms … 33.331 ms  ┊ GC (min … max): 0.00% … 28.67%
 Time  (median):     22.674 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.419 ms ±  1.502 ms  ┊ GC (mean ± σ):  3.42% ±  5.25%

    ▂▂▁█▁                                                      
  ▄▆█████▆▄▃▃▂▂▃▃▃▂▃▁▁▁▃▁▁▃▃▃▆▄▄▄▃▅▃▃▁▁▃▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▃
  22 ms           Histogram: frequency by time        28.3 ms <

 Memory estimate: 15.89 MiB, allocs estimate: 277943.

1CD3

CIF Size: 1 MB

julia> @benchmark read("1CD3.cif", MMCIFFormat)
BenchmarkTools.Trial: 194 samples with 1 evaluation per sample.
 Range (min … max):  23.517 ms … 35.699 ms  ┊ GC (min … max): 0.00% … 29.85%
 Time  (median):     24.963 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.872 ms ±  2.146 ms  ┊ GC (mean ± σ):  5.94% ±  7.00%

  ▁█▅▇▅   ▄▁                   ▁                               
  █████▆▇███▇▇▅▅▃▅▄▃▁▁▄▇▃▅▆▆▆▅▄█▄█▃▆▄▄▄▆▅▄▃▄▃▁▃▃▄▁▁▁▁▃▁▃▁▃▃▁▃ ▃
  23.5 ms         Histogram: frequency by time        31.6 ms <

 Memory estimate: 26.06 MiB, allocs estimate: 336460.

BCIF Size: 316 KB

julia> @benchmark read("1CD3.bcif", BCIFFormat)
BenchmarkTools.Trial: 187 samples with 1 evaluation per sample.
 Range (min … max):  22.042 ms … 39.896 ms  ┊ GC (min … max): 0.00% … 7.66%
 Time  (median):     26.303 ms              ┊ GC (median):    6.11%
 Time  (mean ± σ):   26.707 ms ±  2.701 ms  ┊ GC (mean ± σ):  5.17% ± 4.21%

             █▁▄▅ ▃▄ ▆▄▂  ▂▁                                   
  ▄▁▇▇▁▅▅▅▇▆▆████▆██▄███▆▅██▆▅▃▁▅▁▄▆▄▇▁▅▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃ ▃
  22 ms           Histogram: frequency by time        36.5 ms <

 Memory estimate: 24.05 MiB, allocs estimate: 431307.

7CGO

CIF Size: 36.3 MB

julia> @benchmark read("7CGO.cif", MMCIFFormat)
BenchmarkTools.Trial: 5 samples with 1 evaluation per sample.
 Range (min … max):  1.067 s …   1.254 s  ┊ GC (min … max):  8.48% … 22.51%
 Time  (median):     1.159 s              ┊ GC (median):    12.82%
 Time  (mean ± σ):   1.172 s ± 78.064 ms  ┊ GC (mean ± σ):  15.69% ±  5.87%

  █                    █     █                         █  █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█ ▁
  1.07 s         Histogram: frequency by time        1.25 s <

 Memory estimate: 841.24 MiB, allocs estimate: 11335780.

BCIF Size: 4.2 MB

julia> @benchmark read("7CGO.bcif", BCIFFormat)
BenchmarkTools.Trial: 9 samples with 1 evaluation per sample.
 Range (min … max):  530.286 ms … 604.980 ms  ┊ GC (min … max):  7.25% … 16.83%
 Time  (median):     572.595 ms               ┊ GC (median):    12.51%
 Time  (mean ± σ):   573.714 ms ±  20.675 ms  ┊ GC (mean ± σ):  12.92% ±  2.96%

  █                          █  █  ██    █    █    █          █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁██▁▁▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█ ▁
  530 ms           Histogram: frequency by time          605 ms <

 Memory estimate: 455.95 MiB, allocs estimate: 8453142.

8J07

CIF Size: 353.3 MB

julia> @benchmark read("8J07.cif", MMCIFFormat)
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 12.902 s (14.72% GC) to evaluate,
 with a memory estimate of 8.26 GiB, over 106536150 allocations.

BCIF Size 37.5 MB

julia> @benchmark read("8J07.bcif", BCIFFormat)
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 6.537 s (8.73% GC) to evaluate,
 with a memory estimate of 3.94 GiB, over 74444193 allocations.

Just decoding to Dict()

julia> @benchmark datablock_to_dict(MsgPack.unpack(read("8J07.bcif"))["dataBlocks"][1])
BenchmarkTools.Trial: 13 samples with 1 evaluation per sample.
 Range (min … max):  366.401 ms … 424.174 ms  ┊ GC (min … max): 23.95% … 26.69%
 Time  (median):     395.279 ms               ┊ GC (median):    31.13%
 Time  (mean ± σ):   396.415 ms ±  22.352 ms  ┊ GC (mean ± σ):  30.22% ±  3.46%

  █      █            ▁   ▁     ▁  ▁                   ▁▁  ▁  █  
  █▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁█▁▁█ ▁
  366 ms           Histogram: frequency by time          424 ms <

 Memory estimate: 1.74 GiB, allocs estimate: 19754142.

BradyAJohnston avatar May 09 '25 05:05 BradyAJohnston

Codecov Report

Attention: Patch coverage is 81.56425% with 33 lines in your changes missing coverage. Please review.

Project coverage is 94.24%. Comparing base (a507527) to head (080ad27).

Files with missing lines Patch % Lines
src/bcif.jl 80.47% 33 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #71      +/-   ##
==========================================
- Coverage   95.32%   94.24%   -1.09%     
==========================================
  Files          14       15       +1     
  Lines        2033     2205     +172     
==========================================
+ Hits         1938     2078     +140     
- Misses         95      127      +32     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar May 09 '25 05:05 codecov[bot]

Ah sorry for the auto-format spam. Will go through comments and now I know you're happy with overall I'll put together some tests / benchmarks as well.

BradyAJohnston avatar May 13 '25 08:05 BradyAJohnston

Thanks for the feedback @timholy and @jgreener64 . I've gone through with some refactoring and incorporated most of it and it's much cleaner. I've got very minor test currently, but over the coming days I'll continue to add more tests for BCIF-specific things. I've currently got the encode() functions commented out as I haven't worked on them at all since the initial write up.

BradyAJohnston avatar May 15 '25 07:05 BradyAJohnston

@timholy are you able to maybe explain if I should still be trying to attempt this? There has been a bit of a restructure, but with the decoding process there can be a different encoding type, but same final type (mostly with integers), so specifying the type I don't think would work. I might be misunderstanding this though.

function decode(typecode::TypeCode, data)
    if typecode == INT8
        return ...
    elseif typecode == INT16
        return ...
    ...
end

BradyAJohnston avatar May 15 '25 12:05 BradyAJohnston

Looking good! What I meant was the following:

function reencode1!(out, types::Vector{DataType}, vals)
    for (T, v) in zip(types, vals)
        push!(out, T(v))
    end
    return out
end

function reencode2!(out, typecodes::Vector{Int}, vals)
    for (nb, v) in zip(typecodes, vals)
        if nb == 8
            push!(out, Int8(v))
        elseif nb == 16
            push!(out, Int16(v))
        elseif nb == 32
            push!(out, Int32(v))
        elseif nb == 64
            push!(out, Int64(v))
        end
    end
    return out
end

vals = 1:16
types = repeat([Int8, Int16, Int32, Int64], 4)
typecodes = repeat([8, 16, 32, 64], 4)
out = sizehint!(Integer[], length(vals))

using BenchmarkTools

@btime reencode1!($out, $types, $vals) setup=(empty!(out));
@btime reencode2!($out, $typecodes, $vals) setup=(empty!(out));

Result:

julia> include("bench.jl")
  1.300 μs (0 allocations: 0 bytes)
  74.794 ns (0 allocations: 0 bytes)

As you can see, the second mostly-type-stable version is much faster.

timholy avatar May 15 '25 20:05 timholy

Okay thanks for the additional info on that. I'll have a play around with some benchmarking and see what I can come up with.

BradyAJohnston avatar May 16 '25 00:05 BradyAJohnston