BitInformation.jl
BitInformation.jl copied to clipboard
Bitinformation of masked data
As requested by @aaronspring in #25, we need to extend the bitinformation
function with a method for masked arrays. This is an issue to discuss a potential implementation of this and to discuss progress. The methods we would need are
function bitinformation(A::AbstractArray{NF}, # input array A
mask::BitArray, # mask of A
kwargs...) where {T<:Union{Integer,AbstractFloat}}
Then I further suggest separating out the permutation for information along dimension dim
,
function roll_dim_forward(A::AbstractArray,dim::Int)
which is basically this bit https://github.com/milankl/BitInformation.jl/blob/fa2dd50171a48565ef1dc5a3c7268c19ffe844f1/src/mutual_information.jl#L66-L73
and moving the zeroing of insignificant information into its own function, like so
function set_insignificant_zero!(H::Vector,nelements::Int,confidence::Real)
which should basically include these lines https://github.com/milankl/BitInformation.jl/blob/fa2dd50171a48565ef1dc5a3c7268c19ffe844f1/src/mutual_information.jl#L48-L54
This issues will be addressed in #30
I've added more tests in #30 and merged the PR as all of them pass. I'd still love to see the impact if you could share your results @aaronspring at some point. Many thanks
Originally posted by @aaronspring in https://github.com/milankl/BitInformation.jl/issues/30#issuecomment-1080539133
@milankl I find differences whether using mask
or not, specially the first bits and within 99-100% information:
dim: 4 time
from https://gist.github.com/aaronspring/5de0bc6be5a8d547f3503ff8b1aef8c6
dim: 1 x
from https://gist.github.com/aaronspring/7b0675e36467e5c647f2fe3f546d4bf5
dim analysis:
from https://gist.github.com/aaronspring/52662dd885eebfd6ce4b88939be016c6
Thanks Aaron, that looks awesome. Good to see that all these information patches in the exponent bits disappear! For dissicos
could you convert your arrays to signed exponent bits? It looks like there's a bunch of exponent bits that simultaneously flips over (which happens when your data covers a range across floating-point 2)
julia> A = rand(Float32,3,3)
3×3 Matrix{Float32}:
0.408687 0.922863 0.0622634
0.838222 0.947521 0.141393
0.0944574 0.991588 0.576514
julia> signed_exponent(A)
3×3 Matrix{Float32}:
13.078 7.3829 127.515
6.70577 7.58017 18.0983
48.3622 7.93271 4.61211
It looks wrong, because Julia will interpret the exponent bits still as being biased (the information that the exponent bits are now to be interpreted differently is not stored), but you can always check that nothing went wrong by applying the inverse biased_exponent
. If you use signed_exponent
as a preprocessing step, note that also your mask will change
julia> signed_exponent([-9e-33])
1-element Vector{Float64}:
-4.739053125085073e32
Originally posted by @milankl in https://github.com/milankl/BitInformation.jl/issues/30#issuecomment-1080575806
as you say, signed_exponent
added makes the black bar in dissicos
disappear
Perfect, that looks really good to me. Now I'd keep these mantissa bits and then redo any analysis that you'd otherwise do with the full precision data set. Feel free to post any of that here.
We haven't addressed the interpolation problem due to the ocean-atmosphere coupling, I'd be curious whether this has any consquences.
could you point me to an example how to save a NetCDF
inside julia with one of your compressions? I only find sizeof
but no writing to disk
Use NetCDF.jl, but switch lossless compression on, like here. If you then pass on arrays with rounded mantissas via round(::Array,keepbits::Int)
from BitInformation.jl you'll get it accordingly compressed. Alternatively, use nco like
ncks --baa=8 --ppc varname=3 file.nc file_compressed.nc
where baa=8
is bitrounding, varname=3
means 3 mantissa bits to keep, varname
is the name of the nc variables, then input and output file. See this thread too https://github.com/nco/nco/issues/250#issuecomment-1005268313
Branch #main now contains bitinformation(::AbstractArray{T};masked_value::T) where T
so that you should be able to do directly
julia> using BitInformation
julia> A = rand(Float32,3,3);
julia> round!(A,1)
3×3 Matrix{Float32}:
0.5 0.1875 0.25
1.0 0.046875 0.0625
0.25 0.5 0.75
julia> bitinformation(A,masked_value=0.5f0)
note that masked_value has to be of the same eltype
as the array. That means if eltype(A)=Float32
then masked_value=0.5
will throw an error. However, I've used ===
to create the mask, so once #34 is merged you can even have NaNs in your data and simply use bitinformation(A,masked_value=NaN)
(or NaN32) although == comparison for NaN always yields false 😉
v0.5 got merged into the Julia registry now. Will leave this issue open to eventually pick some of the posts here for the documentation.
Now I'd keep these mantissa bits and then redo any analysis that you'd otherwise do with the full precision data set. Feel free to post any of that here.
now with the masking enabled, I am quite happy with the bitrounding results: compared mean and std on monthly and annualized output as well as variance-weighted mean period as my variability analysis. https://gist.github.com/aaronspring/383dbbfe31baa4618c5b0dbef4f6d574
https://gist.github.com/aaronspring/383dbbfe31baa4618c5b0dbef4f6d574
Many thanks for sharing this. Makes me very happy to see that (once we've sorted out the masked array issue, which I wanted to do for a long time anyway!) we basically have another example where the bitwise real information content-approach just works. Sure you are left with the choice of 99% ... to 100% of information, but I believe that this will always be the case and in general dependent on your storage requirements. E.g. is 6x compression enough or will 10x be a game changer (because you can store more ensemble members, a higher temporal resolution, etc)? Is it data that should be kept for archiving purposes but won't be touched much?
If you have any other remarks, suggestions, concerns etc. feel free to reach out here!
Is it data that should be kept for archiving purposes but won't be touched much?
I just need to archive some data and probably wont touch it again.
But I think the potential for bitrounding is enormous. I will soon try it on high spatial resolution data. I expect much higher compression gains there. I am going to pitch my results in our ocean group meeting in early may.
But I think the potential for bitrounding is enormous.
I agree 😄 Especially once netcdf supports more lossless codecs, I believe zstd is soon-ish going to be supported (discussion here https://github.com/Unidata/netcdf-c/discussions/2227) which likely yields even higher compression factors in the "archiving won't touch much" case, or higher performance in the "we'll read this data every day" case.