BioSequences.jl Design of hash(::BioSequence). What to do?

Design of hash(::BioSequence). What to do?

Open jakobnissen opened this issue 2 years ago • 18 comments

So the design of BioSequences makes it difficult to implement efficient and correct hashing. We want efficient hashing, because hashing underlies operations like putting stuff in a Set or Dict, which users expect to be fast.

The issue is

Julia requires that isequal(a, b) === isequal(hash(a, x), hash(b, x))
isequal ought to allow objects of different types to be equal if it represents the same value. E.g. isequal(0, 0.0). Breaking this leads to lots of confusion.

So, to follow these two rules

Different subtypes of BioSequence should be isequal if they have the same content, i.e. isequal(dna"TAG", DNAKmer("TAG"))
Which implies they should hash equally,

Now, how do we get two BioSequences with arbitrary encoding to hash equivalently? As I see it, it means we can't hash the encoded data, because the encoded data may vary between subtypes. However, this is presumably the only way to avoid decoding, which is the only way to make hashing fast!

Incidentally, the current implementation of hash for LongSequence is broken:

julia> (a, b) = (LongDNA{4}("A"), LongDNA{2}("A"));

julia> a in [b] # because they are equal
true

julia> a in Set([b]) # hashes wrong
false

Here are some possible solutions:

Just implement hashing by hashing every element. This will make hashing significantly slower (in tests, ~75 times slower), but it will be simple and correct.
Make one encoding privileged, e.g. LongSequence's. When hashing any other type, we re-code it to that encoding before hashing. This will keep LongSequence hashing fast, but will make everything else both slower and much more complex. If we do this, we also need to recode LongSequence{NucleicAcidAlphabet{2}}.

There may be other solutions.

Jul 05 '22 08:07 jakobnissen

Incidentally, the current implementation of hash for LongSequence is broken

:grimacing: This is not good. We might want a separate issue to get a bug-fix on that specifically.

Here are some possible solutions:

Regarding this, I have a question - is changing the content of a hash breaking, or only the behavior. That is, as long as we fulfill the hashing contract(s), if we change the underlying behavior to improve speed, is that breaking?

If the answer is no, I propose that we do (1) as a stop-gap to fix the current bug, since it's simple, and then we can work on further optimization separately.

isequal ought to allow objects of different types to be equal if it represents the same value. E.g. isequal(0, 0.0). Breaking this leads to lots of confusion.

Isn't there some nuance about the difference between == and isequal? Do we need to worry about that?

Another (possibly orthogonal) question - do we want a kmer of a sequence to be isequal with a LongDNA of the same sequence? What about an RNA with the same bases? (I'm not certain on the former, I think probably not on the later).

I am interested in this topic because I remain interested in MinHash and the things that it enables (there's a bunch of neat microbiome work on this by Titus Brown), but I'm not super knowledgeable about the computer science of hashing. Happy to be involved though.

Jul 05 '22 13:07 kescobo

I'm pretty sure changing the value of a hash is not breaking. That's fine. Yeah, let's do option 1 as a stop-gap.

W.r.t the difference between == and isequal - yep, you're right, that's a different option: We could have different sequence types with semantially identical content not be isequal. isequal controls:

The behaviour of in (okay this is more complex, but it OUGHT to control the behaviour of in)
The behaviour of sets and dicts
The behaviour of sorting

So if we choose that option, it implies that we can have:

julia> DNAKmer{"ATG"} == dna"ATG"
true

julia> DNAKmer("ATG") in [dna"ATG"]
false

which is also weird.

Jul 05 '22 14:07 jakobnissen

Okay here are some before/after benchmarks:

length  before  after
25     20 ns    69 ns
10000   0.80 us 22.04 us

So 3x for short seqs, ~25x for long sequences. Sigh.

To speed it up, the only way I can think of is this:

For generic BioSequence, we can't do much, so just keep behaviour
LongSequence is canonical, reintroduce old behaviour. For nucleic acids, only those with 4-bit alphabets are canonical.
For 2-bit LongSequences, we can do clever bit-tricks to encode their data elements on the fly to 4-bit encodings, 32 bits at a time.
For all other biosequences (currently SubSequence and Kmer), we implement an iterator which streams the content of the sequence in a series of UInt64 which correspond to the encoding of the equivalent LongSequence. E.g. for SubSeq, this is simply the existing encoding, but bitshifted to remove the offset. For Kmer, this involves bitshifting, then reversing the elements within each 64-bit chunk.

It's doable, but quite a bit of work.

Jul 08 '22 09:07 jakobnissen

And all of this would have to be done at the same time to maintain the contract? That is, we can't deal with point 2 first, then handle point 3 and point 4 later?

Jul 08 '22 17:07 kescobo

Unfortunately, all have to be implemented simultaneously

Jul 16 '22 13:07 jakobnissen

The problem with the suggestion above is that the only BioSequences I can imagine anyone wanting to hash in large numbers (and thus where the performance really matters) are Kmers. Hashing of Kmers needs to be really fast.

Aug 06 '22 04:08 ian-small

Excellent point, you are right. We should do the iterator thing above, but prioritise kmers - that should be extremely fast.

Aug 06 '22 06:08 jakobnissen

Current hashing behaviour of Kmers is:

# TODO: Ensure this is the right way to go.
# See https://github.com/BioJulia/BioSequences.jl/pull/121#discussion_r475234270
Base.hash(x::Kmer{A,K,N}, h::UInt) where {A,K,N} = hash(x.data, h ⊻ K)

Which so it just hashes the underlying Tuple. which is implemented like so:

hash(t::Tuple, h::UInt) = hash(t[1], hash(tail(t), h))
function hash(t::Any32, h::UInt)
    out = h + tuplehash_seed
    for i = length(t):-1:1
        out = hash(t[i], out)
    end
    return out
end

Sep 03 '22 22:09 TransGirlCodes

Also maybe the descisions we make here should be added to the docs on the definition of a BioSequence, as it's really important we maintain this contract if people are going to write generic code and mix packages.

Sep 03 '22 23:09 TransGirlCodes

Trying to think of why this hasnt been a big problem for me already in practical setting, and the only answer I can think of is when use a sequence type with a given alphabet I use it for my entire scripts / application, and given Set and Dict are parametric on the sequence type, any hashing has all been with one consistent type, and so one partiular hashing scheme.

Sep 03 '22 23:09 TransGirlCodes

From discussion on uBioinfo Slack and @ctb:

yes, you were active on this issue! https://github.com/marbl/Mash/issues/27 it’s pretty easy if you have murmurhash! sourmash and mash use the same approach (mmh, seed=42) and just select the hashes differently for FracMinHash (sourmash) and MinHash (mash and sourmash), you can see an implementation of the key hashing code here https://github.com/sourmash-bio/sourmash/blob/latest/utils/compute-dna-mh-another-way.py and I’ll start a new sourmash issue to track this discussion, too.

Sep 14 '22 16:09 kescobo

On further consideration and re-reading the previous issue, this isn't super relevant for this issue - I don't think we want to pull in murmurhash as a dependency, and to be fully compatible we'd need to hash the ASCII, which doesn't make sense for Base.hash. It should be easy to put the compatibility functionality into a separate package.

Sep 14 '22 16:09 kescobo

BioSequences already implement murmur3 in the hash.jl file :)

Sep 14 '22 19:09 jakobnissen

So I've been trying to think about this for kmers, and it looks like on the fly translation of bit patterns seems sensible, and it makes sense in that promtion rules of two different alphabets - a two bit and a four bit one favour the alphabet that covers the larger set of symbols.

I'm struggling to come up with an elegant set of bit flips to do the 2bit -> 4 bit translation. The best I have so far is these two ideas:

const fourbitnucs = (UInt64(1), UInt64(2), UInt64(4), UInt64(8))

function translate_bits(bits::UInt64, from::DNAAlphabet{2}, to::DNAAlphabet{4})
    bits = bits
    tbits = zero(UInt64)
    @inbounds for i in 0:31
        element = (bits >> (i * 2)) & UInt64(3)
        newelement = foutbitnucs[element + one(UInt64)]
        tbits = (tbits << (i * 4)) | newelement
    end
end

function translate_bits2(bits::UInt64, from::DNAAlphabet{2}, to::DNAAlphabet{4})
    bits = bits
    tbits = zero(UInt64)
    @inbounds for i in 0:31
        element = bits & UInt64(3)
        newelement = fourbitnucs[element + one(UInt64)]
        tbits = (tbits << 4) | newelement
        bits = bits >> 2
    end
    return tbits
end

I'm not a fan of the loop though, I was hoping for a clever combination of shifts and masks as in the bit-parallel counting code.

EDIT: These functions are also wrong, as two UInt64's should be output for the single input UInt64, but you get where my thought processes were going with the translation.

EDIT EDIT: There is a 3rd option using a lookup table that accepts a UInt8 of 2-bit packed elements, and returns a UIn16 of those elements packed into 4 bits. This might be thebest option as indexing can be cheap, and by using whole bytes we can unpack 4 2-bit elements at once rather than how the functions above do one at a time, whilst keeping the lookup table fairly small.

Sep 17 '22 00:09 TransGirlCodes

A lookup table also has the advantage that plebs like me can understand it :laughing: (bit flipping still seems like black magic to me)

Sep 19 '22 13:09 kescobo

Here are some relevant bitpacking code:

@inline function four_to_two_bit(x::UInt64)::UInt32
    m1 = 0x1111111111111111
    m2 = m1 << 1
    m3 = m1 | m2
    y  = (x >> 3) & m1
    y |= (x >> 2) & m2
    y |= (x >> 1) & m3
    pack(y)
end

@inline function pack(x::UInt64)::UInt32
    m1 = 0x0f0f0f0f0f0f0f0f
    m2 = 0x00ff00ff00ff00ff
    m3 = 0x0000ffff0000ffff
    m4 = 0x00000000ffffffff
    x = (x & m1) | (x & ~m1) >> 2
    x = (x & m2) | (x & ~m2) >> 4
    x = (x & m3) | (x & ~m3) >> 8
    x = (x & m4) | (x & ~m4) >> 16
    x % UInt32
end

@inline function two_to_four_bits(x::UInt32)::UInt64
    m = 0x1111111111111111
    y = expand(x)
    z = (y & m) << 1 | (m & ~y)
    m2 = y & (m << 1)
    m2 = m2 | m2 >> 1
    (z & m2) << 2 | (z & ~m2)
end

@inline function expand(x::UInt32)::UInt64
    m1 = 0x000000000000ffff
    m2 = 0x000000ff000000ff
    m3 = 0x000f000f000f000f
    m4 = 0x0303030303030303
    y = x % UInt64
    y = (y & m1) | (y & ~m1) << 16
    y = (y & m2) | (y & ~m2) << 8
    y = (y & m3) | (y & ~m3) << 4
        (y & m4) | (y & ~m4) << 2
end

Oct 03 '22 08:10 jakobnissen

Like I said... :mage: :magic_wand:

Oct 03 '22 18:10 kescobo

I'm thinking the only solution is to just define isequal(::Kmer, ::BioSequence) = false and then let kmers do its own hashing thing.

Sep 25 '23 19:09 jakobnissen

BioSequences.jl BioSequences.jl copied to clipboard

Design of hash(::BioSequence). What to do?

BioSequences.jl
BioSequences.jl copied to clipboard