BioSequences.jl Discussion: Make `LongSequence` fixed-length?

This is obviously a breaking change, so this is for the far future, if it ever happens.

Currently, LongSequence is resizable. It supports operations like:

julia> seq = dna"TAG";

julia> push!(seq, DNA_A)
4nt DNA Sequence:
TAGA

julia> append!(seq, rna"UAGA")
8nt DNA Sequence:
TAGATAGA

My proposal is to remove all methods on LongSequence that change its size. This includes: resize!, pop!, popfirst!, push!, pushfirst!, filter!, deleteat!, and resizing during copy!.

Disadvantages of proposal

The disadvantages are obvious: Some users may want to do these operations. With my proposed suggestion, users would instead need to use immutable operations, just like if they were working with strings. But: How often do people actually use these operations? I would guess that they are not used very often.
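To make "immutable operations" concrete, here is a minimal sketch of non-resizing equivalents for a few of the methods listed above. It assumes BioSequences v3, where range indexing returns a new sequence and `*` concatenates sequences of the same type; the variable names are only illustrative.

```julia
using BioSequences

seq = dna"TAG"

# Instead of push!(seq, DNA_A): concatenate into a new sequence.
pushed = seq * dna"A"

# Instead of pop!(seq) or popfirst!(seq): take a slice, which copies.
without_last = seq[1:end-1]
without_first = seq[2:end]

# Instead of deleteat!(seq, 2): join the pieces on either side.
deleted = seq[1:1] * seq[3:end]
```

Each of these allocates a new sequence and leaves `seq` untouched, which is the same trade-off as with strings.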

Advantage

It would allow us to change the storage of LongSequence from:

mutable struct OldSeq
    const data::Vector{UInt64}
    len::UInt
end

struct NewSeq
    data::Memory{UInt64}
    len::UInt
end

That is, it would allow making LongSequence a struct instead of a mutable struct, and it would also allow it to use Memory instead of Vector as backing storage. This saves two memory indirections. More importantly, these indirections may inhibit memory optimisations such as stack-allocating some LongSequences, allocation hoisting, and the like. Such optimisations will be easier for Julia to do on an immutable struct containing a Memory than on a Memory hiding behind two mutable references.
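A small sketch of what the two layouts do and do not allow, assuming Julia 1.11 or later (needed for `Memory`). `OldSeq` and `NewSeq` are the illustrative names from above, not the actual BioSequences.jl definitions.

```julia
# Requires Julia 1.11+ for `Memory`.
mutable struct OldSeq
    const data::Vector{UInt64}
    len::UInt
end

struct NewSeq
    data::Memory{UInt64}
    len::UInt
end

old = OldSeq(zeros(UInt64, 2), UInt(3))
new = NewSeq(fill!(Memory{UInt64}(undef, 2), 0), UInt(3))

# The old layout can change its length field (and grow its Vector).
old.len = 4

# The new layout has a fixed length, but its elements can still be written:
# fixed-length is not the same as immutable.
new.data[1] = 0x2a

ismutabletype(OldSeq)  # true
ismutabletype(NewSeq)  # false
```

So in-place operations that keep the length, such as overwriting individual symbols, would not need to go away under this layout.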

At the moment, Julia doesn't really have any substantial memory optimisations so at the moment, there won't be much advantage to it.

Dec 22 '24 15:12 jakobnissen

At the moment, Julia doesn't really have any substantial memory optimisations so at the moment, there won't be much advantage to it.

If and when this stops being true, I'd be excited about having these changes and helping out where possible.

I personally never use the resizing and conceptually think of BioSequences like bio-flavored Strings with the appropriate validation checks and functions. For example, instead of ASCII or Unicode alphabets with uppercasing and lowercasing functions, the alphabets are nucleic acids or amino acids with (reverse)complement and/or (reverse)translation functions, etc.

If Strings in base Julia are already immutable as you suggest here:

With my proposed suggestion, users would instead need to use immutable operations, just like if they were working with strings.

then making this change would just make BioSequences more consistent with how I (personally, may not be universal) use the library.

Related, my personal experience to:

But: How often do people actually use these operations? I would guess that they are not used very often.

is rarely if ever

Just my perspective but hopefully it is helpful. Thanks as always for your work on BioJulia @jakobnissen !

Dec 22 '24 20:12 cjprybol

@jakobnissen Do we have a decent list of dependents? We could maybe use juliahub to query this, or use some of the work that Timothy has been doing. If we could add downstream tests, we could see how breaking this actually is.

I agree it's unlikely that this is used frequently.

Dec 23 '24 15:12 kescobo

We have downstream tests already, so we could just test it out (and of course do more tests if it looks promising).

Dec 23 '24 15:12 jakobnissen

Hi. A biologist's point of view: Many molecular biologists and population geneticists are concerned with sequence variations (SNPs, indels, CNVs...). SequenceVariation.jl is instrumental for efficiently managing short variant sequences. In the case of a deletion or insertion, the size of the sequence obviously changes. I am preparing a package that will use hash tables to map reference sequences and variants thereof to a limited number of loci on the human genome. If BioSequences are not resizable in future versions, I will have to rewrite everything. I will most likely use Strings as, unfortunately, NGS sequencers output text files (FASTA, FASTQ) and not 2/4-bit files, and regexes are quite efficient. Thanks for the great job done on BioSequences.

Jan 24 '25 11:01 ljournot

I'm not sure I follow. Is the ability to resize biosequences important to represent variations? You could easily have a type that represents an indel without actually doing an operation that changes the length of an existing biosequence. From what I can tell, SequenceVariation.jl does not actually do any operations that modify the length of a BioSequence.

Then you write you would prefer to use strings. But strings are also fixed-length (worse - they are immutable!). So I don't understand how switching to strings would help that issue.

Finally, (slightly offtopic, perhaps), check out the package FASTX.jl for working with FASTA + FASTQ files in Julia, and bio-regex for using regex with a BioSequence.

Jan 24 '25 14:01 jakobnissen

Sorry for not being clear enough. When I get sequences from a FASTQ file (using FASTX), I get some that include deletions or insertions compared to the human genome reference sequence. I want to map those sequences to a limited number of loci on the genome using a hash table (a la Novoalign), which, in my use case, is much faster than the classical kmer-based approach. This requires precomputing all observable variant sequences, including insertions and deletions. The variant sequences are actually stored as (reference sequence, variations), but I have to reconstruct the variant sequences when constructing the dictionary (sequence => locus). As for Strings, you are right and, sorry, it is certainly offtopic.
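As a rough sketch, reconstructing variants for such a dictionary does not strictly need resizing: each variant can be built as a new sequence from slices of the reference. The helpers `insert_at` and `delete_range`, the toy reference, and the locus label are all hypothetical, not part of BioSequences.jl or SequenceVariation.jl, and the sketch assumes BioSequences v3.

```julia
using BioSequences

# Hypothetical helper: return a new sequence with `ins` starting at `pos`.
insert_at(ref::LongDNA{4}, pos::Integer, ins::LongDNA{4}) =
    ref[1:pos-1] * ins * ref[pos:end]

# Hypothetical helper: return a new sequence without the bases in `range`.
delete_range(ref::LongDNA{4}, range::UnitRange{Int}) =
    ref[1:first(range)-1] * ref[last(range)+1:end]

ref = dna"TAGGAC"      # toy reference, for illustration only
locus = "locus_1"      # placeholder label

# Dictionary of sequence => locus, filled with the reference and two variants.
table = Dict{LongDNA{4}, String}()
table[ref] = locus
table[insert_at(ref, 3, dna"TT")] = locus
table[delete_range(ref, 2:3)] = locus
```

The cost is one allocation per variant, which would be needed anyway for each distinct key stored in the dictionary.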

Jan 24 '25 15:01 ljournot

It seems that the arbitrary length of LongSequence has its place given the existence of the Kmers.jl package.

P.S. Just adding this for context (@jakobnissen is clearly aware of this - since he contributed to the Kmers.jl package).

Jun 03 '25 15:06 algunion