Discussion: Make `LongSequence` fixed-length?
This is obviously a breaking change, so this is for the far future if it will ever happen.
Currently, `LongSequence`s are resizable: they support operations like:
julia> seq = dna"TAG";
julia> push!(seq, DNA_A)
4nt DNA Sequence:
TAGA
julia> append!(seq, rna"UAGA")
8nt DNA Sequence:
TAGATAGA
My proposal is to remove all methods on LongSequence that change its size. This includes: resize!, pop!, popfirst!, push!, pushfirst!, filter!, deleteat!, and resizing during copy!.
Disadvantages of proposal
The disadvantages are obvious: Some users may want to do these operations. With my proposed suggestion, users would instead need to use immutable operations, just like if they were working with strings. But: How often do people actually use these operations? I would guess that they are not used very often.
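To make the comparison with strings concrete, here is a minimal sketch using only Base `String`s (not `LongSequence`), showing non-mutating counterparts of `push!`, `pop!`, `deleteat!` and `filter!`. Whether a fixed-length `LongSequence` would offer exactly these spellings is an assumption; the slicing shown also assumes ASCII input.

```julia
# Base Strings are immutable, so every "edit" builds a new value.
s = "TAG"

pushed   = s * "A"              # instead of push!(s, 'A')
popped   = s[1:end-1]           # instead of pop!(s)
deleted  = s[1:1] * s[3:end]    # instead of deleteat!(s, 2)
filtered = filter(!=('A'), s)   # instead of filter!(!=('A'), s)

# The original value is untouched by all of the above.
println((s, pushed, popped, deleted, filtered))
```

Each operation allocates a new value, which is the cost users would pay in exchange for the simpler storage layout.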
Advantage
It would allow us to change the storage of LongSequence from:
mutable struct OldSeq
    const data::Vector{UInt64}
    len::UInt
end
to:
struct NewSeq
    data::Memory{UInt64}
    len::UInt
end
That is, it would allow making LongSequence a struct instead of a mutable struct, and it would also allow them to use Memory instead of Vector as backing storage. This saves two memory indirections.
More importantly, these indirections may inhibit memory optimisations such as stack-allocating some LongSequences, allocation hoisting, and the like. Such optimisations will be easier for Julia to do on an immutable struct containing a Memory than on a Memory hiding behind two mutable references.
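The difference between the two layouts can be checked directly. This is only a sketch: it assumes Julia 1.11 or later (where `Memory` exists), reuses the illustrative `OldSeq` and `NewSeq` names from above, and uses a hypothetical variable name `fixed`. It does not measure any performance effect.

```julia
# Requires Julia 1.11 or later, where `Memory` is available.
mutable struct OldSeq
    const data::Vector{UInt64}   # resizable buffer behind a mutable wrapper
    len::UInt
end

struct NewSeq
    data::Memory{UInt64}         # fixed-size buffer
    len::UInt
end

old   = OldSeq(zeros(UInt64, 1), 3)
fixed = NewSeq(Memory{UInt64}(undef, 1), 3)

old.len = 4                      # allowed: OldSeq is mutable
println(ismutabletype(OldSeq))   # whether the wrapper itself can be mutated
println(ismutabletype(NewSeq))
```

Assigning to `fixed.len` would throw, since fields of an immutable struct cannot be reassigned.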
At the moment, Julia doesn't really have any substantial memory optimisations so at the moment, there won't be much advantage to it.
> At the moment, Julia doesn't really have any substantial memory optimisations so at the moment, there won't be much advantage to it.
If and when this stops being true, I'd be excited about having these changes and helping out where possible.
I personally never use the resizing and conceptually think of BioSequences as bio-flavored Strings with the appropriate validation checks and functions. For example, instead of ASCII or Unicode alphabets with uppercasing and lowercasing functions, the alphabets are nucleic acids or amino acids with (reverse)complement and/or (reverse)translation functions, etc.
If Strings in base Julia are already immutable, as you suggest here:
> With my proposed suggestion, users would instead need to use immutable operations, just like if they were working with strings.
then making this change would just make BioSequences more consistent with how I (personally, may not be universal) use the library.
Relatedly, my personal answer to:
> But: How often do people actually use these operations? I would guess that they are not used very often.
is: rarely, if ever.
Just my perspective but hopefully it is helpful. Thanks as always for your work on BioJulia @jakobnissen !
@jakobnissen Do we have a decent list of dependents? We could maybe use JuliaHub to query this, or use some of the work that Timothy has been doing. If we could add downstream tests, we could see how breaking this actually is.
I agree it is unlikely that this is used frequently.
We have downstream tests already, so we could just test it out (and, of course, do more tests if it looks promising).
Hi
A biologist's point of view:
Many molecular biologists and population geneticists are concerned with sequence variations (SNPs, indels, CNVs...). SequenceVariation.jl is instrumental for efficiently managing short variant sequences. In the case of a deletion or insertion, the size of the sequence obviously changes. I am preparing a package that will use hash tables to map reference sequences and variants thereof to a limited number of loci on the human genome. If BioSequences are not resizable in future versions, I will have to rewrite everything. I will most likely use Strings since, unfortunately, NGS sequencers output text files (FASTA, FASTQ) and not 2/4-bit files, and regexes are quite efficient.
Thanks for the great job done on BioSequences.
I'm not sure I follow. Is the ability to resize biosequences important for representing variations? You could easily have a type that represents an indel without actually doing an operation that changes the length of an existing biosequence. From what I can tell, SequenceVariation.jl does not actually do any operations that modify the length of a BioSequence.
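As a rough illustration of that point, an indel can be stored as a small value and applied by building a new sequence, leaving the reference untouched. The types and the `apply` function below are hypothetical and are not the SequenceVariation.jl API; plain ASCII `String`s stand in for biosequences.

```julia
# Hypothetical types for illustration; not the SequenceVariation.jl API.
struct Deletion
    pos::Int        # 1-based position of the first deleted base
    len::Int        # number of deleted bases
end

struct Insertion
    pos::Int        # inserted bases go after this position (0 = at the start)
    bases::String
end

# Each method returns a new sequence; `ref` is never modified.
# Assumes ASCII input, so byte indices equal character indices.
apply(ref::String, d::Deletion)  = ref[1:d.pos-1] * ref[d.pos+d.len:end]
apply(ref::String, i::Insertion) = ref[1:i.pos] * i.bases * ref[i.pos+1:end]

ref = "TAGATAGA"
println(apply(ref, Deletion(4, 2)))       # TAGAGA
println(apply(ref, Insertion(3, "CC")))   # TAGCCATAGA
```

The variant has a different length from the reference, but no existing value ever changes size.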
Then you write you would prefer to use strings. But strings are also fixed-length (worse - they are immutable!). So I don't understand how switching to strings would help that issue.
Finally, (slightly offtopic, perhaps), check out the package FASTX.jl for working with FASTA + FASTQ files in Julia, and bio-regex for using regex with a BioSequence.
Sorry for not being clear enough.
When I get sequences from a FASTQ file (using FASTX), some of them include deletions or insertions compared to the human genome reference sequence. I want to map those sequences to a limited number of loci on the genome using a hash table (à la Novoalign), which, in my use case, is much faster than the classical kmer-based approach. This requires precomputing all observable variant sequences, including insertions and deletions. The variant sequences are actually stored as (reference sequence, variations), but I have to reconstruct the variant sequences when constructing the dictionary (sequence => locus).
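For what it's worth, building such a dictionary does not seem to need in-place resizing. The sketch below uses plain ASCII `String`s, hypothetical helper names (`single_deletions`, `build_index`), a made-up locus number, and only single-base deletions as the variant model; a real pipeline would enumerate its own set of variations.

```julia
# Hypothetical helper: every variant of `ref` with exactly one base deleted.
# Each variant is a new String; `ref` itself is never resized.
single_deletions(ref::String) =
    [ref[1:i-1] * ref[i+1:end] for i in eachindex(ref)]

# Map each reference and each of its precomputed variants to its locus.
function build_index(loci::Dict{String,Int})
    index = Dict{String,Int}()
    for (ref, locus) in loci
        index[ref] = locus
        for variant in single_deletions(ref)
            index[variant] = locus   # a later entry overwrites on collision
        end
    end
    return index
end

index = build_index(Dict("TAGC" => 101))   # 101 is an illustrative locus id
println(index["TGC"])                      # a read with the A deleted
```

Lookup of a read is then a single hash probe, as in the approach described above.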
For the Strings, you are right and, sorry, it is certainly offtopic.
It seems that the arbitrary length of LongSequence has its place given the existence of the Kmers.jl package.
P.S. Just adding this for context (@jakobnissen is clearly aware of this - since he contributed to the Kmers.jl package).