BioFSharp
[Feature Request] Rework alignment
The Problem
Pairwise alignment in BioFSharp.Algorithms works fine, but the only implemented way to write the alignment out in a correct format is the BioFSharp.IO.Clustal module. Although both generally use the same BioFSharp.Alignment.Alignment type, converting between them can be quite cumbersome.
Solution
Remodel BioFSharp.Algorithms.PairwiseAlignment and BioFSharp.IO.Clustal
- [ ] Add ConservationInfo module to BioFSharp.IO.Clustal or BioFSharp.Alignment
- [ ] Let Clustal functions use BioSeqs instead of Strings
- [ ] Let BioFSharp.Algorithms.PairwiseAlignment functions use BioSeqs as output instead of Nucleotides
- [x] Add create function to Alignment Type in BioFSharp.Alignment
These changes should make using the different alignment functions of different namespaces together easier.
Example of unnecessary conversions
Output type of alignment
Alignment.Alignment<Nucleotides.Nucleotide list, Algorithm.PairwiseAlignment.Score>
Expected input of clustal write function
Alignment.Alignment<BioID.TaggedSequence<string,char>,Clustal.AlignmentInfo>
Needed Conversion
let mappedData =
    alignment.AlignedSequences
    |> List.mapi (fun i (ns:Nucleotides.Nucleotide list) ->
        Seq.map (BioItem.symbol) ns
        |> BioID.createTaggedSequence (sprintf "seq%i" i)
    )
let conservationInfo = String.init firstGeneSeq.Length (fun _ -> "*")
let newHeader = {Header = "Decoy";ConservationInfo = conservationInfo}
let newAlignment = {MetaData = newHeader;AlignedSequences = mappedData}
which is very cumbersome
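Until the remodel lands, the conversion could at least be hidden behind one helper. The following is only a sketch: the record types are simplified local stand-ins for the BioFSharp types named above (so the snippet compiles on its own), and `toClustalAlignment` is a hypothetical name, not an existing library function.

```fsharp
// Simplified local stand-ins for the BioFSharp types mentioned in this issue.
// They are NOT the library definitions; they only make the sketch self-contained.
type TaggedSequence<'T,'S> = { Tag : 'T; Sequence : seq<'S> }
type AlignmentInfo = { Header : string; ConservationInfo : string }
type Alignment<'Sequence,'Metadata> =
    { MetaData : 'Metadata; AlignedSequences : 'Sequence list }

// Hypothetical helper that wraps the manual conversion in a single call.
// `symbol` maps one aligned item to its character (for example BioItem.symbol).
let toClustalAlignment (symbol : 'a -> char) (header : string) (alignment : Alignment<'a list,'Score>) =
    // Tag every aligned sequence as seq0, seq1, ... and map its items to chars.
    let tagged =
        alignment.AlignedSequences
        |> List.mapi (fun i items ->
            { Tag = sprintf "seq%i" i; Sequence = Seq.map symbol items })
    // Number of alignment columns, taken from the first sequence (0 if empty).
    let columns =
        match alignment.AlignedSequences with
        | first :: _ -> List.length first
        | [] -> 0
    // Placeholder conservation line, as in the snippet above: one '*' per column.
    let info = { Header = header; ConservationInfo = String.replicate columns "*" }
    { MetaData = info; AlignedSequences = tagged }
```

With something like this in place, the call site would shrink to `toClustalAlignment BioItem.symbol "Decoy" alignment`, assuming the real types line up with the stand-ins.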
@HLWeil any updates?
Actually there are multiple types that more or less look very similar:
type TaggedSequence<'T,'S> =
    {
        Tag: 'T;
        Sequence: seq<'S>
    }

type FastaItem<'a> = {
    Header : string;
    Sequence : 'a;
}

///General Alignment type used throughout BioFSharp
type Alignment<'Sequence,'Metadata> =
    {
        ///Additional information for this alignment
        MetaData : 'Metadata;
        ///List of aligned Sequences
        Sequences : seq<'Sequence>;
    }
Replacing the Alignment type with the TaggedSequence type might actually cost some conciseness, but I think in general it would be good if these types had seamless interop.
Also, the FastaItem type might actually be replaceable with the TaggedSequence type with some minor adjustments.
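To make the interop idea concrete, here is a rough sketch of what conversions between the two types could look like. The types are copied from above so the snippet stands alone; `toFastaItem` and `ofFastaItem` are made-up names for illustration, not existing BioFSharp functions.

```fsharp
// Types as quoted above, redefined locally so the sketch is self-contained.
type TaggedSequence<'T,'S> = { Tag : 'T; Sequence : seq<'S> }
type FastaItem<'a> = { Header : string; Sequence : 'a }

// Hypothetical conversion: a string-tagged sequence maps directly onto a
// FastaItem whose payload is the same seq.
let toFastaItem (tagged : TaggedSequence<string,'S>) : FastaItem<seq<'S>> =
    { Header = tagged.Tag; Sequence = tagged.Sequence }

// Hypothetical inverse: the FASTA header becomes the tag.
let ofFastaItem (item : FastaItem<seq<'S>>) : TaggedSequence<string,'S> =
    { Tag = item.Header; Sequence = item.Sequence }
```

The mapping is only lossless when the tag is a string and the FASTA payload is a seq, which hints at the "minor adjustments" a full replacement would need.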
What do you think?