biojava icon indicating copy to clipboard operation
biojava copied to clipboard

Is there a faster usage method for the "countCompounds()" method?

Open tolgap opened this issue 9 years ago • 9 comments

I'm trying to implement an amino acid frequency calculator. I've finished it but I'm using Sequence.countCompounds and it is incredibly slow. I'm profiling this with the JVM Monitor plugin for Eclipse and most time is spent putting things in Hashes inside the SequenceMixin.countCompounds method. For illustration, using countCompounds to eventually get the AAF for a single DNA sequence of about 1000 compounds is ~20 seconds. That is an insane amount of time. Here's a quick example to reproduce this:

Read the fasta file:

public static void main(String[] args) throws IOException {
        File file = new File("genomic.fasta");
        FastaReader<DNASequence, NucleotideCompound> proxy =
                new FastaReader<DNASequence, NucleotideCompound>(
                        file,
                        new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(),
                        new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet()));
        LinkedHashMap<String, DNASequence> dnaSequences = proxy.process();

        for (DNASequence dna : dnaSequences.values()) {
            Map<Frame, Sequence<AminoAcidCompound>> proteins = translateDNAString(dna.getSequenceAsString());
            StringMapWritable freqs = aminoAcidFrequency(proteins);
            System.out.println(freqs);
        }
}

The translateDNAString I use:

public static Map<Frame, Sequence<AminoAcidCompound>> translateDNAString(String value) {
        DNASequence dna = new DNASequence(value, AmbiguityDNACompoundSet.getDNACompoundSet());
        TranscriptionEngine engine = new TranscriptionEngine.Builder()
                .dnaCompounds(AmbiguityDNACompoundSet.getDNACompoundSet())
                .rnaCompounds(AmbiguityRNACompoundSet.getDNACompoundSet())
                .build();
        Frame[] sixFrames = Frame.getAllFrames();
        Map<Frame, Sequence<AminoAcidCompound>> results = engine.multipleFrameTranslation(dna, sixFrames);

        return results;
}

And these two methods are used for calculating the frequency. Here, I use countCompounds:

public static StringMapWritable aminoAcidFrequency(Map<Frame, Sequence<AminoAcidCompound>> proteins) {
        Map<String, Integer> acidCount = new LinkedHashMap<String, Integer>();

        int count;
        String key;
        for (Sequence<AminoAcidCompound> seq : proteins.values()) {
//          We only want the large proteins
            if (seq.getLength() < 120) continue;
            for (AminoAcidCompound compound : seq.getAsList()) {
                key = compound.getShortName();
//              Skip the protein stop codon
                if (key.equals("*")) continue;

                int compoundCount = seq.countCompounds(compound);
                if (!acidCount.containsKey(key)) {
                    acidCount.put(key, 0);
                }
                count = acidCount.get(key) + compoundCount;
                acidCount.put(key, count);
            }
        }

        return calculateFrequency(acidCount);
}

public static StringMapWritable calculateFrequency(Map<String, Integer> values) {
        int total = 0;
        for (Integer i : values.values()) {
            total += i;
        }

        StringMapWritable frequencies = new StringMapWritable();
        for (Entry<String, Integer> entry : values.entrySet()) {
            float freq = entry.getValue() / (float) total;
            frequencies.put(entry.getKey(), freq);
        }

        return frequencies;
}

Where StringMapWritable is a helper class that extends Map that overrides the toString method, nothing else.

As you can see in aminoAcidFrequency(proteins) I need to loop over every protein and every AminoAcidCompound in order to call countCompounds on the protein sequence. This builds up the compounds inside SequenceMixin two times for every call.

Is there any better way to do this?

tolgap avatar Oct 13 '14 09:10 tolgap

In order to keep the first post a bit short, I'll paste the relevant files here:

SequenceMixin.countCompounds() https://github.com/biojava/biojava/blob/master/biojava3-core/src/main/java/org/biojava3/core/sequence/template/SequenceMixin.java#L49

AbstractSequence.countCompounds() (delegates to SequenceMixin) https://github.com/biojava/biojava/blob/master/biojava3-core/src/main/java/org/biojava3/core/sequence/template/AbstractSequence.java#L607

tolgap avatar Oct 13 '14 09:10 tolgap

Tolga

Have you tried getting the protein sequence as a string and iterating yourself through each amino acid one character as time? SequenceMixin.countCompounds is ok for asking the count of one amino acid but to do that for all amino acids does incur a additional overhead each time you ask and could probably be optimized. Probably easiest for you to do your own method and iterate through the sequence once not per amino acid.

Scooter

On Mon, Oct 13, 2014 at 5:45 AM, Tolga Paksoy [email protected] wrote:

I'm trying to implement an amino acid frequency calculator. I've finished it but I'm using Sequence.countCompounds and it is incredibly slow. I'm profiling this with the JVM Monitor plugin for Eclipse and most time is spent putting things in Hashes inside the SequenceMixin.countCompounds method. For illustration, using countCompounds to eventually get the AAF for a single DNA sequence of about 1000 compounds is ~20 seconds. That is an insane amount of time. Here's a quick example to reproduce this:

Read the fasta file:

public static void main(String[] args) throws IOException { File file = new File("/home/tolga/Work/Erasmus/Classification/Corona/corona.1.genomic.fasta"); FastaReader<DNASequence, NucleotideCompound> proxy = new FastaReader<DNASequence, NucleotideCompound>( file, new GenericFastaHeaderParser<DNASequence, NucleotideCompound>(), new DNASequenceCreator(AmbiguityDNACompoundSet.getDNACompoundSet())); LinkedHashMap<String, DNASequence> dnaSequences = proxy.process();

    for (DNASequence dna : dnaSequences.values()) {
        Map<Frame, Sequence<AminoAcidCompound>> proteins = translateDNAString(dna.getSequenceAsString());
        StringMapWritable freqs = aminoAcidFrequency(proteins);
        System.out.println(freqs);
    }}

The translateDNAString I use:

public static Map<Frame, Sequence<AminoAcidCompound>> translateDNAString(String value) { DNASequence dna = new DNASequence(value, AmbiguityDNACompoundSet.getDNACompoundSet()); TranscriptionEngine engine = new TranscriptionEngine.Builder() .dnaCompounds(AmbiguityDNACompoundSet.getDNACompoundSet()) .rnaCompounds(AmbiguityRNACompoundSet.getDNACompoundSet()) .build(); Frame[] sixFrames = Frame.getAllFrames(); Map<Frame, Sequence<AminoAcidCompound>> results = engine.multipleFrameTranslation(dna, sixFrames);

    return results;}

And these two methods are used for calculating the frequency. Here, I use countCompounds:

public static StringMapWritable aminoAcidFrequency(Map<Frame, Sequence<AminoAcidCompound>> proteins) { Map<String, Integer> acidCount = new LinkedHashMap<String, Integer>();

    int count;
    String key;
    for (Sequence<AminoAcidCompound> seq : proteins.values()) {//          We only want the large proteins
        if (seq.getLength() < 120) continue;
        for (AminoAcidCompound compound : seq.getAsList()) {
            key = compound.getShortName();//              Skip the protein stop codon
            if (key.equals("*")) continue;

            int compoundCount = seq.countCompounds(compound);
            if (!acidCount.containsKey(key)) {
                acidCount.put(key, 0);
            }
            count = acidCount.get(key) + compoundCount;
            acidCount.put(key, count);
        }
    }

    return calculateFrequency(acidCount);}

public static StringMapWritable calculateFrequency(Map<String, Integer> values) { int total = 0; for (Integer i : values.values()) { total += i; }

    StringMapWritable frequencies = new StringMapWritable();
    for (Entry<String, Integer> entry : values.entrySet()) {
        float freq = entry.getValue() / (float) total;
        frequencies.put(entry.getKey(), freq);
    }

    return frequencies;}

Where StringMapWritable is a helper class that extends Map that overrides the toString method, nothing else.

As you can see in aminoAcidFrequency(proteins) I need to loop over every protein and every AminoAcidCompound in order to call countCompounds on the protein sequence. This builds up the compounds inside SequenceMixin two times for every call.

Is there any better way to do this?

— Reply to this email directly or view it on GitHub https://github.com/biojava/biojava/issues/195.

willishf avatar Oct 13 '14 11:10 willishf

@willishf Yes I eventually went that path. But it forces me to think about ambiguity myself when I would like the Biojava library to do it for me.

Perhaps a SequenceMixin.getCompounds() without parameters that would return the count of all compounds as a LinkedHashMap<<? extends Compound>, Integer> might be a good idea?

tolgap avatar Oct 13 '14 12:10 tolgap

Since that probably wouldn't be a method called frequently as a core api probably better to leave it as a user defined class. We are trying to keep the core api stable and flexible. If for some reason a user defined class couldn't do this because a key method was private or the abstraction makes it tough then could make changes to fit in the flexible requirement. On Oct 13, 2014 8:10 AM, "Tolga Paksoy" [email protected] wrote:

@willishf https://github.com/willishf Yes I eventually went that path. But it forces me to think about ambiguity myself when I would like the Biojava library to do it for me.

Perhaps a SequenceMixin.getCompounds() without parameters that would return the count of all compounds as a LinkedHashMap<<? extends Compound>, Integer> might be a good idea?

— Reply to this email directly or view it on GitHub https://github.com/biojava/biojava/issues/195#issuecomment-58883229.

willishf avatar Oct 13 '14 12:10 willishf

Got it, then we can close this issue :+1: .

tolgap avatar Oct 13 '14 12:10 tolgap

Just for the record, there is support for counts and distributions in biojava 1.x http://biojava.org/wiki/BioJava:CookBookLegacy#Counts_and_Distributions

heuermh avatar Oct 13 '14 14:10 heuermh

Agreed we should have the functionality where the question is does it belong in core? I view the functionality as a query of the data model where nice to have a concrete data model that is stable. Difficult to determine all the queries possible of the data and why those type of features would go in another module.

On Mon, Oct 13, 2014 at 10:31 AM, Michael L Heuer [email protected] wrote:

Just for the record, there is support for counts and distributions in biojava 1.x http://biojava.org/wiki/BioJava:CookBookLegacy#Counts_and_Distributions

— Reply to this email directly or view it on GitHub https://github.com/biojava/biojava/issues/195#issuecomment-58900054.

willishf avatar Oct 13 '14 14:10 willishf

I agree it would be nice to have some utility methods for that, something along the lines of what @tolgap is suggesting. Should fit well to the already existing set of tools in -core. I also agree with @willishf that there should be limits to how far we go, however this still seems to be basic enough so that it should get supported.

andreasprlic avatar Oct 13 '14 15:10 andreasprlic

:+1: Sounds like a nice to have and I agree it is basic enough to be in core.

josemduarte avatar Oct 13 '14 15:10 josemduarte