How to set E-value to compare number of SSU rRNA hits in libraries with different sequencing effort

Open mooreryan opened this issue 4 years ago • 3 comments

I would like to use SortMeRNA to estimate the number of SSU rRNA reads in my libraries. Let's say I have 3 libraries: one with 1,000,000 reads, one with 2,000,000 reads, and one with 10,000,000 reads.

In the docs, the E-value is described like this:

> An E-value of 1 signifies that one random alignment is expected for aligning all reads against the reference database. The E-value is computed for the entire search space, not per read.

If I ran each library with an E-value of 1, then the larger libraries would effectively have a "stricter" alignment threshold than the smaller libraries, correct? E.g.,

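| Library | No. reads | E-value | Expected random alignments per read |
|---------|-----------|---------|-------------------------------------|
| A | 1,000,000 | 1 | 1E-06 |
| B | 2,000,000 | 1 | 5E-07 |
| C | 10,000,000 | 1 | 1E-07 |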

In the case above, library C would be filtered more strictly than library A because of its larger size, even though both use the same E-value.
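The per-read column in the table above can be reproduced with a short Python sketch. It assumes, as the question does, that the expected number of random alignments per read is simply the E-value divided by the number of reads; the function name and the library dictionary are illustrative and not part of SortMeRNA.

```python
def expected_random_alignments_per_read(e_value: float, num_reads: int) -> float:
    """Approximate per-read rate, assuming rate = E-value / number of reads."""
    if num_reads <= 0:
        raise ValueError("num_reads must be positive")
    return e_value / num_reads


# Library sizes taken from the example in the question.
libraries = {"A": 1_000_000, "B": 2_000_000, "C": 10_000_000}

# Same E-value (1) for every library, so the per-read rate shrinks with size.
per_read = {
    name: expected_random_alignments_per_read(1.0, reads)
    for name, reads in libraries.items()
}

for name, rate in per_read.items():
    print(f"{name}: {rate:.0e}")
```

With a fixed E-value, the largest library ends up with the smallest per-read rate, which is the "stricter" filtering described above.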

In that case, if I wanted to screen each library for rRNA reads at the same level of "strictness", I would need to adjust the E-value according to the size of the library. Something like this:

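| Library | No. reads | E-value | Expected random alignments per read |
|---------|-----------|---------|-------------------------------------|
| A | 1,000,000 | 1 | 1E-06 |
| B | 2,000,000 | 2 | 1E-06 |
| C | 10,000,000 | 10 | 1E-06 |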

or this:

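| Library | No. reads | E-value | Expected random alignments per read |
|---------|-----------|---------|-------------------------------------|
| A | 1,000,000 | 0.1 | 1E-07 |
| B | 2,000,000 | 0.2 | 1E-07 |
| C | 10,000,000 | 1 | 1E-07 |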

Is this interpretation correct? And if so, do you have guidelines on selecting a good E-value? (E.g., 1 random alignment per 1,000,000 reads or 1/100,000,000, or something like that.)
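The scaling proposed in the two tables above can be sketched in Python as follows. This is only an illustration under the assumption that the search space grows linearly with the number of reads when the reference database stays the same; `scale_e_value` is a hypothetical helper, not a SortMeRNA function, and it does not say which per-read rate is a good choice.

```python
def scale_e_value(reference_e_value: float, reference_reads: int, num_reads: int) -> float:
    """Scale an E-value so the per-read rate matches that of a reference library.

    Assumes per-read rate = E-value / number of reads.
    """
    if reference_reads <= 0 or num_reads <= 0:
        raise ValueError("read counts must be positive")
    return reference_e_value * num_reads / reference_reads


# Library sizes taken from the example in the question.
libraries = {"A": 1_000_000, "B": 2_000_000, "C": 10_000_000}

# First table: library A with an E-value of 1 is the reference.
scaled_from_a = {
    name: scale_e_value(1.0, libraries["A"], reads)
    for name, reads in libraries.items()
}

# Second table: library C with an E-value of 1 is the reference.
scaled_from_c = {
    name: scale_e_value(1.0, libraries["C"], reads)
    for name, reads in libraries.items()
}

for name in libraries:
    print(name, scaled_from_a[name], scaled_from_c[name])
```

Each scaled value would then be supplied as the E-value threshold for the corresponding library's run.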

mooreryan avatar Feb 28 '20 01:02 mooreryan

Hello,

Yes, this looks right: the E-value is the number of random alignments/reads that pass the filter (without sharing a true similarity) during the full computation. Modified as proposed, it becomes something close to the number of random alignments "per read" that pass the filter ...

Jenya

ekopylova avatar Feb 28 '20 13:02 ekopylova

Thanks, that makes sense. Are there any guidelines on reasonable E-values to set?

mooreryan avatar Feb 29 '20 18:02 mooreryan

This is a good question, and very neatly asked at that. I wish I could answer it just as neatly, but no luck, I'm afraid. The E-value concept comes from BLAST (see e.g. here or here). There are explanations of what the E-value is, where it comes from, how to calculate it, etc., but not much on how to choose it from the user's perspective. I guess if there were an easy answer, it could just as well be incorporated into the algorithm.

biocodz avatar Mar 01 '20 20:03 biocodz