Max coverage and unmapped reads
@mbhall88 noted a potential problem with the --max_covg parameter that can affect pandora map and pandora compare: unmapped reads enter the coverage calculation although they do not contribute to mapping.
In more detail: let's say we have a read x. If x is so short that there isn't even enough length for a window to find a minimizer, then its sketch is going to be empty and it won't enter the coverage computation (this is fine from my POV):
https://github.com/rmcolq/pandora/blob/493be42f61898ce0382597e534f71894a19c1da1/src/utils.cpp#L445-L464
However, let's say x is long enough that we can sketch it. Then x is going to enter the coverage computation, but suppose we can't map x anywhere. Consider the case where we can't map 20% of the reads (even though they were sketched), so we only use information from 80% of the reads. This means that if we are using --max_covg = 300, and if all reads have equal length, then our "useful" coverage is in fact 0.8 * 300 = 240.
What might be better to do is to compute the coverage based on the mapped reads only.
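To make the proposal concrete, here is a minimal sketch contrasting the two behaviours; the struct and function names are made up for illustration and do not correspond to pandora's actual code:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical illustration only: names do not match pandora's code.
struct ReadInfo {
    uint64_t length;   // read length in bases
    bool sketched;     // read was long enough to yield minimizers
    bool mapped;       // at least one cluster of hits survived filtering
};

// Current behaviour (roughly): every sketched read counts towards --max_covg.
uint64_t covg_bases_current(const std::vector<ReadInfo>& reads) {
    uint64_t bases = 0;
    for (const auto& r : reads)
        if (r.sketched) bases += r.length;
    return bases;
}

// Proposed behaviour: only reads that actually mapped count, so the cap
// reflects "useful" coverage (e.g. 300x of mapped data rather than 240x).
uint64_t covg_bases_proposed(const std::vector<ReadInfo>& reads) {
    uint64_t bases = 0;
    for (const auto& r : reads)
        if (r.mapped) bases += r.length;
    return bases;
}
```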
There are also some smaller issues to discuss related to this parameter:
- When multithreading was added, we lost mapping determinism (two runs can map slightly different reads. If --max_covg is not exceeded then all reads are mapped and the two runs map the same set of reads. However, if --max_covg is reached, then we might map a slightly different set of reads). Should we care about keeping mapping determinism?
- In the case of ONT reads, let's say with the x longest reads we reach c coverage, whereas if we choose random reads, we would need y > x reads to reach c coverage. My guess is that it would be interesting to map the longer reads before the shorter ones, as they could potentially span several loci and provide us with ordering info (see the sketch after this list). It would be a shame to miss the longest reads because we processed shorter ones first due to --max_covg.
- Since mapping is now multithreaded and lighter in RAM, there is also the possibility of removing this parameter (although by default it is set to 300x, and I don't see how going above this threshold would bring anything new...). However, when working with bacteria with plastic genomes, it might be hard to infer the true genome size for each sample, which in turn affects the --max_covg parameter.
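As a rough illustration of the longest-reads-first idea, the sketch below (hypothetical names and signature, not pandora code) selects reads until the coverage cap max_covg * genome_size is reached, optionally sorting by length first so the cap is hit with the fewest, longest reads:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical sketch: take reads until max_covg * genome_size bases are selected.
// Sorting by length first means the cap is reached with the fewest, longest reads.
std::vector<uint64_t> select_reads(std::vector<uint64_t> read_lengths,
                                   double max_covg, uint64_t genome_size,
                                   bool longest_first) {
    if (longest_first)
        std::sort(read_lengths.begin(), read_lengths.end(),
                  std::greater<uint64_t>());
    const double max_bases = max_covg * static_cast<double>(genome_size);
    std::vector<uint64_t> selected;
    double bases = 0;
    for (uint64_t len : read_lengths) {
        if (bases >= max_bases) break;
        selected.push_back(len);
        bases += static_cast<double>(len);
    }
    return selected;
}
```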
Is this really an issue? What would be the priority for this?
Leandro and I were speaking about this before. Basically, the thing that worries me is that say you sketch a read and only one of its kmers hits somewhere. Then !sequence.sketch.empty() equates to true and, as Leandro said, this read eats into the estimated coverage.
Would it make more sense to have some kind of threshold that divides hits by read length? I think having some sort of filtering on how we select reads could be really powerful. They did this in WhatsHap (summary in Section 2.2 here) and it was quite powerful (although they are solving a slightly different problem, I still think we could leverage this).
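For example, a hits-per-base filter along those lines might look like this; purely a hypothetical sketch, and min_hits_per_base is an assumed tuning parameter, not an existing pandora option:

```cpp
#include <cstddef>

// Hypothetical filter sketch: only let a read count towards coverage (or be
// used at all) if its minimizer hits are dense enough relative to its length.
bool read_passes_hit_density(std::size_t num_hits, std::size_t read_length,
                             double min_hits_per_base) {
    if (read_length == 0) return false;
    const double density = static_cast<double>(num_hits) / read_length;
    return density >= min_hits_per_base;
}
```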
Interesting thoughts here!
- Are short reads causing an issue at the moment? For Illumina reads they should all be the same length and there is no problem. For nanopore, I had assumed this problem was pretty rare. If only a handful (e.g. 30) of reads are 40 bp, they contribute 30*40/5000000 to the coverage calculation, which doesn't seem like much to me.
- Counting mapped reads may be better, but there is value in having the full coverage information as well. It is interesting to see how the mean kmer coverage compares to the total coverage. At the moment the mean kmer coverage or some sort of model fit is used in some calculations as a better estimate of coverage.
- Size selection: be wary. It is true that longer reads are more informative about ordering. However, I believe that only using long reads would bias against finding some things, e.g. short plasmids. For hybrid bacterial assembly, they got better assemblies from a random subsample than from the longest reads.
- I don't care personally about mapping determinism. I think making it clear that you can get a deterministic answer by subsetting reads beforehand is fine.
Yeah, I agree about size selection. I wouldn't advocate for selecting reads based purely on size, but maybe some kind of metric that takes into account read length, hits, (maybe quality scores?), what day of the week it is etc.
i have lots to say on this, v interesting stuff. will formulate thoughts and post tonight
ok, some ideas:
- we know we will do at least 2 passes of the reads (quasimap, and then compare)
- reads could have value for genotyping (good evidence of hits across at least one gene, we could even record mean quality score across a detected gene) or ordering (but maybe easier to ignore ordering for now).
- we could parse all reads once during quasimap, and index a "score" for each, and on the second pass (compare) only use the ones that are most informative for genotyping. nice thing about slicing reads in first pass, is you have more granularity at second pass - a specific read might be very informative for one gene but low quality/dirty over another. anyway i can imagine parsing the scores and choosing the top 30 or 50 reads for a gene based on score, and then only going back and reading the sequence, for the second pass.
- hm - recording a score per read per gene is expensive (see the sketch below for one way this might be bounded).
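One way the cost concern in that last point might be bounded: keep, per gene, only the N best-scoring read ids during the first pass (e.g. with a small min-heap), so memory grows with genes * N rather than one score per read/gene pair. A hedged sketch with made-up names, not pandora's actual data structures:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the two-pass idea: during quasimap, remember per gene
// only the N best-scoring reads; at compare time, fetch those read ids.
struct ScoredRead {
    double score;      // e.g. mean base quality over the hits within this gene
    uint64_t read_id;
    bool operator>(const ScoredRead& other) const { return score > other.score; }
};

class TopReadsPerGene {
public:
    explicit TopReadsPerGene(std::size_t n) : n_(n) {}

    // Record a read's score for a gene, keeping only the n_ best per gene.
    void add(uint32_t gene_id, double score, uint64_t read_id) {
        auto& heap = best_[gene_id];
        heap.push({score, read_id});
        if (heap.size() > n_) heap.pop();  // drop the current worst read
    }

    // Read ids to revisit for this gene on the second pass.
    std::vector<uint64_t> reads_for(uint32_t gene_id) const {
        std::vector<uint64_t> ids;
        auto it = best_.find(gene_id);
        if (it == best_.end()) return ids;
        auto heap = it->second;  // copy so we can drain it
        while (!heap.empty()) { ids.push_back(heap.top().read_id); heap.pop(); }
        return ids;
    }

private:
    using MinHeap = std::priority_queue<ScoredRead, std::vector<ScoredRead>,
                                        std::greater<ScoredRead>>;
    std::size_t n_;
    std::unordered_map<uint32_t, MinHeap> best_;
};
```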
I want to dig this up again.
Based on discussions which I don't think have been documented here, I think it might be best if we remove the --max_covg parameter from pandora. It doesn't make sense for pandora to have the 'responsibility' of filtering reads down to a specified coverage. In addition, our current approach to doing this (taking the first N reads) could have some serious flaws, e.g. reads can sometimes be grouped by channel in nanopore fastqs. We can always put a disclaimer in the README and tell people to subsample to whatever coverage they like using pre-existing tools (this issue is actually the reason I wrote rasusa).
OK with me to close
Well, we'll need to remove this parameter and its associated functionality in the code before closing.
I think this is a subtask of https://github.com/rmcolq/pandora/issues/231
The purpose of this was originally twofold: the primary purpose was to control RAM use by placing a cap, and the secondary purpose was to downsample so the same coverage threshold applied in the multi-sample genotyping process.
Sounds good. RAM is now controlled, and we use other tools directly on the fastq to subsample.
This is needed for scalability; label added.