ShortStack
Low number of miRNAs identified
Hello! So I’m trying to use ShortStack for miRNA identification from sorghum root samples. To test the efficiency of our small RNA library protocol, we initially sequenced a small pool of 20 samples (which I will call Pool 1). When I ran ShortStack on them, everything seemed to work fine and 37 miRNAs were identified. We later wanted to test more samples, so we sequenced Pool 2 (consisting of 80 samples). However, this pool only yielded 12 miRNAs according to ShortStack. So I’m trying to figure out what might be causing this drastic difference between Pool 1 and Pool 2, especially since Pool 2 has more samples and I was expecting an equal or greater number of miRNAs to be identified.
Issue: Pool 2 has far fewer miRNAs identified than Pool 1 (12 vs. 37) despite having more samples (80 vs. 20). Why?
Pool 1 – 37 miRNAs
Pool 2 – 12 miRNAs
Sequencing read depth?
- Pool 1: 20 samples; average read depth of ~6.7 million reads (~⅕ of the samples had fewer than 5 million reads)
- Pool 2: 80 samples; average read depth of ~6.6 million reads (a little more than a third of the samples had fewer than 5 million reads); one outlier with 75 million reads (the average without the outlier is ~5.8 million reads)
- To test whether sequencing read depth played a role in why Pool 2 didn’t yield as many miRNAs, I re-ran ShortStack while excluding samples with low read counts. I also decided to run Pools 1 and 2 together, because in theory the miRNAs identified in Pool 1 should still show up in the results even if they aren’t found in Pool 2. Results:
Pool 1&2, more than 5 million reads: 15 miRNAs
Pool 1&2, more than 5 million reads, no outlier: 10 miRNAs
Pool 1&2, more than 1 million reads, no outlier: 12 miRNAs
- Unfortunately, even after running Pools 1 and 2 together and excluding the low-depth samples, very few miRNAs were identified. Why? (The read counting and sample filtering were done roughly as sketched below.)
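In case it matters, here is roughly how the per-sample read counts and the depth filter were handled. This is a minimal sketch rather than my exact script: the pools/ directory layout, file names, genome path, and the 5-million-read cutoff are placeholders, it assumes uncompressed FASTQ input, and it assumes the space-separated multi-file form of --readfile (worth double-checking against ShortStack --help for 3.8.5).

```bash
#!/usr/bin/env bash
# Count reads per sample (a FASTQ record is 4 lines) and keep only
# samples above a depth cutoff; paths and cutoff are placeholders.
CUTOFF=5000000
KEEP=()

for fq in pools/pool1/*.fastq pools/pool2/*.fastq; do
    reads=$(( $(wc -l < "$fq") / 4 ))
    echo "$fq: $reads reads"
    if (( reads >= CUTOFF )); then
        KEEP+=("$fq")
    fi
done

# Re-run ShortStack on the retained samples only.
ShortStack \
    --readfile "${KEEP[@]}" \
    --genomefile sorghum_genome.fa \
    --outdir shortstack_pools1and2_min5M
```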
Bad sample interfering with algorithm? Sample size?
- I then wondered if the difference in sample size had any impact on the algorithm (since Pool 1 had 20 samples and Pool 2 had 80). Additionally, I wondered if there was one bad sample that was perhaps formatted incorrectly or had some other issue that was messing up the ShortStack run somehow.
- So I divided Pool 2 into sets of 20 and ran them separately. The results are below:
Pool 2, samples 1-20: 36 miRNAs
Pool 2, samples 21-40: 0 miRNAs
Pool 2, samples 41-60: 18 miRNAs
Pool 2, samples 61-80: 10 miRNAs
- I found these results interesting, since the first set (1-20) had 36 miRNAs identified, which is comparable to Pool 1.
- The 3rd (41-60) and 4th (61-80) sets didn’t surprise me too much, since I had ordered the samples by decreasing total miRNA counts according to the initial Pool 2 run, so I would expect the later sets to have fewer miRNAs.
- But set 2 (21-40) having 0 miRNAs identified is a little strange. So, to see if there might be a problem sample mixed in somewhere, I further subdivided it into four subsets of 5 samples each. The results are below:
Pool 2, samples 21-40, subset 1 (5 samples): 33 miRNAs
Pool 2, samples 21-40, subset 2 (5 samples): 30 miRNAs
Pool 2, samples 21-40, subset 3 (5 samples): 30 miRNAs
Pool 2, samples 21-40, subset 4 (5 samples): 28 miRNAs
- These results are a little confusing to me, since they all seem fine. So why did running samples 21-40 together cause issues, while running them in sets of 5 was fine? (Each batch was run with the same pattern, sketched below.)
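For completeness, every batch run used the same invocation, just with a different sample list. Again a hedged sketch rather than my exact script: the batches/ list files, genome path, and output names are placeholders.

```bash
#!/usr/bin/env bash
# Run ShortStack separately on each batch of Pool 2 samples.
# Each batches/*.txt file lists the FASTQ paths for one batch.
GENOME=sorghum_genome.fa

for list in batches/pool2_01-20.txt batches/pool2_21-40.txt \
            batches/pool2_41-60.txt batches/pool2_61-80.txt; do
    name=$(basename "$list" .txt)
    mapfile -t FILES < "$list"   # read the per-batch sample paths
    ShortStack \
        --readfile "${FILES[@]}" \
        --genomefile "$GENOME" \
        --outdir "shortstack_${name}"
done
```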
Additional Notes: I’ve been using version 3.8.5, since a labmate of mine used that version and I wanted to keep my results comparable to his. But I could switch to the most recent version if you think that would help. I’ve also been using all of the defaults, though I have considered changing --mincov to something like 0.5 to increase sensitivity. I’ve been running everything in a Conda environment with the same script (modifying only the input samples) for all of the runs. A sketch of the setup and the --mincov change I’m considering is below.
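For reproducibility, the environment setup and the --mincov experiment would look roughly like this. The bioconda channel/package name, the example file names, and whether 0.5 actually lowers the threshold relative to the 3.8.5 default are all assumptions to verify against ShortStack --help.

```bash
# Pin ShortStack in a dedicated Conda environment
# (assumes the bioconda "shortstack" package provides 3.8.5).
conda create -n shortstack-3.8.5 -c bioconda -c conda-forge shortstack=3.8.5
conda activate shortstack-3.8.5

# Same run as before, but with an explicit cluster-coverage threshold.
# Check `ShortStack --help` to confirm how 0.5 compares to the default.
ShortStack \
    --readfile pool2_sample01.fastq pool2_sample02.fastq \
    --genomefile sorghum_genome.fa \
    --mincov 0.5 \
    --outdir shortstack_pool2_mincov0.5
```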
Do you have any ideas on why Pool 2 doesn’t seem to be working normally? Any help would be greatly appreciated. Thanks!