dDocent
Simple Suggestions To Increase Speed of Freebayes
suggestion: apply a minimum coverage filter to cov.stats prior to making the mapped.*.bed files, which will ultimately reduce the number of contigs genotyped.
minimum coverage value: either the number of individuals or 2x the number of individuals. There's really not much point in evaluating a contig if there's not at least 1 read per allele per locus per individual.
speedup: 2x in my current situation, plus the time saved later in filtering.
implementation:
# minimum coverage threshold: 2 x NumInd - 1
minCOV=$(( $(wc -l < namelist) * 2 - 1 ))
# list contigs that have at least one interval below the threshold
mawk -v minCOV=$minCOV '$4 < minCOV {print $1}' cov.stats | uniq > low.cov.contigs
# keep only the lines for contigs that are not in the low-coverage list
grep -vF -f low.cov.contigs cov.stats > low.cov.stats
# apply changes to cov.stats
mv low.cov.stats cov.stats
# clean up intermediate files
rm low.cov.contigs
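As an aside, the same filter can be written as a single mawk pass with exact name matching, which sidesteps a grep -f pitfall: with -F, one contig name can match as a substring of another. This is a sketch, not part of the original suggestion:
# pass 1 (NR==FNR): record contigs with any interval below minCOV
# pass 2: print only lines whose contig name is not an exact key in that set
mawk -v minCOV=$minCOV 'NR==FNR {if ($4 < minCOV) low[$1]; next} !($1 in low)' cov.stats cov.stats > cov.stats.tmp
mv cov.stats.tmp cov.stats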
This does help a lot. I've also been removing intervals that are clearly too large or small to be RAD fragments, which further reduces the variant calling overhead.
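Chris's actual code for this isn't posted in the thread, but a hypothetical sketch of such a size filter might look like the following; the 100 and 1000 bp bounds are placeholders for the expected RAD fragment range:
# keep only intervals whose length falls within the expected fragment size range
mawk -v minLEN=100 -v maxLEN=1000 '($3 - $2) >= minLEN && ($3 - $2) <= maxLEN' cov.stats > cov.stats.tmp
mv cov.stats.tmp cov.stats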
Hey Chris,
Can you post your code for that?
I'm additionally experimenting with shuffling the cov.stats file rather than sorting it. Will let you all know how that pans out.
I'm afraid to apply the contig splitting measure because of all the new error handling code for freebayes. Not sure if the error handling code is related to the splitting, but I don't get any errors with the older code for making the mapped.*.bed files with no splitting and my own parallelization scheme.
The other thing I'm doing is dividing up the individuals among nodes and genotyping them separately, with the intention of combining the VCF files after some mild filtering to cut down file size. I'm jumping off the cliff without looking to see if there's a smooth landing, but I'm in a rush and there are open nodes...
It might be too early to tell, but I think that this is evidence that shuffling the cov.stats is effective at evenly distributing load across all of the threads:
-rw-r--r-- 1 cbird hobi 17M Nov 14 04:13 raw.12.5.5.vcf
-rw-r--r-- 1 cbird hobi 18M Nov 14 04:13 raw.10.5.5.vcf
-rw-r--r-- 1 cbird hobi 15M Nov 14 04:13 raw.19.5.5.vcf
-rw-r--r-- 1 cbird hobi 18M Nov 14 04:13 raw.8.5.5.vcf
-rw-r--r-- 1 cbird hobi 19M Nov 14 04:13 raw.2.5.5.vcf
-rw-r--r-- 1 cbird hobi 17M Nov 14 04:14 raw.18.5.5.vcf
-rw-r--r-- 1 cbird hobi 14M Nov 14 04:14 raw.11.5.5.vcf
-rw-r--r-- 1 cbird hobi 14M Nov 14 04:14 raw.14.5.5.vcf
-rw-r--r-- 1 cbird hobi 19M Nov 14 04:14 raw.4.5.5.vcf
-rw-r--r-- 1 cbird hobi 19M Nov 14 04:14 raw.3.5.5.vcf
-rw-r--r-- 1 cbird hobi 18M Nov 14 04:14 raw.16.5.5.vcf
-rw-r--r-- 1 cbird hobi 16M Nov 14 04:14 raw.9.5.5.vcf
-rw-r--r-- 1 cbird hobi 16M Nov 14 04:14 raw.15.5.5.vcf
-rw-r--r-- 1 cbird hobi 18M Nov 14 04:14 raw.20.5.5.vcf
-rw-r--r-- 1 cbird hobi 16M Nov 14 04:14 raw.5.5.5.vcf
-rw-r--r-- 1 cbird hobi 17M Nov 14 04:14 raw.6.5.5.vcf
-rw-r--r-- 1 cbird hobi 18M Nov 14 04:14 raw.7.5.5.vcf
-rw-r--r-- 1 cbird hobi 17M Nov 14 04:14 raw.17.5.5.vcf
-rw-r--r-- 1 cbird hobi 17M Nov 14 04:14 raw.1.5.5.vcf
-rw-r--r-- 1 cbird hobi 20M Nov 14 04:14 raw.13.5.5.vcf
-rw-r--r-- 1 cbird hobi 24M Nov 14 04:19 raw.20.5.5.vcf
-rw-r--r-- 1 cbird hobi 22M Nov 14 04:19 raw.2.5.5.vcf
-rw-r--r-- 1 cbird hobi 23M Nov 14 04:19 raw.19.5.5.vcf
-rw-r--r-- 1 cbird hobi 24M Nov 14 04:19 raw.16.5.5.vcf
-rw-r--r-- 1 cbird hobi 19M Nov 14 04:19 raw.3.5.5.vcf
-rw-r--r-- 1 cbird hobi 25M Nov 14 04:19 raw.8.5.5.vcf
-rw-r--r-- 1 cbird hobi 26M Nov 14 04:19 raw.17.5.5.vcf
-rw-r--r-- 1 cbird hobi 22M Nov 14 04:20 raw.7.5.5.vcf
-rw-r--r-- 1 cbird hobi 18M Nov 14 04:20 raw.11.5.5.vcf
-rw-r--r-- 1 cbird hobi 23M Nov 14 04:20 raw.13.5.5.vcf
-rw-r--r-- 1 cbird hobi 20M Nov 14 04:20 raw.10.5.5.vcf
-rw-r--r-- 1 cbird hobi 22M Nov 14 04:20 raw.5.5.5.vcf
-rw-r--r-- 1 cbird hobi 18M Nov 14 04:20 raw.15.5.5.vcf
-rw-r--r-- 1 cbird hobi 24M Nov 14 04:20 raw.4.5.5.vcf
-rw-r--r-- 1 cbird hobi 20M Nov 14 04:20 raw.14.5.5.vcf
-rw-r--r-- 1 cbird hobi 25M Nov 14 04:20 raw.6.5.5.vcf
-rw-r--r-- 1 cbird hobi 26M Nov 14 04:20 raw.12.5.5.vcf
-rw-r--r-- 1 cbird hobi 20M Nov 14 04:20 raw.18.5.5.vcf
-rw-r--r-- 1 cbird hobi 25M Nov 14 04:20 raw.9.5.5.vcf
-rw-r--r-- 1 cbird hobi 22M Nov 14 04:20 raw.1.5.5.vcf
To apply this change, replace
| sort -V -k1,1 -k2,2 |
with
| shuf |
I'm pretty sure that code is not repeated elsewhere.
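To see why shuffling balances the load, here is an illustrative sketch (not dDocent's actual chunking code; the 20-way split mirrors the 20 raw.*.vcf files above, and split -n is a GNU extension):
# split shuffled intervals into 20 equal-line chunks
shuf cov.stats | split -n l/20 - shuf.chunk.
# the summed coverage per chunk should now be roughly equal
for f in shuf.chunk.*; do mawk '{s += $4} END {print FILENAME, s}' "$f"; done
rm shuf.chunk.*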
After the raw.*.vcf files are created, they need to be sorted prior to combining:
if [ ! -d "raw.vcf" ]; then
	mkdir raw.vcf
fi
# sort each raw VCF: header lines (which don't start with "dDocent") stay first,
# then the records (which do, for a de novo reference) sorted by contig and position
ls raw.*.vcf | parallel --no-notice -j $NUMProc "cat <(grep -v '^dDocent' {}) <(grep '^dDocent' {} | sort -V -k1,1 -k2,2) > ./raw.vcf/{} "
rm raw.*.vcf
echo ""; echo " "`date` "Assembling final VCF file..."
# combine into one VCF, rewriting missing genotypes from "." to "./."
vcfcombine ./raw.vcf/raw.*.vcf | sed -e 's/ \.\:/ \.\/\.\:/g' > TotalRawSNPs.vcf
To reiterate, this should solve the following type of behavior where the first contigs finish much sooner than the others and then lead to wasted processing power:
-rw-r--r-- 1 cbird hobi 32M Nov 14 04:02 raw.1.5.5.vcf
-rw-r--r-- 1 cbird hobi 46M Nov 14 04:07 raw.2.5.5.vcf
-rw-r--r-- 1 cbird hobi 63M Nov 14 04:14 raw.4.5.5.vcf
-rw-r--r-- 1 cbird hobi 52M Nov 14 04:15 raw.3.5.5.vcf
-rw-r--r-- 1 cbird hobi 72M Nov 14 04:19 raw.5.5.5.vcf
-rw-r--r-- 1 cbird hobi 87M Nov 14 04:25 raw.6.5.5.vcf
-rw-r--r-- 1 cbird hobi 135M Nov 14 04:26 raw.11.5.5.vcf
-rw-r--r-- 1 cbird hobi 202M Nov 14 04:28 raw.16.5.5.vcf
-rw-r--r-- 1 cbird hobi 105M Nov 14 04:28 raw.7.5.5.vcf
-rw-r--r-- 1 cbird hobi 178M Nov 14 04:28 raw.15.5.5.vcf
-rw-r--r-- 1 cbird hobi 113M Nov 14 04:28 raw.10.5.5.vcf
-rw-r--r-- 1 cbird hobi 146M Nov 14 04:28 raw.13.5.5.vcf
-rw-r--r-- 1 cbird hobi 248M Nov 14 04:28 raw.18.5.5.vcf
-rw-r--r-- 1 cbird hobi 165M Nov 14 04:28 raw.14.5.5.vcf
-rw-r--r-- 1 cbird hobi 313M Nov 14 04:28 raw.20.5.5.vcf
-rw-r--r-- 1 cbird hobi 143M Nov 14 04:28 raw.12.5.5.vcf
-rw-r--r-- 1 cbird hobi 120M Nov 14 04:28 raw.9.5.5.vcf
-rw-r--r-- 1 cbird hobi 227M Nov 14 04:28 raw.17.5.5.vcf
-rw-r--r-- 1 cbird hobi 123M Nov 14 04:28 raw.8.5.5.vcf
-rw-r--r-- 1 cbird hobi 294M Nov 14 04:28 raw.19.5.5.vcf
Please feel free to submit the minimum coverage filter as a pull request. This seems reasonable.
I'm not sure about shuffling the cov.stats file. Yes, this would distribute the load more evenly, but it achieves this by splitting up high coverage contigs and by breaking up the first 100 or so contigs, which tend to be the most abundant loci. Line 462 already works to create calling intervals that are even across coverage. The TT variable could be made smaller on line 460 to generate smaller calling intervals, but this also comes at a cost, because each initialization of FreeBayes is costly.
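For context, a conceptual sketch of that kind of coverage-budgeted chunking (not dDocent's actual code on lines 460-462): intervals are written to the current mapped.*.bed file until their summed coverage exceeds the budget TT, at which point a new file is started.
# conceptual sketch only; TT is the per-chunk coverage budget
mawk -v TT=$TT 'BEGIN {i = 1}
  {print $1 "\t" $2 "\t" $3 > ("mapped." i ".bed"); s += $4}
  s > TT {s = 0; i++}' cov.stats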
Also, the re-sorting step presented above only works for de novo references, since it relies on record lines starting with the dDocent contig prefix.
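A reference-agnostic variant of that re-sort would key on the standard '#' header prefix instead of the contig name (a sketch, not tested inside dDocent itself):
# keep VCF header lines first, then sort the records by contig and position
ls raw.*.vcf | parallel --no-notice -j $NUMProc "cat <(grep '^#' {}) <(grep -v '^#' {} | sort -V -k1,1 -k2,2) > ./raw.vcf/{}"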
I tested this on my real-world linkage map data and the method suggested by @cbird808 removes all of my contigs:
> minCOV=$(echo $(($(wc -l namelist | cut -d" " -f1) * 2 - 1)))
> mawk -v minCOV=$minCOV '$4 < minCOV {print $1}' cov.stats | uniq > low.cov.contigs
> wc -l low.cov.contigs
5135 low.cov.contigs
> grep -f low.cov.contigs -vF cov.stats > low.cov.stats
> wc -l low.cov.stats
0 low.cov.stats
This is the cov.stats file. Note that each contig spans many intervals here, so a single low-coverage interval is enough to blacklist an entire contig, and grep -F matches contig names as substrings (Talbacares_1 also matches Talbacares_10, Talbacares_11, and so on), which removes even more:
> head -50 cov.stats
Talbacares_1 1495 1771 37
Talbacares_1 1795 2058 32
Talbacares_1 2864 3285 11
Talbacares_1 3785 3924 4
Talbacares_1 3957 4118 36
Talbacares_1 4129 4194 6
Talbacares_1 4205 4484 26
Talbacares_1 5412 5740 9
Talbacares_1 5764 6082 21
Talbacares_1 6409 6633 2
Talbacares_1 6998 7284 12
Talbacares_1 7307 7477 17
Talbacares_1 7485 7806 62
Talbacares_1 7819 8055 24
Talbacares_1 8066 8362 21
Talbacares_1 8531 8781 8
Talbacares_1 8831 9168 42
Talbacares_1 9525 9846 27
Talbacares_1 10104 10239 6
Talbacares_1 10263 10491 10
Talbacares_1 11277 11616 36
Talbacares_1 11640 12142 48
Talbacares_1 12154 12477 50
Talbacares_1 12536 12787 30
Talbacares_1 12798 13044 14
Talbacares_1 16584 16831 27
Talbacares_1 16842 17203 50
Talbacares_1 17227 17446 10
Talbacares_1 17616 17977 49
Talbacares_1 17984 18284 25
Talbacares_1 20320 20600 114
Talbacares_1 20611 20838 18
Talbacares_1 20959 21523 109
Talbacares_1 21547 21837 18
Talbacares_1 21881 22119 51
Talbacares_1 22125 22355 6
Talbacares_1 22837 23289 50
Talbacares_1 26759 26999 9
Talbacares_1 27023 27368 10
Talbacares_1 27392 27492 2
Talbacares_1 28138 28620 66
Talbacares_1 28636 28976 11
Talbacares_1 29983 30319 42
Talbacares_1 30330 30560 24
Talbacares_1 31494 31768 46
Talbacares_1 31791 32129 95
Talbacares_1 32153 32364 9
Talbacares_1 32391 33084 457
Talbacares_1 33190 33506 32
Talbacares_1 33530 33625 4
It seems odd that your coverage is so low.
Story of my life. My data is janky b/c we had to whole-genome-amplify all the samples prior to using ddRAD and there's a noticeable amount of bias from PCR and missing data.
To follow up on this, the filtering should be an opt-in or opt-out to avoid situations like above.
Perhaps we can put in something logical like > 1, but yes, anything more should be optional.
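A minimal sketch of that conservative default, filtering per interval so it stays safe for reference-based data like the example above:
# drop intervals with total coverage of 1 or less before the mapped.*.bed files are made
mawk '$4 > 1' cov.stats > cov.stats.tmp && mv cov.stats.tmp cov.stats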
That seems like a safe compromise.