Lighter icon indicating copy to clipboard operation
Lighter copied to clipboard

Test - error report - 0 corrected reads?

Open desmodus1984 opened this issue 3 years ago • 7 comments

Hi,

I sequenced a genome and I got 16 paired-end libraries. To test the speed of the program, I picked on library ~11 Gb each .gz file, and used the following code: ./lighter -r A-407_1.fq.gz -r A-407_2.fq.gz -K 15 2300000000 -t 48 -od test

I checked the report, and it is: [2021-02-19 23:57:45] =============Start==================== [2021-02-19 23:57:45] Scanning the input files to infer alpha(sampling rate) [2021-02-20 00:01:37] Average coverage is 7.923 and alpha is 0.883 [2021-02-20 00:01:41] Bad quality threshold is "5" [2021-02-20 00:07:35] Finish sampling kmers [2021-02-20 00:07:35] Bloom filter A's false positive rate: 0.000000 [2021-02-20 00:18:33] Finish storing trusted kmers [2021-02-20 00:37:52] Finish error correction Processed 182234180 reads: 0 are error-free Corrected 0 bases(0.000000 corrections for reads with errors) Trimmed 0 reads with average trimmed bases 0.000000 Discard 0 reads

It looks very weird because 0 reads are error-free, and also looks like none were corrected?

Any suggestions of this result?

Thanks;

desmodus1984 avatar Feb 20 '21 07:02 desmodus1984

The false positive rate of bloom filter A is 0 may suggest that Lighter did not sample any kmers. This maybe due to the computed bad quality score threshold "5" is too high. Can you try run Lighter with option --noQual?

There could be another issue. It seems the read coverage is relatively low, and Lighter is not able to distinguish trust and untrusted kmers As a result, the error correction results might not be optimal.

mourisl avatar Feb 20 '21 07:02 mourisl

Hi,

Please let me ask you another question. How much coverage does one need for error correction? I will do whole-genome resequencing, and to reduce variant call error I thought on doing error correction, but my WGS coverage will be about 10X, per individual - same species. Will that be sufficient for error correction?

Also, what is the effect of genome size on error correction? I used MaSurca for genome assembly, it used a k-mer size of 67, and estimated a genome size of ~2.3 GB. Then, I found that there is a program to find the optimal k-mer size, kmergenie, and the best k-mer size was 65, and the genome size was ~ 2.5 GB - which is the value that I finally used for error correction.

Finally, I tried my whole data set, and I got the following report. Does it look fine? [2021-02-20 01:05:11] =============Start====================[2021-02-20 01:05:11] Scanning the input files to infer alpha(sampling rate)[2021-02-20 01:33:43] Average coverage is 60.467 and alpha is 0.116[2021-02-20 01:33:47] Bad quality threshold is "6"[2021-02-20 02:20:05] Finish sampling kmers[2021-02-20 02:20:05] Bloom filter A's false positive rate: 0.000000[2021-02-20 04:25:09] Finish storing trusted kmers[2021-02-20 07:13:32] Finish error correctionProcessed 1511674590 reads: 1453940690 are error-free Corrected 58814672 bases(1.018720 corrections for reads with errors) Trimmed 0 reads with average trimmed bases 0.000000 Discard 0 reads Thanks;

Juan Pablo Aguilar Cabezas

Ecology and Evolutionary Biology Ph.D. Student

Department of Biological Sciences

Ohio University, Athens OH


From: Li Song [email protected] Sent: Saturday, February 20, 2021 2:54 AM To: mourisl/Lighter [email protected] Cc: Aguilar Cabezas, Juan Pablo [email protected]; Author [email protected] Subject: Re: [mourisl/Lighter] Test - error report - 0 corrected reads? (#34)

The false positive rate of bloom filter A is 0 may suggest that Lighter did not sample any kmers. This maybe due to the computed bad quality score threshold "5" is too high. Can you try run Lighter with option --noQual?

There could be another issue. It seems the read coverage is relatively low, and Lighter is not able to distinguish trust and untrusted kmers As a result, the error correction results might not be optimal.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmourisl%2FLighter%2Fissues%2F34%23issuecomment-782582011&data=04%7C01%7Cja569116%40ohio.edu%7C17c083e90a304629e39f08d8d574c2d0%7Cf3308007477c4a70888934611817c55a%7C0%7C0%7C637494044774273702%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=0w2YqOAiXGmVMr5JPZ6dMU8lijJPhRBbOC%2BGQ7O%2FGDQ%3D&reserved=0, or unsubscribehttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAJWD2VLJZAGAUT3R45HUOPLS75TDTANCNFSM4X5SNRXQ&data=04%7C01%7Cja569116%40ohio.edu%7C17c083e90a304629e39f08d8d574c2d0%7Cf3308007477c4a70888934611817c55a%7C0%7C0%7C637494044774283691%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=7S%2Bca8EvuTXJa6%2F5fJ1tMtx5MvGRt3Mc1oVIu%2Bv3jbw%3D&reserved=0.

desmodus1984 avatar Feb 20 '21 20:02 desmodus1984

From my own experience, Lighter works well with coverage as low as 15X coverage. When <10, Lighter may fail to correct any errors. For example, in this study, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01988-3/figures/2 , Lighter did not correct any errors with 8X coverage. I think this is also the combined effects of read coverage and kmer size. If you set a large kmer size, there might still be statistical powers to distinguish trusted and untrusted kmers in Lighter.

On your second run when combining all samples, it seems the correction sensitivity is low. I think kmer size 15 might be too small given the genome size is 2.3G. You can either try -K 31, or change the MAX_KMER_LENGTH in "utils.h" to 128, recompile Lighter and run Lighter with -K 65. When setting MAX_KMER_LENGTH > 32, Lighter will be slower.

Hope this helps. Thank you!

mourisl avatar Feb 20 '21 21:02 mourisl

Hi,

Wow, I expected a different result from the largest ~60X dataset. Have you heard about Athena? https://www.nature.com/articles/s41598-019-52196-4#Sec16 [https://media.springernature.com/m685/springer-static/image/art%3A10.1038%2Fs41598-019-52196-4/MediaObjects/41598_2019_52196_Fig1_HTML.png]https://www.nature.com/articles/s41598-019-52196-4#Sec16 Athena: Automated Tuning of k-mer based Genomic Error Correction Algorithms using Language Modelshttps://www.nature.com/articles/s41598-019-52196-4#Sec16 www.nature.com It is supposed to be a software for selecting the best k-mer size, and according to them, for a human genomic dataset just like mine ( Dataset Coverage #Reads Read Length Genome Type D7 67X 202M 101 bp Homo sapiens ), in my case the genome is smaller, the best k-mer size was 15 or 17. The only difference is that instead of a read length of 101, my reads are 100 long.

I am running a test with the small dataset with k-mer size of 65, what should be the expectation? What should I get for false discoverity rate - to be considered good?

Thanks;

Juan Pablo Aguilar Cabezas

Ecology and Evolutionary Biology Ph.D. Student

Department of Biological Sciences

Ohio University, Athens OH


From: Li Song [email protected] Sent: Saturday, February 20, 2021 4:57 PM To: mourisl/Lighter [email protected] Cc: Aguilar Cabezas, Juan Pablo [email protected]; Author [email protected] Subject: Re: [mourisl/Lighter] Test - error report - 0 corrected reads? (#34)

From my own experience, Lighter works well with coverage as low as 15X coverage. When <10, Lighter may fail to correct any errors. For example, in this study, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01988-3/figures/2https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgenomebiology.biomedcentral.com%2Farticles%2F10.1186%2Fs13059-020-01988-3%2Ffigures%2F2&data=04%7C01%7Cja569116%40ohio.edu%7C197e08874b8d4b62294b08d8d5ea86dc%7Cf3308007477c4a70888934611817c55a%7C0%7C0%7C637494550553587015%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=oM32ANZEVOQmjVk2TZ%2B%2Bi0BBjftoLEHIyfPgbyPLWzI%3D&reserved=0 , Lighter did not correct any errors with 8X coverage. I think this is also the combined effects of read coverage and kmer size. If you set a large kmer size, there might still be statistical powers to distinguish trusted and untrusted kmers in Lighter.

On your second run when combining all samples, it seems the correction sensitivity is low. I think kmer size 15 might be too small given the genome size is 2.3G. You can either try -K 31, or change the MAX_KMER_LENGTH in "utils.h" to 128, recompile Lighter and run Lighter with -K 65. When setting MAX_KMER_LENGTH > 32, Lighter will be slower.

Hope this helps. Thank you!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmourisl%2FLighter%2Fissues%2F34%23issuecomment-782755356&data=04%7C01%7Cja569116%40ohio.edu%7C197e08874b8d4b62294b08d8d5ea86dc%7Cf3308007477c4a70888934611817c55a%7C0%7C0%7C637494550553587015%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=sjGUph4Ka9CUgw6o4R4JERbz6MNvq5YNtb%2BtLxmIlyw%3D&reserved=0, or unsubscribehttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAJWD2VL6U6SDYSMLY3L3XALTAAV4ZANCNFSM4X5SNRXQ&data=04%7C01%7Cja569116%40ohio.edu%7C197e08874b8d4b62294b08d8d5ea86dc%7Cf3308007477c4a70888934611817c55a%7C0%7C0%7C637494550553597009%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=BxYmMgi4%2BUgDnOtpjGVhN7T%2FqpC3Ewo8Tn7hOf7Smu4%3D&reserved=0.

desmodus1984 avatar Feb 20 '21 23:02 desmodus1984

For the human genome, the kmer size should be at least 19. Since 3G/4^19 is about 1%, this suggests that by random chance, two kmers from two different locations have the sequence is about 1%. In practice, due to repeats, we may want to have a larger kmer size. So I think in many studies, kmer size are picking as 31 to be the balance of computation efficiency (can be represented by a 64 bit variable) and kmer's uniqueness.

mourisl avatar Feb 21 '21 00:02 mourisl

Thanks for the information.

The small dataset test is finished with K-mer size of 65, and I got the following report: [2021-02-20 18:17:52] =============Start====================[2021-02-20 18:17:52] Scanning the input files to infer alpha(sampling rate)[2021-02-20 18:21:43] Average coverage is 7.289 and alpha is 0.960[2021-02-20 18:21:47] Bad quality threshold is "5"[2021-02-20 18:27:12] Finish sampling kmers[2021-02-20 18:27:12] Bloom filter A's false positive rate: 0.000000[2021-02-20 18:38:00] Finish storing trusted kmers[2021-02-20 18:57:12] Finish error correctionProcessed 182234180 reads: 182234180 are error-free Corrected 0 bases(0.000000 corrections for reads with errors) Trimmed 0 reads with average trimmed bases 0.000000 Discard 0 reads I am sending my samples for sequencing at BGI in China, and according to a preliminary analysis, BGI-500 has a larger substitution error rate than Illumina's platforms. BGI does a QC using Soap2 software, all the reads have a PHRED score equal or higher than 30. Is this possible to be all the reads error-free - using a k-mer size of 65?

Thanks;

Juan Pablo Aguilar Cabezas

Ecology and Evolutionary Biology Ph.D. Student

Department of Biological Sciences

Ohio University, Athens OH


From: Li Song [email protected] Sent: Saturday, February 20, 2021 7:01 PM To: mourisl/Lighter [email protected] Cc: Aguilar Cabezas, Juan Pablo [email protected]; Author [email protected] Subject: Re: [mourisl/Lighter] Test - error report - 0 corrected reads? (#34)

For the human genome, the kmer size should be at least 19. Since 3G/4^19 is about 1%, this suggests that by random chance, two kmers from two different locations have the sequence is about 1%. In practice, due to repeats, we may want to have a larger kmer size. So I think in many studies, kmer size are picking as 31 to be the balance of computation efficiency (can be represented by a 64 bit variable) and kmer's uniqueness.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmourisl%2FLighter%2Fissues%2F34%23issuecomment-782768038&data=04%7C01%7Cja569116%40ohio.edu%7C3d329b4442ca4345726a08d8d5fbc9b5%7Cf3308007477c4a70888934611817c55a%7C0%7C0%7C637494624702378121%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=5Rg2ONzJURJYOhXqDOImif4QTrWvE1Z%2Fvc1nxsvK%2FDc%3D&reserved=0, or unsubscribehttps://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAJWD2VIJT7FCAJAJ2ECYEODTABEMFANCNFSM4X5SNRXQ&data=04%7C01%7Cja569116%40ohio.edu%7C3d329b4442ca4345726a08d8d5fbc9b5%7Cf3308007477c4a70888934611817c55a%7C0%7C0%7C637494624702388119%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=oMs2AB%2Ba85cyTXR%2BxjPiH2Iikc3n3sv6JDng9OOLEmA%3D&reserved=0.

desmodus1984 avatar Feb 21 '21 00:02 desmodus1984

It's unlikely to have no error reads. Since your data already has some filtrations with quality score, I would suggest to use "--noQual" option in Lighter. I feel Lighter would still fail to correct given 8X coverage and this high value of alpha (0.960). I would suggest using BFC, Musket or BLESS. If you plan to run all the samples together (~60X coverage), Lighter should work.

mourisl avatar Feb 21 '21 02:02 mourisl