NextPolish

How to polish with long reads and short reads in Single-End mode

Open gitcruz opened this issue 4 years ago • 5 comments

Hi,

I have tried polishing an assembly with short reads (Hi-C reads with the -unpaired option in the cfg) and long reads (NextDenovo error-corrected reads). But it is taking longer than expected (previously, 5 iterations with PacBio corrected reads took 1 day). So I am worried that it is not working well, but I don't see any error in the pid.log. Below is the output of tail -f:

[INFO] 2020-10-08 19:18:29,252 total jobs: 1
[INFO] 2020-10-08 19:18:29,254 Throw jobID:[25602] jobCmd: s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/out-Lynruf5-3long-2short/00.lgs_polish/02.map.ref.sh.work/map_genome0/Lynruf5-3long-2short.sh] in the local_cycle.
[INFO] 2020-10-08 22:06:26,494 align_genome done
[INFO] 2020-10-08 22:06:26,500 analysis tasks done
[INFO] 2020-10-08 22:06:26,505 total jobs: 1
[INFO] 2020-10-08 22:06:26,507 Throw jobID:[4844] jobCmd:[s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/out-Lynruf5-3long-2short/00.lgs_polish/03.merge.bam.sh.work/merge_bam0/Lynruf5-3long-2short.sh] in the local_cycle.
[INFO] 2020-10-08 22:18:06,920 merge_bam done
[INFO] 2020-10-08 22:18:06,926 analysis tasks done
[INFO] 2020-10-08 22:18:06,930 total jobs: 1
[INFO] 2020-10-08 22:18:06,931 Throw jobID:[5628] jobCmd:[s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/out-Lynruf5-3long-2short/00.lgs_polish/04.polish.ref.sh.work/polish_genome0/Lynruf5-3long-2short.sh] in the local_cycle.
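To judge whether a run is stalled or just slow, it can help to turn the timestamps in this log into per-step durations. The following is a minimal sketch, not part of NextPolish: the helper name `step_durations` is hypothetical, and it assumes each step starts at a "Throw jobID" line and ends at the next "<step> done" line, which is how the excerpt above reads.

```python
import re
from datetime import datetime

# Log lines copied from the excerpt above (only the lines that carry timing).
LOG = """\
[INFO] 2020-10-08 19:18:29,254 Throw jobID:[25602] jobCmd: s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/out-Lynruf5-3long-2short/00.lgs_polish/02.map.ref.sh.work/map_genome0/Lynruf5-3long-2short.sh] in the local_cycle.
[INFO] 2020-10-08 22:06:26,494 align_genome done
[INFO] 2020-10-08 22:06:26,500 analysis tasks done
[INFO] 2020-10-08 22:06:26,507 Throw jobID:[4844] jobCmd:[s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/out-Lynruf5-3long-2short/00.lgs_polish/03.merge.bam.sh.work/merge_bam0/Lynruf5-3long-2short.sh] in the local_cycle.
[INFO] 2020-10-08 22:18:06,920 merge_bam done
[INFO] 2020-10-08 22:18:06,926 analysis tasks done
[INFO] 2020-10-08 22:18:06,931 Throw jobID:[5628] jobCmd:[s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/out-Lynruf5-3long-2short/00.lgs_polish/04.polish.ref.sh.work/polish_genome0/Lynruf5-3long-2short.sh] in the local_cycle.
"""

# Captures the timestamp and the message of an "[INFO] <timestamp> <message>" line.
LINE_RE = re.compile(r"^\[INFO\] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (.*)$")


def step_durations(log_text):
    """Return [(step_name, timedelta)] for each 'Throw jobID' -> '<step> done' pair."""
    durations = []
    started = None  # timestamp of the most recent job submission, if still open
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        message = match.group(2)
        if message.startswith("Throw jobID"):
            started = stamp
        elif message == "analysis tasks done":
            continue  # summary line, not a step of its own
        elif message.endswith(" done") and started is not None:
            durations.append((message[: -len(" done")], stamp - started))
            started = None
    return durations


for name, elapsed in step_durations(LOG):
    print(name, elapsed)
```

In this excerpt the last "Throw jobID" line (the polish_genome step) has no matching "done" line yet, so it is not reported: the step had been submitted but had not finished when the log was captured.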

I am afraid I might be passing some of the instructions wrong. Perhaps the iterations are not "12" for single-end mode:

[General]
job_type = local
job_prefix = asm5-3long-2short
task = 555121212
rewrite = yes
rerun = 2
parallel_jobs = 1
multithread_jobs = 24
genome = ./input_assembly/LynRuf5.fa
genome_size = auto
workdir = ./out-asm5-3long-2short
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = ./sample4-illumina_hic.fofn
sgs_options = -unpaired -max_depth 30

[lgs_option]
lgs_fofn = ./sample4-corrected_pacbio.fofn
lgs_options = -min_read_len 10k -max_read_len 135k -max_depth 40
lgs_minimap2_options = -x map-pb -t 6
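For anyone checking a `task` value like the one above, here is a small sketch of how the string can be read. It assumes, based on the NextPolish documentation, that module 5 is the long-read algorithm and modules 1 to 4 are short-read algorithms; the helper name `describe_task` is hypothetical and is not part of NextPolish.

```python
from itertools import groupby

# Assumed mapping (from the NextPolish documentation): module 5 polishes with
# long reads, modules 1-4 polish with short reads.
LONG_READ_MODULES = {"5"}
SHORT_READ_MODULES = {"1", "2", "3", "4"}


def step_kind(step):
    """Classify one character of the task string as a long- or short-read step."""
    if step in LONG_READ_MODULES:
        return "long"
    if step in SHORT_READ_MODULES:
        return "short"
    raise ValueError(f"unexpected task step: {step!r}")


def describe_task(task):
    """Split a task string into consecutive runs of long- or short-read steps."""
    return [(kind, "".join(steps)) for kind, steps in groupby(task, key=step_kind)]


plan = describe_task("555121212")
for kind, steps in plan:
    print(kind, steps, f"({len(steps)} steps)")
```

Under that assumption, `555121212` reads as three long-read steps followed by three "12" pairs of short-read steps, which is worth comparing against the "3long-2short" label in the job prefix.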

The read fofn files look like this:

::::::::::::::
../s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/lru4-corrected_pacbio.fofn
::::::::::::::
reads/corrected_pacbio/cns0.fasta
reads/corrected_pacbio/cns1.fasta
reads/corrected_pacbio/cns2.fasta
reads/corrected_pacbio/cns3.fasta
reads/corrected_pacbio/cns4.fasta
reads/corrected_pacbio/cns5.fasta
reads/corrected_pacbio/cns6.fasta
::::::::::::::
../s03.3_p02.6_NextPolish_v1.1.0_long_and_short_reads/lru4-illumina_hic.fofn
::::::::::::::
reads/illumina_raw_hic/Lru-4_hic_R1.fastq.gz
reads/illumina_raw_hic/Lru-4_hic_R2.fastq.gz

NOTE: I removed sensitive info about the genome from the absolute paths. I want to use the Hi-C reads in single-end mode because the two mates of a pair can come from distant locations in the genome.

Please let me know if I am doing anything wrong.

Thanks, F

gitcruz avatar Oct 15 '20 11:10 gitcruz

Hi, in general, I do not recommend polishing with error-corrected reads; just use raw reads. Error-corrected reads may contain systematic errors (introduced by the error-correction step). BTW, ~30-40x reads are not enough if you want a high-accuracy assembly. I also do not recommend polishing with single-end reads, because single-end reads map randomly in highly repetitive regions.

moold avatar Oct 15 '20 13:10 moold

OK, I understand. Then I will set things up to polish with raw PacBio reads (~70x). The corrected reads have ~40x coverage.
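For reference, coverage figures like these are usually obtained by dividing the total number of read bases by the genome size. A minimal sketch of that arithmetic follows; the function name `estimate_depth` and the toy numbers are hypothetical and are not taken from this thread.

```python
def estimate_depth(read_lengths, genome_size):
    """Mean sequencing depth: total read bases divided by the genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


# Hypothetical toy numbers, chosen only to illustrate the calculation.
toy_genome_size = 1_000_000            # 1 Mb genome
toy_read_lengths = [20_000] * 3_500    # 3,500 reads of 20 kb = 70 Mb of sequence

print(estimate_depth(toy_read_lengths, toy_genome_size))
```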

In a previous test with PacBio corrected reads I ran 5 iterations. But the 2nd one actually seems to be the best (better BUSCO and fewer misassemblies when compared to a close reference). I think you recommend 3 for long-read polishing, right?

Thank you very much, Fernando

gitcruz avatar Oct 15 '20 13:10 gitcruz

Hi, 2-3 iterations are fine, but the final accuracy of an assembly depends on short-read polishing.

moold avatar Oct 16 '20 01:10 moold

@moold Hi, do you mean that we should not do any form of cleaning on the raw reads, such as with BBTools, before using them in NextPolish?

Rob-murphys avatar Oct 27 '20 12:10 Rob-murphys

Not exactly. I mean that if you want a high-accuracy genome, whether or not you polished it with long reads, you should polish the genome with short reads in the last step. It is difficult to produce a high-accuracy genome using noisy long reads only. Of course, it's better to do some cleaning on the raw short reads to remove low-QV reads.

moold avatar Oct 27 '20 13:10 moold