
Error correction of PE reads failed. Check pe.cor.log.

Open jterol opened this issue 6 years ago • 3 comments

Hi!

I'm trying to run MaSuRCA with Illumina paired-end libraries and PacBio long reads.

Here is my config file:

DATA
#Illumina paired end reads supplied as <forward_reads> <reverse_reads>
#if single-end, do not specify <reverse_reads>
#MUST HAVE Illumina paired end reads to use MaSuRCA
PE= pe 515 13 /home/jterol/PacBio/ivia000_1.fastq /home/jterol/PacBio/ivia000_2.fastq
#Illumina mate pair reads supplied as <forward_reads> <reverse_reads>
#pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped
#if you have both types of reads supply them both as NANOPORE type
PACBIO=/home/jterol/PacBio/PACBIO_clem.fa
#NANOPORE=/FULL_PATH/nanopore.fa
#Other reads (Sanger, 454, etc) one frg file, concatenate your frg files into one if you have many
#OTHER=/FULL_PATH/file.frg
END

PARAMETERS
#set this to 1 if your Illumina jumping library reads are shorter than 100bp
EXTEND_JUMP_READS=0
#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content
GRAPH_KMER_SIZE = auto
#set this to 1 for all Illumina-only assemblies
#set this to 0 if you have more than 15x coverage by long reads (Pacbio or Nanopore) or any other long reads/mate pairs (Illumina MP, Sanger, 454, etc)
USE_LINKING_MATES = 0
#specifies whether to run mega-reads correction on the grid
USE_GRID=0
#specifies queue to use when running on the grid MANDATORY
GRID_QUEUE=all.q
#batch size in the amount of long read sequence for each batch on the grid
GRID_BATCH_SIZE=300000000
#use at most this much coverage by the longest Pacbio or Nanopore reads, discard the rest of the reads
LHE_COVERAGE=25
#set to 1 to only do one pass of mega-reads, for faster but worse quality assembly
MEGA_READS_ONE_PASS=0
#this parameter is useful if you have too many Illumina jumping library mates. Typically set it to 60 for bacteria and 300 for the other organisms
LIMIT_JUMP_COVERAGE = 300
#these are the additional parameters to Celera Assembler. do not worry about performance, number or processors or batch sizes -- these are computed automatically.
#set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.
CA_PARAMETERS = cgwErrorRate=0.15
#minimum count k-mers used in error correction 1 means all k-mers are used. one can increase to 2 if Illumina coverage >100
KMER_COUNT_THRESHOLD = 1
#whether to attempt to close gaps in scaffolds with Illumina data
CLOSE_GAPS=1
#auto-detected number of cpus to use
NUM_THREADS = 32
#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*estimated_coverage
JF_SIZE = 3000000000
#set this to 1 to use SOAPdenovo contigging/scaffolding module. Assembly will be worse but will run faster. Useful for very large (>5Gbp) genomes from Illumina-only data
SOAP_ASSEMBLY=0
END
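Before digging into pe.cor.log, it may be worth ruling out that a read file named in the DATA section is missing or empty. What follows is a minimal Python sketch, not part of MaSuRCA. The function names `data_paths` and `file_problems` are hypothetical, and the parsing assumes one entry per line, with `PE=` carrying three leading fields (here `pe 515 13`) before the file paths, as in the config above.

```python
import os


def data_paths(config_lines):
    """Collect (key, path) pairs from the DATA ... END block of a config.

    Assumption: one entry per line, '#' starts a comment, and a PE= entry
    has three leading fields before its file paths.
    """
    paths = []
    in_data = False
    for raw in config_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line == "DATA":
            in_data = True
        elif line == "END":
            in_data = False
        elif in_data and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            tokens = value.split()
            if key == "PE":
                tokens = tokens[3:]  # skip the three leading fields
            paths.extend((key, token) for token in tokens)
    return paths


def file_problems(paths):
    """Return (key, path, reason) for every file that is missing or empty."""
    problems = []
    for key, path in paths:
        if not os.path.isfile(path):
            problems.append((key, path, "not found"))
        elif os.path.getsize(path) == 0:
            problems.append((key, path, "empty"))
    return problems
```

Calling `file_problems(data_paths(open("config.txt")))` on the machine that runs assemble.sh would list any path that needs attention; `config.txt` is a placeholder name for the config file.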

And here is the output I get when running assemble.sh:

[mar oct 2 11:53:45 CEST 2018] Processing pe library reads
awk: line ord.:1: fatal: division by zero attempted
[mar oct 2 11:53:45 CEST 2018] Average PE read length
Illegal division by zero at -e line 1.
[mar oct 2 11:53:45 CEST 2018] Using kmer size of for the graph
[mar oct 2 11:53:45 CEST 2018] MIN_Q_CHAR: 64
[mar oct 2 11:53:45 CEST 2018] Error correct PE
[mar oct 2 11:54:01 CEST 2018] Error correction of PE reads failed. Check pe.cor.log.
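Both division-by-zero messages appear around the step that reports the average PE read length, which would be consistent with zero reads being counted, although the thread does not confirm the cause. The following Python sketch is not MaSuRCA's own code. It recomputes the same kind of numbers (read count, mean read length, lowest quality character code) from a FASTQ file so they can be compared with the log. The name `fastq_stats` is hypothetical, and the sketch assumes strict four-line FASTQ records.

```python
def fastq_stats(lines):
    """Return (read_count, mean_read_length, min_quality_code).

    Assumption: strict 4-line FASTQ records (header, sequence, '+', quality).
    mean_read_length and min_quality_code are None when no reads are found,
    which avoids the division by zero seen in the log.
    """
    n_reads = 0
    total_len = 0
    min_q = None
    for i, raw in enumerate(lines):
        line = raw.rstrip("\r\n")
        position = i % 4
        if position == 1:  # sequence line
            n_reads += 1
            total_len += len(line)
        elif position == 3 and line:  # quality line
            lowest = min(line)
            if min_q is None or lowest < min_q:
                min_q = lowest
    mean_len = total_len / n_reads if n_reads else None
    return n_reads, mean_len, (None if min_q is None else ord(min_q))
```

For a plain-text file this can be called as `fastq_stats(open(path))`. A read count of 0 or a mean length of `None` would point at the input file rather than at the error-correction step itself.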

This is what my read files look like:

[root@clemen5 PacBio]# head ivia000_1.fastq
@HWI-ST459_0069:1:1:1263:1962#0/1
GGGGGGGGAGGGGAGGAGGGGAGGGGGGGGGGGTGGGGGTGAGTGGAGGANAGGAGGGGNGNGAATGAGGAGGTAAGGGGGGAGGTTGGGTGAGGGAAGC
+HWI-ST459_0069:1:1:1263:1962#0/1
_WQX_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-ST459_0069:1:1:1354:1977#0/1
GGAGGGGGGGGGGGGGGGGGCCGGGGGGGGGGCGGGGGGGGGGGCGAGGGNGGGGGGGGGGGGGGAGAGGTGGAGGGGGGGGGCAGGGGGTGAGGGGAGG
+HWI-ST459_0069:1:1:1354:1977#0/1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
[root@clemen5 PacBio]# head PACBIO_clem.fa
@m54221_171212_235526/4260368/0_10489
GTGAATGGAAAAAGGAGAATTTTCTTTCAGATATCGTACCATTCATTGAGATTTGATCTCGTCCTAACTGATAGCGATGGCCTCCCATTTTCATCCCGTTG
CTGAATAAGGACAGCTAACAAGTCCTCATCATGACATGAGCATCGTCTTGTTCTTCCTTTGTCTCCGTTGTTGTCAAACTCTCTCATCTATAATCGCATCA
TGATACTTGAGCAGTTCTCATAAGCGTCACTATAAATTTTTTTCAATGCCTTCCAAATCGAACACTCGCATCCAGGGAACATAATCGGATAGGCGAAC...

Any suggestions?

Thank you very much in advance for your help.

jterol avatar Oct 04 '18 09:10 jterol

You should try supplying the PacBio reads in FASTA format. Best,

-- Fuyou Fu, Ph.D. Department of Botany and Plant Pathology Purdue University USA

sunnycqcn avatar Oct 04 '18 12:10 sunnycqcn
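The reply above suggests FASTA for the PacBio reads, and the posted head of PACBIO_clem.fa starts with `@` and wraps the sequence over several lines, which looks like FASTQ despite the `.fa` extension. As a rough illustration of that conversion, here is a small Python sketch. The name `fastq_to_fasta` is hypothetical, and the sketch assumes each record has a `+` separator line and a quality string as long as the sequence. A dedicated sequence toolkit would normally be a better choice for real data.

```python
def fastq_to_fasta(lines):
    """Yield FASTA lines from FASTQ lines; wrapped sequences are joined.

    Assumption: every record is '@name', one or more sequence lines,
    a line starting with '+', then quality lines totalling the same
    length as the sequence.
    """
    it = iter(lines)
    for raw in it:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines between records
        if not line.startswith("@"):
            raise ValueError("expected '@' header, got: " + line[:30])
        name = line[1:]
        seq_parts = []
        for raw_seq in it:
            seq_line = raw_seq.rstrip("\r\n")
            if seq_line.startswith("+"):
                break
            seq_parts.append(seq_line)
        else:
            raise ValueError("no '+' separator for record " + name)
        seq = "".join(seq_parts)
        # Skip quality by length, since a quality line may itself start with '@'.
        seen = 0
        while seen < len(seq):
            qual = next(it, None)
            if qual is None:
                raise ValueError("truncated quality for record " + name)
            seen += len(qual.rstrip("\r\n"))
        yield ">" + name
        yield seq
```

Writing the result out is a matter of `out.write(line + "\n")` for each yielded line, with the new file then referenced from the `PACBIO=` entry.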

Dear jterol: have you solved your problem? I am trying to run MaSuRCA with only Illumina paired-end reads and hit the same problem. Is this problem associated with running out of memory?

lly1214 avatar Jan 17 '22 07:01 lly1214
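Whether memory is involved is not settled in this thread. The only sizing guidance it contains is the config comment that a safe JF_SIZE is estimated_genome_size*estimated_coverage. The arithmetic can be sketched in Python as below. The function names are hypothetical, and the genome size and read numbers in the usage note are made-up illustration values, not figures from this issue.

```python
def estimated_coverage(n_reads, mean_read_len, genome_size_bp):
    """Coverage = total sequenced bases / genome size."""
    return n_reads * mean_read_len / genome_size_bp


def jf_size(genome_size_bp, coverage):
    """JF_SIZE following the config comment:
    estimated_genome_size * estimated_coverage."""
    return int(genome_size_bp * coverage)
```

With hypothetical inputs of 30,000,000 reads of 100 bp and a 300 Mbp genome, `estimated_coverage` gives 10.0 and `jf_size(300_000_000, 10.0)` gives 3000000000.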
