
racon_wrapper

Open mictadlo opened this issue 6 years ago • 13 comments

Hi, I am running racon with Illumina paired-end reads as described here, but it needs around 1 Tb of memory. I see you also provide racon_wrapper, and I wonder how to determine its additional parameters:

    --split <int>
        split target sequences into chunks of desired size in bytes
    --subsample <int> <int>
        subsample sequences to desired coverage (2nd argument) given the
        reference length (1st argument)

Does racon_wrapper run the chunks in sequence or in parallel, and how can memory be saved?

Additionally, I found a snakemake pipeline for Racon here.

Thank you in advance,

Michal

mictadlo avatar Mar 26 '19 13:03 mictadlo

Hi Michal, if you want to decrease memory usage you can use --split <longest contig length>, --subsample <reference length> 50 (or a lower coverage), or both. The chunks obtained by splitting are run in sequence.
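For example (file names, thread count, and values are placeholders, just for illustration), combining both options would look something like:

    racon_wrapper <reads.fastq> <overlaps.paf> <assembly.fasta> -t <threads> --split <longest contig length> --subsample <reference length> 50 > <polished.fasta>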

Best regards, Robert

rvaser avatar Mar 26 '19 22:03 rvaser

I'm having a similar issue: I'm on a machine with 1024 Gb RAM and 48 CPUs available - and input files that are

    Reads fastq: 369,668,281,386
    Bam: 573,985,197,116
    Fasta: 1,649,234,789

= 945 Gb in size.

As I understand it, Racon memory requirements can be estimated as the sum of the sizes of the input files plus some overhead. Depending on the overhead, I'm guessing I would be under the RAM maximum by ~50 Gb...? However, I think things crashed due to memory limitations.

I then used racon_wrapper with --split set to 1.1 * the longest contig length to reduce memory requirements, but it still crashed - again due to memory limits, I think. In both cases I watched the memory usage creep up until it crashed.

racon_wrapper illumina-paradoxus.fq minimap2-illumina_X_flye-ont-polished.sam flye-assembly_racon-ont-polish.fasta -t 45 --split 8483560 > flye-assembly_racon-illumina-polish.fasta

[RaconWrapper::run] preparing data with rampler
[RaconWrapper::run] total number of splits: 294
[RaconWrapper::run] processing data with racon
[racon::Polisher::initialize] loaded target sequences 0.029969 s
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

I can next try --subsample and was wondering how to estimate a reference length to use.

Or if you have any other suggestions.

Thank you !

000generic avatar Apr 18 '20 19:04 000generic

Hello Eric, the overhead of storing SGS sequences is high with respect to the sequence file, so the total amount of memory needed is roughly 1.5 * sequence file + all other files. To decrease memory, you can use the PAF format instead of SAM and let Racon align the sequences on the go. Alternatively, you can subsample your dataset given the assembly size.
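As a rough illustration with the file sizes you listed above (treating the fastq as the sequence file):

    1.5 * 370 Gb (reads) + 574 Gb (alignments) + 1.6 Gb (assembly) ≈ 1130 Gb

which is already over the 1024 Gb available before any working memory is counted; a PAF file is typically a small fraction of the equivalent SAM, so switching formats (and/or subsampling) should bring the estimate under the limit.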

Best regards, Robert

rvaser avatar Apr 19 '20 15:04 rvaser

Great - I will give the PAF format a try!

Regarding subsample - do you mean set the subsample reference length to the assembly length? So in my case to subsample at coverage 50 :

--subsample 1,649,234,789 50

Thank you!

000generic avatar Apr 19 '20 15:04 000generic

Yes, but leave out the commas: --subsample 1649234789 50.
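With your files, the full command would then look something along these lines (file names are placeholders):

    racon_wrapper <reads.fq> <overlaps.paf> <assembly.fasta> -t 45 --subsample 1649234789 50 > <polished.fasta>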

rvaser avatar Apr 19 '20 15:04 rvaser

Awesome - thanks again! Will try both - first PAF - and then if still needed, subsample.

000generic avatar Apr 19 '20 16:04 000generic

Still no luck!

There is 1 Tb of RAM. My reads are 370 Gb, my fasta is 2.5 Gb, and my paf file is 211 Gb.

For minimap2, I mapped the reads to a racon ONT-polished assembly and supplied the Illumina paired-end reads as separate files. For racon, I cat'd the two read files together.
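Roughly, a sketch of those two steps (paths simplified, and assuming minimap2's short-read preset -x sr; PAF is minimap2's default output):

    # map the paired Illumina reads against the ONT-polished assembly
    minimap2 -x sr -t 45 assembly_racon-ont-polished.fasta illumina-paradoxus-1.fq illumina-paradoxus-2.fq > minimap2-reads-x-ont-polished-fasta.paf
    # concatenate the two read files into the single file given to racon_wrapper
    cat illumina-paradoxus-1.fq illumina-paradoxus-2.fq > reads.fq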

I supplied the racon_wrapper --split flag with 1.1*longest read length.

I created a bash file that contains: racon_wrapper reads.fq minimap2-reads-x-ont-polished-fasta.paf assembly_racon-ont-polished.fasta -t 45 --split 8483561

I ran things as:

bash bash-file > assembly_racon-illumina-polished.fasta &

and got

[1] 15967
(base) ::racon-pilon:
[RaconWrapper::run] preparing data with rampler
[RaconWrapper::run] total number of splits: 294
[RaconWrapper::run] processing data with racon
[racon::Polisher::initialize] loaded target sequences 0.031758 s
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

[1]+ Exit 1 bash bash-file > assembly_racon-illumina-polished.fasta

I wasn't watching past 45% (and growing) memory usage, but I'm assuming the memory was used up and this led to 'std::bad_alloc'.

Any idea what is going on? I'll try again with --subsample, but I was hoping to avoid it, as you seem to suggest elsewhere that it may cause things to underperform.

000generic avatar Apr 26 '20 22:04 000generic

It worked using paf and --subsample!

I ran --subsample with coverage 50 - does this seem reasonable / do you have a sense of what works well in general? Also, do you feel it generally produces suboptimal polishing? Would it be a good idea to do 2 rounds of polishing when using --subsample? Or is it worth continuing to work things out for --split?

Thank you!

000generic avatar Apr 27 '20 02:04 000generic

Can you please run head -n 1 <first.fastq> <second.fastq>? I want to see if everything worked as intended. Also, by 1.1*longest read length did you mean the longest contig length? I guess that one iteration should suffice, but you can check BUSCO scores and maybe run a second iteration (if it does not take too much time) and see if it helps.
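For example, a before/after comparison could look something like this (just a sketch; the lineage dataset is a placeholder and depends on your organism):

    busco -i assembly_before_polish.fasta -m genome -l <lineage>_odb10 -o busco_before
    busco -i assembly_after_polish.fasta -m genome -l <lineage>_odb10 -o busco_after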

rvaser avatar Apr 27 '20 07:04 rvaser

Here is head on the read files sent to minimap2:

(base) ::racon-pilon: head -n 1 ../../../reads/illumina-paradoxus-1.fq
@FCD05W8ACXX:6:1101:1703:1995#CGGGAGGT/1
(base) ::racon-pilon: head -n 1 ../../../reads/illumina-paradoxus-2.fq
@FCD05W8ACXX:6:1101:1703:1995#CGGGAGGT/2

My mistake - I meant 1.1*longest contig - not read!

I'll run a second round of polishing and then BUSCO everyone and see how things look.

Thank you!

000generic avatar Apr 27 '20 16:04 000generic

Everything looks fine. The only thing that bothers me is whether random subsampling of short reads will work as well as it does for long reads.

rvaser avatar Apr 27 '20 17:04 rvaser

I wonder about the effects of subsampling also - but I still haven't been able to sort out why --split isn't resolving what I think is a RAM issue, despite the data being within the predicted RAM limits when I use PAF.

000generic avatar Apr 27 '20 17:04 000generic

Not sure either. We will probably have to overhaul Racon and perhaps reimplement some parts.

rvaser avatar Apr 27 '20 17:04 rvaser