[1.2.0 - Ancient Destiny] Chunk FASTA reads for better parallelization of PacBio Revio data

Open DLBPointon opened this issue 1 year ago • 4 comments

Description of feature

Revio data is huge, so the reads need to be split into n = (reads / 10 million) files. Each file is then mapped separately and the mapping outputs are merged.

DLBPointon, Mar 20 '24 12:03

If the FASTA is larger than 10G, split the fasta.gz into N chunks, where N = round(size_of_fasta / 10) with the size in G: pyfasta split -n N {sample}.fasta.gz

yumisims, Mar 20 '24 13:03
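The two sizing rules proposed so far (one chunk per 10 million reads, or one chunk per 10G of FASTA) can be sketched as below. This is only an illustration: the function names are hypothetical, "10G" is read as 10 GiB, and the count is rounded up rather than to the nearest integer so that no chunk exceeds the target size. None of this is taken from the treeval code.

```python
import math

READS_PER_CHUNK = 10_000_000     # "reads / 10million" from the issue description
GIB = 1024 ** 3
SIZE_PER_CHUNK = 10 * GIB        # assumption: "10G" means 10 GiB


def chunks_by_reads(n_reads: int, reads_per_chunk: int = READS_PER_CHUNK) -> int:
    """Number of chunks so that each holds at most reads_per_chunk reads."""
    # Always return at least one chunk, even for a small or empty input.
    return max(1, math.ceil(n_reads / reads_per_chunk))


def chunks_by_size(size_bytes: int, size_per_chunk: int = SIZE_PER_CHUNK) -> int:
    """Number of chunks by file size; files at or under the limit are not split."""
    if size_bytes <= size_per_chunk:
        return 1
    return math.ceil(size_bytes / size_per_chunk)


if __name__ == "__main__":
    print(chunks_by_reads(25_000_000))   # 25 million reads
    print(chunks_by_size(25 * GIB))      # 25 GiB file
```

Either result could be passed as the chunk count to whichever splitting tool is chosen.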

@yumisims @DLBPointon. Maybe use https://nf-co.re/modules/seqkit_split2 ?

mcshane, Mar 20 '24 13:03

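Or just: zcat {sample}.fasta.gz | awk -v N={reads_per_chunk} '/^>/{n++} { print > ("chunk_" int((n-1)/N) ".fasta") }' where N here is the number of reads per chunk, not the number of chunks, and has to be passed in with -v. Let's see.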

yumisims, Mar 20 '24 13:03

seqkit split2 is multithreaded and will output gzipped chunks

mcshane, Mar 20 '24 13:03
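For comparison, the record-counting split suggested with awk above can be sketched in Python using only the standard library. The function name split_fasta and the chunk_N.fasta.gz naming are hypothetical, chosen to mirror the awk output names. It streams the input once and writes gzipped chunks, but compression is single-threaded, so for Revio-sized inputs a multithreaded tool such as seqkit split2 is likely the more practical choice.

```python
import gzip
from pathlib import Path


def split_fasta(fasta_gz, out_dir, reads_per_chunk):
    """Split a gzipped FASTA into gzipped chunks of at most reads_per_chunk records.

    Returns the list of chunk paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None
    n_records = 0
    try:
        with gzip.open(fasta_gz, "rt") as handle:
            for line in handle:
                if line.startswith(">"):
                    # Start a new chunk every reads_per_chunk headers.
                    if n_records % reads_per_chunk == 0:
                        if out is not None:
                            out.close()
                        index = n_records // reads_per_chunk
                        path = out_dir / f"chunk_{index}.fasta.gz"
                        paths.append(path)
                        out = gzip.open(path, "wt")
                    n_records += 1
                # Lines before the first header (if any) are skipped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths
```

Because a record is never split across files, each chunk can be mapped independently and the per-chunk outputs merged afterwards, as described in the issue.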