QUAST storing sam, bam, sorted.bam

Open ghost opened this issue 5 years ago • 2 comments

Hello, I was wondering why QUAST keeps the sam, bam and sorted.bam files on disk at the same time. It takes a huge amount of disk space. I tried the --space-efficient option, but it still writes a sam, then a bam, and then a sorted.bam to disk. So basically the alignment is written to disk three times.

Here is my command

./quast-5.0.2/quast.py --eukaryote --large --circos --pe1 $R1 --pe2 $R2 --pacbio ../allPB.fa --nanopore ../allONTvaga.fa --threads 24 -o quast_report shasta_final.fa --space-efficient

thank you

EDIT: is it because --space-efficient is placed in the wrong position among the arguments? If so, sorry ><

ghost avatar Jan 21 '20 18:01 ghost

Actually, it's not a matter of where the argument is placed. I also noticed that it seems to use only half of the specified number of threads.

ghost avatar Jan 21 '20 23:01 ghost

I vote in support of this issue. The temporary storage required when analyzing raw reads appears excessive due to redundancy and may lead to most of the "No space left on device" errors. One example I ran into: I have a 12 Mbase genome and an assembly of the same size I would like to evaluate.

  • 3 GB of Nanopore reads (fastq.gz)
  • 16 GB of Illumina reads (fastq.gz)
  • The whole analysis directory is < 100 GB, including processed data and multiple assemblies.

The QUAST temporary folder maxed out at 500 GB (at which point the disk was full). It contained:

  • copies of the input reads (fastq unzipped)
  • .sam+bam files of all the alignments + the sorted .sam files

I think this problem could be addressed with relative ease by deleting intermediate files (e.g. deleting the sam files once the bam files have been created) or by piping the aligner output into samtools through a Unix pipe. From what I understand from the documentation, --space-efficient refers to RAM requirements, not disk.
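As a rough illustration of the pipe idea, here is a minimal sketch that aligns long reads and writes only the sorted BAM, so no intermediate sam or unsorted bam ever lands on disk. It assumes minimap2 and samtools are on the PATH. The function name align_to_sorted_bam and the file names in the usage comment are hypothetical, and this is not how QUAST currently invokes its aligners internally.

```shell
# Hypothetical helper (not part of QUAST): stream the aligner output
# straight into `samtools sort`, so the only alignment file written
# to disk is the final sorted BAM.
align_to_sorted_bam() {
    ref=$1            # reference / assembly FASTA
    reads=$2          # long reads, FASTA/FASTQ (may be gzipped)
    out=$3            # output path for the sorted BAM
    threads=${4:-4}   # thread count, defaults to 4

    # -a: emit SAM on stdout; -x map-ont: Nanopore preset
    # (an assumption here; use map-pb for PacBio CLR reads)
    minimap2 -a -x map-ont -t "$threads" "$ref" "$reads" \
        | samtools sort -@ "$threads" -o "$out" -
}

# Example call (file names are placeholders):
#   align_to_sorted_bam shasta_final.fa reads.fastq.gz aln.sorted.bam 24
```

minimap2 reads gzipped input directly, so the reads do not have to be unzipped into the temporary folder first. Note that samtools sort may still spill temporary chunks to disk for large inputs, but these are compressed and removed when sorting finishes.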

mdondrup avatar Feb 15 '24 09:02 mdondrup