
General feedback

Open nservant opened this issue 11 months ago • 21 comments

Hi @bentsherman

Thank you so much for sharing nf-boost, the cleanup functionality has been eagerly awaited by many users, including us :)

I ran a first test case and would like to share my results with you:

  1. First, I had no issues setting up the plugin.
  2. I tested the cleanup on the test profile of my variant-calling pipeline (sarek-like) with a very small dataset, and most of the time it ran without any issue. I only got one error, once:
##Command error:
##  Exception in thread "main" java.lang.RuntimeException: File not found 'D262E02_T_vs_D262E01_N_Mutect2_calls_norm_GnomAD_filtered_ICGC_CancerHotspots_COSMIC_dbNSFP.vcf.gz'
##          at org.snpeff.util.Gpr.reader(Gpr.java:501)
##          at org.snpeff.util.Gpr.reader(Gpr.java:484)
##          at org.snpeff.fileIterator.MarkerFileIterator.init(MarkerFileIterator.java:64)
##          at org.snpeff.fileIterator.FileIterator.<init>(FileIterator.java:39)
##          at org.snpeff.fileIterator.MarkerFileIterator.<init>(MarkerFileIterator.java:37)
##          at org.snpeff.fileIterator.VcfFileIterator.<init>(VcfFileIterator.java:82)
##          at org.snpsift.SnpSiftCmdExtractFields.run(SnpSiftCmdExtractFields.java:145)
##          at org.snpsift.SnpSiftCmdExtractFields.run(SnpSiftCmdExtractFields.java:122)
##          at org.snpsift.SnpSift.run(SnpSift.java:580)
##          at org.snpsift.SnpSift.main(SnpSift.java:76)

I think the intermediate VCF file was deleted before (or while) the next process started to use it. This error happened on an NFS-based system where I/O is somewhat limited. I'm wondering whether an additional parameter specifying "how long after last use an intermediate file should be deleted" would be nice, to avoid such issues.
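One way to probe whether this is really a premature deletion, rather than NFS attribute-cache lag making a freshly written file briefly invisible, is to poll for the input before the tool launches. A sketch (the `wait_for_file` helper is my own, not part of nf-boost or the pipeline):

```shell
# Poll for an input file before launching the consuming tool.
# If the file shows up after a few seconds, NFS lag was the cause;
# if it never appears, the file was really deleted too early.
wait_for_file() {
  local f=$1 tries=${2:-30}
  for _ in $(seq "$tries"); do
    [ -e "$f" ] && return 0
    sleep 1
  done
  return 1
}

# hypothetical usage inside a process script block:
# wait_for_file input.vcf.gz || { echo "input missing" >&2; exit 1; }
```

If the polling version stops failing, the problem is visibility lag rather than the cleanup logic itself.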

  3. Then, I ran a real test on a pair of WES samples (13.7 GB for each R1/R2 fastq file), and here I must say I was a bit disappointed, because the gain in work space was not that high.

Here are the main steps of the pipeline: mapping / BAM cleaning / GATK Mutect2 / CNV calling.

Here is a summary of the work directory size over time:

[plot: work directory size over time]

The work directory reaches 100 GB after all the mapping post-processing steps. However, I'm a bit surprised that it was not almost completely cleaned up by the end of the pipeline.

Looking more carefully at the BAM files, the pipeline generates:

  • BAM files after bwa-mem
  • BAM files after MarkDup
  • BAM files after cleaning (duplicates, mapQ, etc.)
  • BAM files after BQSR

In practice, only the BAM files after MarkDup were removed from the work directory.

>>find . -type f -name "*.bam"
./87/0315166cfca97935de47dbbcc990c4/BC2-0362-PXD_C_part1_hg38.bam
./45/619d03eaea4631cd6e3f6e05535376/BC2-0362-FIXT_T_part1_hg38.bam
./be/44f3273e49aa2c7fa999ae191325c7/BC2-0362-FIXT_T.filtered.bam
./09/6a3fe26f56ded56f5b1802a508b6b2/BC2-0362-PXD_C.filtered.bam
./c6/83094b4423e45d35729ed5518ad6d4/BC2-0362-FIXT_T.recal.bam
./df/e153ef0f160b440361f5576fdb449f/BC2-0362-PXD_C.recal.bam
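To quantify what cleanup left behind, the `find` above can be extended to count and total the residual BAMs. A self-contained demo (dummy 1 MiB files stand in for real outputs; in a real run, point the `find` at `./work`):

```shell
# Sum the disk usage of BAM files left behind in a work directory.
# The mktemp dir and dummy files below are illustrative only.
workdir=$(mktemp -d)
mkdir -p "$workdir/87/0315" "$workdir/c6/8309"
head -c 1048576 /dev/zero > "$workdir/87/0315/part1_hg38.bam"
head -c 1048576 /dev/zero > "$workdir/c6/8309/sample.recal.bam"

# how many BAMs remain, and their combined size
find "$workdir" -type f -name "*.bam" | wc -l
find "$workdir" -type f -name "*.bam" -exec du -ch {} + | tail -1
```

Running this periodically during a pipeline gives a per-stage picture of which intermediates the cleanup actually reclaimed.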

Could you tell me more about when a given file should be deleted by the system? In the coming days, I'll run additional tests on another server with less I/O latency to see whether it has an impact. But please let me know if you have any ideas or additional tests I could run to help. Thanks, Nicolas

nservant avatar Mar 27 '24 09:03 nservant