
Does not detect samtools

Taekya opened this issue 3 years ago · 7 comments

Hello, I loaded the docker image and installed samtools using conda, and it works well, like this:

[screenshot: samtools running in the conda environment]

but when I run the superFreq script, samtools is not found: [screenshot of the superFreq warning]

PS: in the end it throws an error ... is it impossible to run featureCounts in paired-end mode? [screenshot of the featureCounts error]

Thank you!

Taekya · Feb 24 '21

SuperFreq needs samtools to be available through a system("samtools --version") call from R, and from the warning that doesn't seem to be the case. Docker and conda are out of my expertise though, so I can't support you more than that, sorry.
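
For reference, a minimal way to reproduce that check from the same R session that will run superFreq (the Sys.which() line is just an extra sanity check and not part of superFreq's own test):

# check that R can see samtools on its PATH; superFreq relies on a
# system("samtools --version") call like the second line below
Sys.which("samtools")                   # should print the path to the samtools binary
status = system("samtools --version")   # prints the samtools version; status should be 0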

If you sort out the samtools issue and still get the featureCounts error, you can send me your logfile (Rdirectory/myIndividual/runtimeTracking.log) and maybe I can guess what is going on. It might be related to the Bioconductor version warning though, so maybe start by updating all that.

ChristofferFlensburg · Feb 24 '21

Thank you for the reply. I exited from docker, created a new conda environment, and installed R v4.0.3 and samtools v1.9, which solved the samtools-related error, but it still doesn't work.

2021-02-24 19:46:04
Current memory use (Mb): 630, max use (Mb): 630
Starting differential coverage analysis by sample.
Preparing capture regions for featureCounts..done.
Counting reads over capture regions.

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 2.4.2

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                                                                            ||
||                           SV-OV-T107-P-DP-NoU_M-ready.bam                  ||
||                                                                            ||
||               Paired-end : no                                              ||
||        Count read pairs : no                                               ||
||              Annotation : R data.frame                                     ||
||      Dir for temp files : .                                                ||
||                 Threads : 24                                               ||
||                   Level : feature level                                    ||
||      Multimapping reads : counted                                          ||
|| Multi-overlapping reads : counted                                          ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid18233 ...         ||
||    Features : 309579                                                       ||
||    Meta-features : 26103                                                   ||
||    Chromosomes/contigs : 24                                                ||
||                                                                            ||
|| Process BAM file SV-OV-T107-P-DP-NoU_M-ready.bam...                        ||
ERROR: Paired-end reads were detected in single-end read library : /storm_KDNA/Analysis/HRD/21T710_SV-OV/GSS/final/SV-OV-T107-P-DP-NoU_M/SV-OV-T107-P-DP-NoU_M-ready.bam
No counts were generated.
Error in .stop_quietly() :
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Error in featureCounts.
Input was
bamFiles: /storm_KDNA/Analysis/HRD/21T710_SV-OV/GSS/final/SV-OV-T107-P-DP-NoU_M/SV-OV-T107-P-DP-NoU_M-ready.bam
captureAnnotation[1:10,]: ? WASH7P WASH7P FAM138A ? OR4G4P ? ? ? ENST00000466430     1 10000 20000 30000 40000 50000 60000 70000 80000 90000  10000  20000  30000  40000  50000  60000  70000  80000  90000 100000 * * * * * * * * * * 1 1 1 1 1 1 1 1 1 1
Error in runDE(bamFiles, sampleNames, externalNormalCoverageBams, captureRegions,  :
  Error in featureCounts.

My data is paired-end, but the featureCounts settings show "Paired-end : no"; I think this is the cause of the problem.

Below is the Rdirectory/myIndividual/runtimeTracking.log

2021-02-24 19:59:48
######################################################################
Running superFreq version 1.4.2
Testing samtools...
Found samtools 1.9 . Seems ok.
Starting run with input files:
sampleMetaDataFile: /storm_KDNA/Analysis/HRD/SV-OV-1to12-GSS/splitMetaData/SV-OV-T107-P-DP-NoU_M.tsv
vcfFiles:

Normal directory: /storm_KDNA/Analysis/HRD/AllData/bam-normal
Normal coverage directory: /storm_KDNA/Analysis/HRD/AllData/bam-normal
dbSNP directory: superFreqResources/dbSNP
capture regions: will be downloaded from superFreq server.
Plotting to /storm_KDNA/Analysis/HRD/SV-OV-1to12-GSS/Rlib/SV-OV-T107-P-DP-NoU_M
Saving R files to /storm_KDNA/Analysis/HRD/SV-OV-1to12-GSS/plot/SV-OV-T107-P-DP-NoU_M
Genome is hg19
Running in  genome  mode.
exacPopulation is all
Running on at most 24 cpus.
Rare germline variants are shown in output.

Parameters for this run are:
   maxCov:              150
   systematicVariance:  0.03
   cloneDistanceCut:  2.326348
   cosmicSalvageRate:  0.001

Normal bamfiles are:
/storm_KDNA/Analysis/HRD/21T712_SV-OV/GSS/work/align/21DGS06_20210118-971-2103_LMK-DB_M/21DGS06_20210118-971-2103_LMK-DB_M-sort.bam
/storm_KDNA/Analysis/HRD/21T712_SV-OV/GSS/work/align/21DGSval02-1-20181221-911-3001-18DES32-7-KDB-DB_M/21DGSval02-1-20181221-911-3001-18DES32-7-KDB-DB_M-sort.bam
/storm_KDNA/Analysis/HRD/21T712_SV-OV/GSS/work/align/21DGSval02-2-20201023-171-5024-20IEM017-01-PJS-DB_M/21DGSval02-2-20201023-171-5024-20IEM017-01-PJS-DB_M-sort.bam
/storm_KDNA/Analysis/HRD/21T712_SV-OV/GSS/work/align/21DGSval02-3-20200916-171-5104-20MA1502-BSA-DB_M/21DGSval02-3-20200916-171-5104-20MA1502-BSA-DB_M-sort.bam
/storm_KDNA/Analysis/HRD/21T712_SV-OV/GSS/work/align/21DGSval02-4-20201109-941-1901--H0J-DB_M/21DGSval02-4-20201109-941-1901--H0J-DB_M-sort.bam
/storm/Analysis/Germline/DGS/20WGS_validation-D/WGS/final/JSP-DES_M/JSP-DES_M-ready.bam
/storm/Analysis/HRD/SV-OV-2nd-11Samples/bam_normal-30x/MSJ-DES_M-ready.bam
/storm/Analysis/Germline/DGS/20WGS_validation/WGS/final/NA12878_M/NA12878_M-ready.bam
/storm/Analysis/Germline/DGS/20WGS_validation/WGS/final/NA12891_M/NA12891_M-ready.bam
/storm/Analysis/Germline/DGS/20WGS_validation-B/WGS/final/NA12892_M/NA12892_M-ready.bam
/storm/Analysis/Germline/DGS/20WGS001/WGS/final/WGS-Normal/WGS-Normal-ready.bam
Normal bamfiles are:
/storm_KDNA/Analysis/HRD/21T712_SV-OV/GSS/work/align/21DGS06_20210118-971-2103_LMK-DB_M/21DGS06_20210118-971-2103_LMK-DB_M-sort.bam
/storm_KDNA/Analysis/HRD/21T712_SV-OV/GSS/work/align/21DGSval02-1-20181221-911-3001-18DES32-7-KDB-DB_M/21DGSval02-1-20181221-911-3001-18DES32-7-KDB-DB_M-sort.bam
/storm_KDNA/Analysis/HRD/21T712_SV-OV/GSS/work/align/21DGSval02-2-20201023-171-5024-20IEM017-01-PJS-DB_M/21DGSval02-2-20201023-171-5024-20IEM017-01-PJS-DB_M-sort.bam
/storm_KDNA/Analysis/HRD/21T712_SV-OV/GSS/work/align/21DGSval02-3-20200916-171-5104-20MA1502-BSA-DB_M/21DGSval02-3-20200916-171-5104-20MA1502-BSA-DB_M-sort.bam
/storm_KDNA/Analysis/HRD/21T712_SV-OV/GSS/work/align/21DGSval02-4-20201109-941-1901--H0J-DB_M/21DGSval02-4-20201109-941-1901--H0J-DB_M-sort.bam
/storm/Analysis/Germline/DGS/20WGS_validation-D/WGS/final/JSP-DES_M/JSP-DES_M-ready.bam
/storm/Analysis/HRD/SV-OV-2nd-11Samples/bam_normal-30x/MSJ-DES_M-ready.bam
/storm/Analysis/Germline/DGS/20WGS_validation/WGS/final/NA12878_M/NA12878_M-ready.bam
/storm/Analysis/Germline/DGS/20WGS_validation/WGS/final/NA12891_M/NA12891_M-ready.bam
/storm/Analysis/Germline/DGS/20WGS_validation-B/WGS/final/NA12892_M/NA12892_M-ready.bam
/storm/Analysis/Germline/DGS/20WGS001/WGS/final/WGS-Normal/WGS-Normal-ready.bam
Loading capture regions..done.
Imported capture regions with 309579 regions and 26103 unique gene names.
Mean GC content is 0.302.
Loading sample meta data from file...done.
Deciding which pairs to scatter plot..done.
Deciding which time series to plot..done.
##################################################################################################

 2021-02-24 19:59:49
 Imported and sanity checked meta data. Looking good so far!
 metadata:
   BAM   VCF   INDIVIDUAL   NAME   TIMEPOINT   NORMAL
   /storm_KDNA/Analysis/HRD/21T710_SV-OV/GSS/final/SV-OV-T107-P-DP-NoU_M/SV-OV-T107-P-DP-NoU_M-ready.bam   /storm_KDNA/Analysis/HRD/SV-OV-1to12-GSS/vcf/SV-OV-T107-P-DP-NoU_M-ready.vcf   SV-OV-T107-P-DP-NoU_M   cancer.96   relapse   NO

 timeSeries:
   cancer.96

##################################################################################################

2021-02-24 19:59:50
Current memory use (Mb): 629.1, max use (Mb): 629.1
Starting differential coverage analysis by sample.
Preparing capture regions for featureCounts..done.
Counting reads over capture regions.
Error in featureCounts.
Input was
bamFiles: /storm_KDNA/Analysis/HRD/21T710_SV-OV/GSS/final/SV-OV-T107-P-DP-NoU_M/SV-OV-T107-P-DP-NoU_M-ready.bam
captureAnnotation[1:10,]: ? WASH7P WASH7P FAM138A ? OR4G4P ? ? ? ENST00000466430     1 10000 20000 30000 40000 50000 60000 70000 80000 90000  10000  20000  30000  40000  50000  60000  70000  80000  90000 100000 * * * * * * * * * * 1 1 1 1 1 1 1 1 1 1

Thank you.

Taekya · Feb 24 '21

Hey!

superFreq tells featurecounts not to use the paired-end information when running in genome mode, because featurecounts re-sorts the bam file to put the pairs next to each other. This is reasonable for RNA-Seq, which the package is intended for, but prohibitively slow for genomes (re-sorting typically takes longer than the rest of the analysis).

So it is totally intended that featurecounts gets a paired-end bam file but is told to run it as single-end. It is supposed to be ok with that, but I think they broke it in the latest versions, as I think we had the same problem in a different issue: https://github.com/ChristofferFlensburg/superFreq/issues/67#issuecomment-758769022

The workaround, it seems, is to backdate Rsubread to an earlier version (sorry)... I can't do much I think, as it's an issue with Rsubread. :/
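
One possible way to backdate is to pin the whole library to an earlier Bioconductor release. A rough sketch; the release number (3.11) is an assumption, not a recommendation from this thread, so pick whichever release shipped an Rsubread version that worked for you:

# sketch: install Rsubread from an older Bioconductor release
# (the 3.11 release number is an assumption, adjust as needed)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.11")   # switch the library to that Bioconductor release
BiocManager::install("Rsubread")         # installs the Rsubread matching that release
packageVersion("Rsubread")               # confirm what got installed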

ChristofferFlensburg · Feb 25 '21

Ok, I had a closer look at the latest Rsubread version, and they have indeed changed the syntax for telling it to count raw reads (not fragments) in PE samples. That is most likely the cause of the bug. I made a change to superFreq that reads out the Rsubread version and makes the appropriate call. I'm running tests and will push online, but it will (hopefully) be next week. You can either backdate Rsubread if you're in a hurry, or wait for me to finish the testing and push the changes online.
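
As an illustration only (this is not superFreq's actual code), the kind of version dispatch described above might look roughly like the sketch below. isPairedEnd and countReadPairs are featureCounts() arguments in recent Rsubread, but the 2.4.0 cutoff used here is an assumption:

library(Rsubread)

# count raw reads (not fragments) over capture regions for a paired-end bam,
# choosing the argument combination based on the installed Rsubread version
countRawReads = function(bamFile, captureAnnotation, cpus) {
  if (packageVersion("Rsubread") >= "2.4.0") {
    # newer Rsubread: declare the data as paired-end, but count individual reads
    featureCounts(bamFile, annot.ext = captureAnnotation,
                  isPairedEnd = TRUE, countReadPairs = FALSE, nthreads = cpus)
  } else {
    # older Rsubread: running the paired-end bam as single-end counted raw reads
    featureCounts(bamFile, annot.ext = captureAnnotation,
                  isPairedEnd = FALSE, nthreads = cpus)
  }
}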

ChristofferFlensburg · Feb 26 '21

Ok, I made superFreq check the version of Rsubread and use the appropriate syntax... It should hopefully work for all versions now. Update superFreq and try again, and let me know how it goes.
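
To pick up the fix, reinstalling superFreq from GitHub and checking the version should be enough (a sketch, assuming the usual devtools-based install):

# reinstall superFreq from GitHub and confirm the installed version
devtools::install_github("ChristofferFlensburg/superFreq")
packageVersion("superFreq")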

ChristofferFlensburg · Mar 01 '21

Thank you, it works! But when I run the SuperFreq script using 64 cpus and 320G (64 * 5G) of memory on a slurm HPC cluster, I get error messages like this:

slurmstepd-Robin8: error: Job 1430851 exceeded memory limit (391704704 >
335544320), being killed
slurmstepd-Robin8: error: Exceeded job memory limit
slurmstepd-Robin8: error: *** JOB 1430851 ON Robin8 CANCELLED AT
2021-03-04T13:38:45 ***

Nonetheless, I succeeded in running the SuperFreq script using 192 cpus and 960G (192 * 5G) of memory, but that is not a workable approach when multiple jobs are queued.

So what I want is to run the SuperFreq script with 24~32 cpus and ((24~32) * 5G) of memory. Is that possible?

Below is my run script:

library(superFreq)
args=commandArgs(TRUE)
participant = as.character(args[1]) ## 1 sample
cpus = as.numeric(args[2]) 
metadata= as.character(args[3])  ## 98 samples
bamnormal=as.character(args[4])  ## 7 samples
Rdir=as.character(args[5])
plotdir=as.character(args[6])
fa=as.character(args[7])  ## hg19.fa
gn=as.character(args[8])  ## genome

superFreq(
          metaDataFile = metadata,  #all individuals in one metaData file.
          genome=gn,
          normalDirectory=bamnormal,
          Rdirectory=Rdir,
          plotDirectory=plotdir,  #each individual will get a subdirectory in here
          reference=fa,
          mode='genome',
          cpus=cpus,
          participants=participant #,  #run only on this individual
          )

#runSummaryPostAnalysis(metaDataFile=metadata,              ## <<--- This function does not work.
#           Rdirectory=Rdir,
#           plotDirectory=plotdir,
#           cpus=4,
#           genome=gn)

I attached the runtimeTracking.log for the job that failed on memory. Thank you for your help, you made my day. runtimeTracking.log

Taekya · Mar 04 '21

Hey!

Looks good! Genomes are pushing the limits for superFreq, so you have to be quite careful with resource settings. We run a 30x cancer-normal human genome start to end in ~24h with 2 reference normals. As you have more reference normals, it might be a bit slower, and deeper sequencing also slows things down.

Yeah, for genomes I think 10-20 cpus is good. It depends on the data a bit, but you'll run into diminishing returns above that, while it keeps eating your memory. Better to err on the side of too small for the cpu parameter. There are regular savepoints, so even if the job runs out of walltime and you have to resume, chances are that most of the runtime will be recovered.

Memory also scales with the number of reference normals, and while 7 will give you great results, genomes are not very variable, so I think you might be able to get away with fewer (down to 3 or even 2 if all samples are of good quality) with very little loss in accuracy. If you do decide to change the number of reference normals, you want to set up a new reference directory and rerun superFreq from the start, or the saved results from the larger set of references are likely to mess up the analysis.

Another trick is to first run only one analysis. All potential variants are cross-checked against the reference normals, and these checks are cached for future runs. As many germline variants and false positives are shared between samples, it might save some time to run one sample first just to fill up the cache a bit, so that later runs can re-use it. Also, you probably want to do a test run first anyway.
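
With the run script posted above, that trick is just a matter of which participant you submit first. A rough sketch, where firstIndividual is a hypothetical name for whichever sample you pick for the warm-up run (the other argument names are reused from the script earlier in this thread):

# warm-up run: one individual to completion, filling the cache of
# cross-checks against the reference normals
superFreq(metaDataFile = metadata, genome = gn, normalDirectory = bamnormal,
          Rdirectory = Rdir, plotDirectory = plotdir, reference = fa,
          mode = 'genome', cpus = cpus, participants = firstIndividual)
# the remaining individuals can then be submitted (one job each) and will
# re-use the cached results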

The runSummaryAnalysis() should only be run once for the entire cohort, and only after all the superFreq runs have completed. It aggregates the results of all the runs and does some cohort-wide analyses.

Seems like an awesome data set, 98 genomes! Let me know how it goes! :)

ChristofferFlensburg · Mar 07 '21