sarek
sarek copied to clipboard
Output files for CNVkit incorrect
Description of the bug
Hello, I am a newbie to nf-core sarek and I am running it on WES data for tumor-only, using CNVkit tool. Even though the pipeline completes successfully, the CNVkit tool directory in the variant_calling directory does not produce all the required outputs.
The output files generated in my case are:
- cnvkit.reference.antitarget-tmp.bed
- cnvkit.reference.target-tmp.bed
- tumor.antitargetcoverage.cnn
- tumor.targetcoverage.cnn
I have provided the command that I run.
Command used and terminal output
nextflow run nf-core-sarek-3.2.3/workflow --input design.csv --outdir output/ --genome GATK.GRCh38 -profile singularity --wes -c nfcore-RT-wes.config --tools cnvkit --intervals S03723314_Padded_hg38_Agi_v4_PDMR_3_column_only_primary.bed
Relevant files
No response
System information
- nf-core-sarek-3.2.3
- HPC
- singularity/3.7.0
- nextflow/23.04.3.
Hello, I have the same issue with a couple of missing cnvkit output files knowing that the pipeline has completed successfully. Currently, I am processing the cnvkit manually to end up with the cnvkit calls but it don't really know how to make things done by sarek. Any help?
Hi, Parth and achakroun!
I am quite sure that this issue you've encountered with is related to the bed file you provide for sarek. So, in a nutshell, provide for sarek a bed file with gene symbols. Hope this can help you to obtain full CNVkit output files.
we need more info. Can you please add the .nextflow.log, any custom config files, any parameter files, an overview of the published files, and what files are missing?
Uploading .nextflow.log…
Sure: Please find attachent .nextflow.log knowing that no custom config files were used. The command was as follow:
nextflow run nf-core/sarek -profile singularity -resume --max_cpus 10 --max_memory 40.GB --input ./EN00003548_HN00205087.csv --trim_fastq --genome GATK.GRCh38 --save_reference --outdir ./results/EN00003548_HN00205087 --wes --intervals /home/ahmed/NGS/Twist_HumanCoreExome-RefSeq_hg38/Twist_Exome_RefSeq_targets_hg38_200-pad.bed --tools deepvariant,haplotypecaller,freebayes,strelka,manta,tiddit,cnvkit
Output files: sample.antitargetcoverage.cnn sample.targetcoverage.cnn sample.cnr reference.cnn Twist_Exome_RefSeq_targets_hg38_200-pad.antitarget.bed Twist_Exome_RefSeq_targets_hg38_200-pad.target.bed
Missing files: sample-diagram.pdf sample-scatter.png sample.bintest.cns sample.cns sample.call.cns
The last one is the most important because it is just one step before vcf.
Thanks a lot
.nextflow.log I am not sure that the log was correctly uploaded so here it is again
Hello everyone, @mhjiang97 I added gene symbols to my interval list bed file and it didn't work. The pipeline completed successfully but expected cnvkit output files still missing. Attached the new .nextflow.log .nextflow.log
Hello,
I am thinking to use my own targets and reference files specifically with cnvkit (maybe it will resolve the issue of the missing outputs). However, as these are paths and not params, how can I pass them to the pipeline? I tried -params-file with the following yaml:
process { withName: CNVKIT_BATCH { ext.args = "targets /path/to/my_cnvkit_targets.bed" ext.args = "reference /path/to/my_reference.cnn" } }
and it ended up with:
Cannot parse params file: cnvkit-twist-GRCh38.yaml
Since I am not familiar with yaml, I am sure my params file is incorrectly set. I am really sorry.
can you try passing this to the sarek command line: --cnvkit_reference /path/to/my_reference.cnn
Ok, please what about the targets. I have tried:
--cnvkit_targets /path/to/my_cnvkit_targets.bed
and it ended up with:
WARN: The following invalid input values have been detected:
* --cnvkit_targets: /home/ahmed/NGS/Twist_HumanCoreExome-RefSeq_hg38/cnvkit_targets.bed
try with --intervals
cf: https://nf-co.re/sarek/3.3.2/parameters#intervals
Hello,
@maxulysse thanks a lot for the hint. It worked but I still need to tweak a bit using the cnvkit parameters. Please, could you let me know where can I find more information about the usage of cnvkit with sarek (e.g. where was the "--cnv_reference" argument described and is there other similar arguments we can use for cnvkit through sarek?)
Regards.
Check https://nf-co.re/sarek/dev/parameters, and be sure to check the show hidden on the right hand side to see all parameters.
Hello, Sorry for posting again but unfortunately I thought the issue was solved but it turns out that it doesn't. Actually, I still have the same missing files with cnvkit, meaning that the output files I have are as follow:
sample.antitargetcoverage.cnn
sample.targetcoverage.cnn
sample.cnr
reference.cnn
Twist_Exome_RefSeq_targets_hg38_200-pad.antitarget.bed
Twist_Exome_RefSeq_targets_hg38_200-pad.target.bed
And the following files are missing:
sample-diagram.pdf
sample-scatter.png
sample.bintest.cns
sample.cns
sample.call.cns
Here is the command I run:
nextflow run nf-core/sarek \
-profile singularity \
-resume \
--wes \
--intervals cnvkit_targets.bed \
--step variant_calling \
--input ./015-24.csv \
--genome GATK.GRCh38 \
--outdir ./results/015-24_testing \
--tools cnvkit \
--cnvkit_reference my_reference.cnn
When I look into .command.sh corresponding to the process, it looks like the my_reference.cnn I am passing through --cnvkit_reference arg wasn't taken into account!
#!/bin/bash -euo pipefail
samtools view -T Homo_sapiens_assembly38.fasta --fai-reference Homo_sapiens_assembly38.fasta.fai 015-24.recal.cram -@ 2 -o 015-24.recal.bam
samtools index 015-24.recal.bam
cnvkit.py \
batch \
015-24.recal.bam \
--normal \
--fasta Homo_sapiens_assembly38.fasta \
\
--targets cnvkit_targets.bed \
--processes 2 \
--method hybrid --diagram --scatter
cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_VARIANT_CALLING_CNVKIT:CNVKIT_BATCH":
samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
cnvkit: $(cnvkit.py version | sed -e 's/cnvkit v//g')
END_VERSIONS
Also, when I run the cnvkit.py batch command described in the .command.sh manually, I end up with the same missing files and the following warning in the STDOUT:
WARNING: Most antitarget bins (100.00%, 39620/39620) have low or no coverage; is this amplicon/WGS?
Antitargets are nan x more variable than targets
Do you have any suggestions? Thanks. A
Hey! Good news we are working on it @lescai has put in some time to clean this up and fix it :)
Thanks a lot for the feedback @FriederikeHanssen.
As I am already using cnvkit "manually", may be I can share the tools sequence that works perfectly for my germline wes-based cnv analysis, if you think it could be helpful. FYI, cnvkit batch
doesn't work correctly in my case.
Hi @achakroun
would you be able to share the log file you find in the work folder of your cnvkit batch
process?
I do have all the files you mentioned using sarek as is,
mysample.converted-diagram.pdf
mysample.converted-scatter.png
mysample.converted.antitargetcoverage.cnn
mysample.converted.bam
mysample.converted.bam.bai
mysample.converted.bintest.cns
mysample.converted.call.cns
mysample.converted.cnr
mysample.converted.cns
mysample.converted.cram -> [..]]
mysample.converted.targetcoverage.cnn
Homo_sapiens_assembly38.fasta -> [..]]
Homo_sapiens_assembly38.fasta.fai -> [..]]
reference.cnn
mytarget.antitarget.bed
mytarget.bed -> [..]]
mytarget.target.bed
versions.yml
my .command.sh
looks exactly like the one you referenced above.
so I'd need to have a look at the actual process execution. would be great if you could also share your .command.log
content to make sure?
cheers!
Hi @lescai
Please, find attached the .command.log from the cnvkit batch
work folder and the .nextflow.log of the corresponding run:
Please, are you using your own intervals and cnn reference with cnvkit through the sarek argument --intervals
and --cnvkit_reference
? If so, is it OK that the cnvkit reference is not mentioned in .command.sh?
FYI, I am processing exclusively WES data for germline "normal" samples.
Please, do not hesitate to ask me for any further details. Looking forward to your feedback. Regards.
Thanks! I'll have a look and compare as soon as possible.
the cnvkit_reference
is precisely the issue I was meant to address in this fix, as in the current version the germline analysis does not use it.
Actually, the workflow that works for us perfectly when using the standalone cnvkit is as follow:
1- cnvkit coverage
with our *_targets.bed
2- cnvkit coverage
with our *_antitargets.bed
3- cnvkit fix
using our reference.cnn
4- cnvkit segment
5- cnvkit call
with method set to cbs (which is default)
6- cnvkit export vcf
For the files *_targets.bed, *_antitargets.bed and reference.cnn, they are prepared once for all using the exome capture kit padded intervals as set by the manufacturer and a set (N=~60) of assumed "normal" samples as reference
Cheers.