sarek icon indicating copy to clipboard operation
sarek copied to clipboard

Output files for CNVkit incorrect

Open parth2608 opened this issue 11 months ago • 19 comments

Description of the bug

Hello, I am a newbie to nf-core sarek and I am running it on WES data for tumor-only, using CNVkit tool. Even though the pipeline completes successfully, the CNVkit tool directory in the variant_calling directory does not produce all the required outputs.

The output files generated in my case are:

  • cnvkit.reference.antitarget-tmp.bed
  • cnvkit.reference.target-tmp.bed
  • tumor.antitargetcoverage.cnn
  • tumor.targetcoverage.cnn

I have provided the command that I run.

Command used and terminal output

nextflow run nf-core-sarek-3.2.3/workflow --input design.csv --outdir output/ --genome GATK.GRCh38 -profile singularity --wes -c nfcore-RT-wes.config --tools cnvkit --intervals S03723314_Padded_hg38_Agi_v4_PDMR_3_column_only_primary.bed

Relevant files

No response

System information

  • nf-core-sarek-3.2.3
  • HPC
  • singularity/3.7.0
  • nextflow/23.04.3.

parth2608 avatar Sep 21 '23 20:09 parth2608

Hello, I have the same issue with a couple of missing cnvkit output files knowing that the pipeline has completed successfully. Currently, I am processing the cnvkit manually to end up with the cnvkit calls but it don't really know how to make things done by sarek. Any help?

achakroun avatar Nov 08 '23 06:11 achakroun

Hi, Parth and achakroun!

I am quite sure that this issue you've encountered with is related to the bed file you provide for sarek. So, in a nutshell, provide for sarek a bed file with gene symbols. Hope this can help you to obtain full CNVkit output files.

mhjiang97 avatar Nov 08 '23 07:11 mhjiang97

we need more info. Can you please add the .nextflow.log, any custom config files, any parameter files, an overview of the published files, and what files are missing?

FriederikeHanssen avatar Nov 08 '23 08:11 FriederikeHanssen

Uploading .nextflow.log…

Sure: Please find attachent .nextflow.log knowing that no custom config files were used. The command was as follow:

nextflow run nf-core/sarek -profile singularity -resume --max_cpus 10 --max_memory 40.GB --input ./EN00003548_HN00205087.csv --trim_fastq --genome GATK.GRCh38 --save_reference --outdir ./results/EN00003548_HN00205087 --wes --intervals /home/ahmed/NGS/Twist_HumanCoreExome-RefSeq_hg38/Twist_Exome_RefSeq_targets_hg38_200-pad.bed --tools deepvariant,haplotypecaller,freebayes,strelka,manta,tiddit,cnvkit

Output files: sample.antitargetcoverage.cnn sample.targetcoverage.cnn sample.cnr reference.cnn Twist_Exome_RefSeq_targets_hg38_200-pad.antitarget.bed Twist_Exome_RefSeq_targets_hg38_200-pad.target.bed

Missing files: sample-diagram.pdf sample-scatter.png sample.bintest.cns sample.cns sample.call.cns

The last one is the most important because it is just one step before vcf.

Thanks a lot

achakroun avatar Nov 08 '23 08:11 achakroun

.nextflow.log I am not sure that the log was correctly uploaded so here it is again

achakroun avatar Nov 08 '23 09:11 achakroun

Hello everyone, @mhjiang97 I added gene symbols to my interval list bed file and it didn't work. The pipeline completed successfully but expected cnvkit output files still missing. Attached the new .nextflow.log .nextflow.log

achakroun avatar Nov 09 '23 06:11 achakroun

Hello,

I am thinking to use my own targets and reference files specifically with cnvkit (maybe it will resolve the issue of the missing outputs). However, as these are paths and not params, how can I pass them to the pipeline? I tried -params-file with the following yaml:

process { withName: CNVKIT_BATCH { ext.args = "targets /path/to/my_cnvkit_targets.bed" ext.args = "reference /path/to/my_reference.cnn" } }

and it ended up with:

Cannot parse params file: cnvkit-twist-GRCh38.yaml

Since I am not familiar with yaml, I am sure my params file is incorrectly set. I am really sorry.

achakroun avatar Nov 10 '23 06:11 achakroun

can you try passing this to the sarek command line: --cnvkit_reference /path/to/my_reference.cnn

maxulysse avatar Nov 10 '23 09:11 maxulysse

Ok, please what about the targets. I have tried:

--cnvkit_targets /path/to/my_cnvkit_targets.bed

and it ended up with:

WARN: The following invalid input values have been detected:

* --cnvkit_targets: /home/ahmed/NGS/Twist_HumanCoreExome-RefSeq_hg38/cnvkit_targets.bed

achakroun avatar Nov 10 '23 10:11 achakroun

try with --intervals cf: https://nf-co.re/sarek/3.3.2/parameters#intervals

maxulysse avatar Nov 10 '23 13:11 maxulysse

Hello,

@maxulysse thanks a lot for the hint. It worked but I still need to tweak a bit using the cnvkit parameters. Please, could you let me know where can I find more information about the usage of cnvkit with sarek (e.g. where was the "--cnv_reference" argument described and is there other similar arguments we can use for cnvkit through sarek?)

Regards.

achakroun avatar Nov 13 '23 09:11 achakroun

Check https://nf-co.re/sarek/dev/parameters, and be sure to check the show hidden on the right hand side to see all parameters.

maxulysse avatar Nov 13 '23 10:11 maxulysse

Hello, Sorry for posting again but unfortunately I thought the issue was solved but it turns out that it doesn't. Actually, I still have the same missing files with cnvkit, meaning that the output files I have are as follow:

sample.antitargetcoverage.cnn

sample.targetcoverage.cnn
sample.cnr
reference.cnn
Twist_Exome_RefSeq_targets_hg38_200-pad.antitarget.bed
Twist_Exome_RefSeq_targets_hg38_200-pad.target.bed

And the following files are missing:

sample-diagram.pdf
sample-scatter.png
sample.bintest.cns
sample.cns
sample.call.cns

Here is the command I run:

nextflow run nf-core/sarek \
	-profile singularity \
	-resume \
	--wes \
	--intervals cnvkit_targets.bed \
	--step variant_calling \
	--input ./015-24.csv \
	--genome GATK.GRCh38 \
	--outdir ./results/015-24_testing \
	--tools cnvkit \
	--cnvkit_reference my_reference.cnn

When I look into .command.sh corresponding to the process, it looks like the my_reference.cnn I am passing through --cnvkit_reference arg wasn't taken into account!

#!/bin/bash -euo pipefail
samtools view -T Homo_sapiens_assembly38.fasta --fai-reference Homo_sapiens_assembly38.fasta.fai 015-24.recal.cram -@ 2 -o 015-24.recal.bam
samtools index 015-24.recal.bam

cnvkit.py \
    batch \
    015-24.recal.bam \
    --normal  \
    --fasta Homo_sapiens_assembly38.fasta \
     \
    --targets cnvkit_targets.bed \
    --processes 2 \
    --method hybrid --diagram --scatter

cat <<-END_VERSIONS > versions.yml
"NFCORE_SAREK:SAREK:BAM_VARIANT_CALLING_GERMLINE_ALL:BAM_VARIANT_CALLING_CNVKIT:CNVKIT_BATCH":
    samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
    cnvkit: $(cnvkit.py version | sed -e 's/cnvkit v//g')
END_VERSIONS

Also, when I run the cnvkit.py batch command described in the .command.sh manually, I end up with the same missing files and the following warning in the STDOUT:

WARNING: Most antitarget bins (100.00%, 39620/39620) have low or no coverage; is this amplicon/WGS?
Antitargets are nan x more variable than targets

Do you have any suggestions? Thanks. A

achakroun avatar Feb 13 '24 17:02 achakroun

Hey! Good news we are working on it @lescai has put in some time to clean this up and fix it :)

FriederikeHanssen avatar Feb 13 '24 18:02 FriederikeHanssen

Thanks a lot for the feedback @FriederikeHanssen. As I am already using cnvkit "manually", may be I can share the tools sequence that works perfectly for my germline wes-based cnv analysis, if you think it could be helpful. FYI, cnvkit batch doesn't work correctly in my case.

achakroun avatar Feb 14 '24 06:02 achakroun

Hi @achakroun would you be able to share the log file you find in the work folder of your cnvkit batch process? I do have all the files you mentioned using sarek as is,

mysample.converted-diagram.pdf
mysample.converted-scatter.png
mysample.converted.antitargetcoverage.cnn
mysample.converted.bam
mysample.converted.bam.bai
mysample.converted.bintest.cns
mysample.converted.call.cns
mysample.converted.cnr
mysample.converted.cns
mysample.converted.cram -> [..]]
mysample.converted.targetcoverage.cnn
Homo_sapiens_assembly38.fasta -> [..]]
Homo_sapiens_assembly38.fasta.fai -> [..]]
reference.cnn
mytarget.antitarget.bed
mytarget.bed -> [..]]
mytarget.target.bed
versions.yml

my .command.sh looks exactly like the one you referenced above. so I'd need to have a look at the actual process execution. would be great if you could also share your .command.log content to make sure? cheers!

lescai avatar Feb 16 '24 09:02 lescai

Hi @lescai Please, find attached the .command.log from the cnvkit batch work folder and the .nextflow.log of the corresponding run:

.command.log .nextflow.log

Please, are you using your own intervals and cnn reference with cnvkit through the sarek argument --intervals and --cnvkit_reference? If so, is it OK that the cnvkit reference is not mentioned in .command.sh?

FYI, I am processing exclusively WES data for germline "normal" samples.

Please, do not hesitate to ask me for any further details. Looking forward to your feedback. Regards.

achakroun avatar Feb 16 '24 12:02 achakroun

Thanks! I'll have a look and compare as soon as possible. the cnvkit_reference is precisely the issue I was meant to address in this fix, as in the current version the germline analysis does not use it.

lescai avatar Feb 16 '24 13:02 lescai

Actually, the workflow that works for us perfectly when using the standalone cnvkit is as follow:

1- cnvkit coverage with our *_targets.bed 2- cnvkit coverage with our *_antitargets.bed 3- cnvkit fix using our reference.cnn 4- cnvkit segment 5- cnvkit call with method set to cbs (which is default) 6- cnvkit export vcf

For the files *_targets.bed, *_antitargets.bed and reference.cnn, they are prepared once for all using the exome capture kit padded intervals as set by the manufacturer and a set (N=~60) of assumed "normal" samples as reference

Cheers.

achakroun avatar Feb 16 '24 15:02 achakroun