sarek icon indicating copy to clipboard operation
sarek copied to clipboard

A USER ERROR has occurred: Bad input: Sample $name is not in BAM header: [...]

Open petanska opened this issue 1 year ago • 13 comments

Description of the bug

When running sarek 3.0.1 with Mutect2 on tumor-normal paired WES data in uBAM format, the pipeline will crash with an error message saying: """ A USER ERROR has occurred: Bad input: Sample $name is not in BAM header: [...] """

When running the same command on tumor-normal paired WES in fastq format, the pipeline will run just fine.

Command used

ref_dir=/path/to/refs
vep_dir=${ref_dir}/vep106.1

NXF_SINGULARITY_CACHEDIR=/path/to/singularity_cache
export NXF_SINGULARITY_CACHEDIR=$NXF_SINGULARITY_CACHEDIR
nextflow run /my/local/sarek-3.0.1/workflow/ \
	-with-trace trace_complete.txt \
	-with-timeline timeline_complete.html \
	-with-dag dag_complete.pdf \
	--tools 'mutect2' \
	--input 't-n_wes_ubams.csv' \
	--wes \
	--genome GATK.GRCh38 \
	--igenomes_base ${ref_dir} \
	--vep_cache ${vep_dir} \
	--max_time '240.h' \
	--max_memory '32.GB' \
	--max_cpus 16 \
	-profile singularity > complete.log 2>&1

Findings

We checked the code and it indeed looks for {meta.patient}_{meta.normal_id} in the CRAM/BAM header, but our CRAM header only matched the {meta.normal_id}. https://github.com/nf-core/sarek/blob/ad2b34f39fead34d7a09051e67506229e827e892/conf/modules.config#L1111 When we modified the code so that it would only look for {meta.normal_id}, the code would run until the end without any issues. We did a rerun starting from fastq files and there the original pipeline did not crash. We proceeded to look into the read_group generation and there we found the issue: Starting from fastq assigns {row.patient}_{row.sample} to "SM" in the read group, https://github.com/nf-core/sarek/blob/5bb160ddb2cee50c811585e450b16cba13c95a02/workflows/sarek.nf#L1196 while Starting from bam assigns {row.sample} to "SM". https://github.com/nf-core/sarek/blob/5bb160ddb2cee50c811585e450b16cba13c95a02/workflows/sarek.nf#L1219 We think that this is a bug, since Mutect2 variant calling would always crash when looking for the {meta.patient}_{meta.normal_id} in the {row.sample} BAM header of uBAM-derived sequencing data.

We think the solution would be to update the "\tSM:${row.sample}" to "\tSM:${row.patient}_${row.sample}" in line 1219 of sarek/workflows/sarek.nf (However, we did not test this yet).

For now people are able to circumvent this bug in tumor-normal mutect2 runs, by starting from fastq reads.

petanska avatar Sep 07 '22 08:09 petanska

@petanska thank you for reporting this. We will test your proposed fix!

FriederikeHanssen avatar Sep 07 '22 08:09 FriederikeHanssen

Hi,

I am having the same problem, but from cram files. Has this problem been fixed? Or it has to be done from the fastq files?

Thanks,

msguixe avatar Sep 13 '22 07:09 msguixe

what do you mean from CRAM files? Can you post your command? We are working on it currently, but not done yet.

As a workaround you can supply a custom.config for mutect with the expected normal name:

process{
   withName: MUTECT2 {
      ext.args = "<copy ext.args from modules.config to avoid overwritting> + your parameter "
   }
}

and then specify it with a lower case c

FriederikeHanssen avatar Sep 13 '22 10:09 FriederikeHanssen

I mean that I am specifying the cram files in the input files, not the bam or fastq. I will try this solution and let you know if it works. Thanks!

msguixe avatar Sep 13 '22 12:09 msguixe

sure but from which step are you starting? also mapping?

FriederikeHanssen avatar Sep 13 '22 13:09 FriederikeHanssen

No, variant_calling. This is the command:

nextflow run nf-core/sarek -r 3.0.1 -profile singularity -c /workspace/datasets/sjd_seq/code/sarek/sarek_3.0.1.conf --input input_vc.csv --step variant_calling --tools 'strelka,ascat,manta,mutect2' --fasta /workspace/datasets/genomes/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fa --fasta_fai /workspace/datasets/genomes/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fa.fai --dict /workspace/datasets/genomes/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.dict --germline_resource /workspace/datasets/genomes/iGenomes/Homo_sapiens/GATK/GRCh38/Annotation/GermlineResource/gnomAD.r2.1.1.GRCh38.PASS.AC.AF.only.vcf.gz --germline_resource_tbi /workspace/datasets/genomes/iGenomes/Homo_sapiens/GATK/GRCh38/Annotation/GermlineResource/gnomAD.r2.1.1.GRCh38.PASS.AC.AF.only.vcf.gz.tbi -resume

And this is the input:

patient,sex,status,sample,cram,crai
pt1,XX,0,AQ5174,/workspace/nobackup/sjd_seq/platinum_results/20220809/pt1-t1-allsamples-t1/sarek_results/results/preprocessing/recalibrated/AQ5174/AQ5174.recal.cram,/workspace/nobackup/sjd_seq/platinum_results/20220809/pt1-t1-allsamples-t1/sarek_results/results/preprocessing/recalibrated/AQ5174/AQ5174.recal.cram.crai
pt1,XX,1,AQ5180,/workspace/nobackup/sjd_seq/platinum_results/20220809/pt1-t1-allsamples-t1/sarek_results/results/preprocessing/recalibrated/AQ5180/AQ5180.recal.cram,/workspace/nobackup/sjd_seq/platinum_results/20220809/pt1-t1-allsamples-t1/sarek_results/results/preprocessing/recalibrated/AQ5180/AQ5180.recal.cram.crai

msguixe avatar Sep 13 '22 13:09 msguixe

ok, that is a different issue ( or requires a different adaption from our side). ( cc @maxulysse we might need to think of a way of how to deal with this in the ling run since GATK can't match properly without this string 🙄 )

Hm, I am not sure from the top of my head if this would solve your problem, but what you could try is set the sample field in you samplesheet (for normal samples only, since this field needs to be unique) and set it with the normal name as specified in your input data. You can then proceed with the custom config as above and set:

process{
   withName: 'NFCORE_SAREK:SAREK:PAIR_VARIANT_CALLING:GATK_TUMOR_NORMAL_SOMATIC_VARIANT_CALLING:MUTECT2' {
      ext.args = { "<copy ext.args from modules.config to avoid overwritting>  --normal-sample ${meta.sample}" }
   }
}

FriederikeHanssen avatar Sep 13 '22 13:09 FriederikeHanssen

I don't understand what do you mean by "sample field in you samplesheet". Same name as the one for the normal in the input data (AQ5174), where?

msguixe avatar Sep 13 '22 13:09 msguixe

In the samplesheet there is a column called sample which is a unique sample name. the normal-sample MUTECT2 uses is somewhere encoded in the BAM/CRAM header (which is what we set manually when the pipeline starts from mapping). However, since you already have the CRAMs, I don't know what you used there when generating them. So you could manually provide this and then access it via the custom config as descirbed above :)

Does that make it a bit clearer?

FriederikeHanssen avatar Sep 13 '22 13:09 FriederikeHanssen

Ok, sorry I am a bit slow. Samplesheet, you mean the input, right? I think what you mean is just to add that confing file and that's it. And leave the input as it is. So the config file will go to the input to take the sample name of the normal, and use it in MUTECT2. Right?

msguixe avatar Sep 13 '22 14:09 msguixe

Yes, sorry that is waht I mean. The file you five with --input that contains all your data. So your input file would look sth like this:

patient,sample,status,cram,crai
p1,normal-sample,0,file.cram,file.crai
p1,tumor_sample,1,file,cram,crai

with a config:

process{
   withName: 'NFCORE_SAREK:SAREK:PAIR_VARIANT_CALLING:GATK_TUMOR_NORMAL_SOMATIC_VARIANT_CALLING:MUTECT2' {
      ext.args = { "<copy ext.args from modules.config to avoid overwritting>  --normal-sample ${meta.normal_id}" }
   }
}

then you command like this:

nextflow run nf-core/sarek -r 3.0.1 -profile ... --input input.csv .... -c custom.config

this should then pick up the ID you specify in the sample column for the normal sample as normal sample in MUTECT2. I have not tested this though, so carefully check the command.sh that it really picks up what you expect

FriederikeHanssen avatar Sep 13 '22 14:09 FriederikeHanssen

Thanks a lot for your suggestion. It worked, the pipeline finished without problems.

msguixe avatar Sep 15 '22 07:09 msguixe

Hi @petanska ! Unfortunately had to push to fix to the next release. Time caught up with us.

FriederikeHanssen avatar Sep 26 '22 19:09 FriederikeHanssen