Clair3 icon indicating copy to clipboard operation
Clair3 copied to clipboard

Ref calls not printed with `--print_ref_calls` when `--vcf_fn=FILE` is specified

Open gtollefson opened this issue 1 year ago • 8 comments

Hi,

When I run the following clair3 command with --vcf_fn and --print_ref_calls provided, reference records report the genotype as ./. and don't report the other format fields (DP, etc.)

singularity exec \
			-B {params.INPUT_DIR} \
		  	-B {params.OUTPUT_DIR} \
		  	-B {params.REF_DIR} \
			-B {params.RESOURCE_DIR} \
		  	{input.sif_path} \
		  	/opt/bin/run_clair3.sh \
		  	--bam_fn={input.samples_to_run} \
		  	--ref_fn={input.REF} \
		  	--model_path=/opt/models/{params.MODEL_NAME} \
		  	--output={params.OUTPUT_DIR} \
		  	--threads={params.THREADS} \
		  	--platform=ont \
		  	--include_all_ctgs \
		  	--no_phasing_for_fa \
		  	--sample_name={wildcards.samples} \
		  	--gvcf \
		  	--include_all_ctgs \
			--threads=1 \
			--vcf_fn={params.targets_vcf} \
			--print_ref_calls

VCF output example showing missing ref fields: Screenshot 2024-01-10 at 11 48 37 PM

When I run a the same clair3 command but without --vcf_fn, the reference genotypes are reported as 0/0 and other format fields are reported.

singularity exec \
			-B {params.INPUT_DIR} \
		  	-B {params.OUTPUT_DIR} \
		  	-B {params.REF_DIR} \
			-B {params.RESOURCE_DIR} \
		  	{input.sif_path} \
		  	/opt/bin/run_clair3.sh \
		  	--bam_fn={input.samples_to_run} \
		  	--ref_fn={input.REF} \
		  	--model_path=/opt/models/{params.MODEL_NAME} \
		  	--output={params.OUTPUT_DIR} \
		  	--threads={params.THREADS} \
		  	--platform=ont \
		  	--include_all_ctgs \
		  	--no_phasing_for_fa \
		  	--sample_name={wildcards.samples} \
		  	--gvcf \
		  	--include_all_ctgs \
			--threads=1 \
			--print_ref_calls

VCF output example showing present ref fields:

Screenshot 2024-01-10 at 11 49 51 PM

Is this expected behavior? I'd like to be able to run clair3 to report full records for reference calls with GT, GQ, and DP values when --vcf_fn is specified.

gtollefson avatar Jan 11 '24 04:01 gtollefson

Yes it is expected. In the second example, you enabled --gvcf, which incurs more calculations at the reference calls, thus giving you more details.

aquaskyline avatar Jan 15 '24 13:01 aquaskyline

@aquaskyline Thanks for your response, but I don't think that answers my question since I enabled --gvcf for both examples - not just the second example. --vcf_fn is the only difference between the two runs.

Also the output examples I pasted above are both VCF file output (not gvcf files) from runs with --gvcf enabled which are automatically written along with the gvcf files. The gvcf output from the same run with --gvcf and --vcf_fn enabled does show 0/0 calls. I've pasted a screenshot from this run showing that the gvcf output shows 0/0 calls on the same sites that are reported as ./. in the vcf output.

Ref genotype missing at: chr4 748134 in VCF Screenshot 2024-01-16 at 12 00 39 PM

Ref genotype present at: chr4 748134 in GVCF from same run: Screenshot 2024-01-16 at 12 00 49 PM

This suggests that 0/0 genotypes and associated DP, GQ, AD values aren't reported in the VCF output (but are reported in the gvcf output) when --vcf_fn is specified.

We want to use the VCF output in our pipelines, but also need to run clair3 with '--vcf_fn' while outputting 0/0 reference genotypes and associated fields (DP, GQ, AD etc). Is this possible?

gtollefson avatar Jan 16 '24 17:01 gtollefson

@aquaskyline Just following up here. I'm hoping to see if Clair3 can support our use case before we move on to trying out other tools - Clair3 our favorite choice currently but we need to be able to calculate and report DP, AD, etc. for reference calls in non-gvcf output when --vcf_fn is specified.

gtollefson avatar Jan 24 '24 19:01 gtollefson

Hi, @gtollefson,

The './.' variants indicate low-AF sites that were not processed by Clair3 and thus not subsequently included in the output. To output these variants as RefCall, you can try setting --snp_min_af=0.0 and --indel_min_af=0.0 in addition to --vcf_fn.

In the next version, we will set --snp_min_af=0.0 and --indel_min_af=0.0 as the default behavior when '--vcf_fn' is applied.

zhengzhenxian avatar Jan 25 '24 03:01 zhengzhenxian

@gtollefson may we collect your feedback on whether the two settings --snp_min_af=0.0 and --indel_min_af=0.0 solve your problem?

aquaskyline avatar Jan 30 '24 08:01 aquaskyline

Hi @aquaskyline @zhengzhenxian,

Thank you very much for your help. Setting --snp_min_af=0.0 and --indel_min_af=0.0 along with --vcf_fn solved my problem. I am able to generate vcfs which report DP, AD, etc. for reference calls in non-gvcf output when --vcf_fn is specified.

However I just want to point out something interesting. The example site that was missing DP, AD, etc before shows an AF of 0.9094.

The './.' variants indicate low-AF sites that were not processed by Clair3 and thus not subsequently included in the output.

This site had high AF in the AF field. Instead could the reason they were excluded be that the sites had no alternate count in the AD field since they are reference calls?

Screenshot of new vcf output showing this site has high AF but no alternate count in AD field (as is expected for a reference call): Screenshot 2024-01-30 at 1 08 49 PM

gtollefson avatar Jan 30 '24 18:01 gtollefson

--snp_min_af and --indel_min_af actually refer to variant AF, a site with reference AF 0.9094 is with variant AF <0.1

aquaskyline avatar Jan 31 '24 00:01 aquaskyline

Ok I see! Thank you. We appreciate your fast response and clean solution. Best, George

gtollefson avatar Feb 06 '24 19:02 gtollefson

Fixed in v1.0.6.

aquaskyline avatar Mar 17 '24 05:03 aquaskyline