bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

bcftools consensus --mark-ins

Open loieeyet opened this issue 1 year ago • 2 comments

Hello,

I'm a researcher who has been using bcftools effectively. I was very grateful for the update you provided last time that allowed inserting characters into mark-del. This time, I'd like to discuss the functionality of mark-ins.

Here's an example of a VCF for a single sample:

chr1 786694 - A ATT . PASS AC=- GT 0|1

When running the following command:

samtools faidx Homo_sapiens_assembly38.fasta chr1:786694-786700 | 
bcftools consensus \
--mark-ins lc \ 
--mark-del d \
-H 1 \  # or 2
-s  sample1 \
test.vcf.gz

With H set to 1, the output is ACGCAT, and with H set to 2, the output becomes AttCGCAT.

In this scenario, I would like the output to be A--CGCAT when H is set to 1. In situations of heterozygosity where one allele has an insertion, I'd like to fill in specific characters to ensure both sequences are of equal length.

Do you have plans to support this feature? If not, I would be very grateful if you could direct me to which part to refer to or modify so that I can implement this feature myself.

loieeyet avatar Mar 12 '24 09:03 loieeyet

How would it work if the genotype was, say, triploid or tetraploid, and there were multiple indels of different length? I am struggling to define the behavior for a generalized case.

pd3 avatar Mar 14 '24 17:03 pd3

Considering I was mainly focused on human haploid situations, I realize it's quite a task for you to design a tool that covers generalized cases.

Thinking it over, it seems challenging to add new functionality to the --mark-ins option. What about introducing an option called --pad-insertion CHAR? This would pad sequences at insertion points with a specific CHAR to make lengths consistent, except for the alt allele. It would specifically apply to phased data and require a certain --haplotype N option to be set. For triploid or tetraploid scenarios, this option could take into account the various genotypes, producing a consensus sequence that's appropriately padded. And when dealing with multiple indels of differing lengths, maybe padding according to the longest indel might be a good strategy.

loieeyet avatar Mar 14 '24 18:03 loieeyet