bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

Fixref questions

Open bgb-ipl opened this issue 6 years ago • 12 comments

Hi,

I have some questions about the fixref plugin :

  1. Looking at the code, I can see that "TOP-compatible" and "BOT-compatible" outputs in the logs are not counts but can only take 0 or 1 value. Could you explain what do these flag mean ?

  2. Using Illumina manifest, I can know TOP and BOT alleles but I don't see how I can provide such information to bcftools. For example, how do I represent a TOP [A/T] compared to a BOT [A/T] in the vcf file ?

Best,

bgb-ipl avatar Aug 14 '19 14:08 bgb-ipl

  1. this is a check if all REF/ALT alleles in the VCF are compatible with either TOP or BOT notation: the values mean "true" (1) or "false" (0)

  2. I don't understand the question. In VCF we use neither TOP nor BOT convention, we use the reference strand, defined by the reference genome stored in a fasta file. The TOP/BOT conventions enable to represent alleles locally, in the absence of a reference genome.

pd3 avatar Aug 15 '19 19:08 pd3

  1. Ok, so that's why we get 0 for both if we mix TOP and BOT in the same vcf file.

  2. Sorry, the question was confusing. For example, if I know, following the Illumina manifest, that I have some BOT [T/A] alleles with 2 individuals T/T and A/A (but don't know on which strand). If I have to represent this variant in the vcf for fixref, I should represent it like :

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Ind1	Ind2
1	97547947	snp1	T	A	.	.	.	GT	0/0	1/1

OR

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Ind1	Ind2
1	97547947	snp1	A	T	.	.	.	GT	1/1	0/0

In the end, I have the same results for both representation :

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Ind1	Ind2
1	97547947	snp1_ref	T	A	.	.	.	GT	0/0	1/1

But the first representation tells BOT compatible => 1 and the second representation tells TOP compatible => 1.

So, for statistics and checking purpose it seems better to represents :

  • TOP [A/T] alleles as A=>REF and T=>ALT so that fixref will tell TOP compatible
  • BOT [T/A] alleles as T=>REF and A=>ALT so that fixref will tell BOT compatible

bgb-ipl avatar Aug 16 '19 09:08 bgb-ipl

Hi, I have the same problem described in the second question.

It would be great if we can update bcftools fixref in a way that it can also process A/T & C/G variants from genotyping arrays in the presence of a guide .csv file, such as a manifest file (for example: http://emea.support.illumina.com/downloads/infinium-global-screening-array-v2-0-product-files.html).

The file can indicate whether A/T or C/G variant is actually TOP or BOT, and using that bcftools fixref could process the variant correctly.

maegsul avatar Aug 20 '19 10:08 maegsul

@bgb-ipl Obviously, for a file to be TOP or BOT compatible, all its sites have to be compatible with one of the conventions.

@maegsul This is useful. Knowing which of TOP/BOT is used for all sites, fixref can determine the alleles and fix the VCF.

pd3 avatar Aug 20 '19 11:08 pd3

I agree with @maegsul, a way of providing known TOP/BOT information (from Illumina manifest or info field information for example) could be useful to fixref.

Another point, when testing some random examples, I encounter 2 small bugs/problems (bcftools 1.9) :

  1. When using the -m id without providing a dbSNP file with the parameter -i, the software crashes with a "Segmentation fault". I know it's stupid not to provide the dbSNP file but it should ask for it instead of crashing (if it's the cause).

Command:

bcftools +fixref test.vcf.gz -- -f human.fa -m id

Copy/paste of valgrind output :

==31065== Invalid read of size 1
==31065==    at 0x4B6D70: find_scheme_handler (hfile.c:1005)
==31065==    by 0x4B7037: hopen (hfile.c:1024)
==31065==    by 0x4C01EA: hts_open_format (hts.c:423)
==31065==    by 0x4D2EC0: bcf_sr_add_reader (synced_bcf_reader.c:202)
==31065==    by 0xAAB6AD1: dbsnp_init (fixref.c:291)
==31065==    by 0xAAB6AD1: process (fixref.c:405)
==31065==    by 0x47E293: main_plugin (vcfplugin.c:639)
==31065==    by 0x620AF44: (below main) (libc-start.c:287)
==31065==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==31065== 
==31065== 
==31065== Process terminating with default action of signal 11 (SIGSEGV)
==31065==  Access not within mapped region at address 0x0
==31065==    at 0x4B6D70: find_scheme_handler (hfile.c:1005)
==31065==    by 0x4B7037: hopen (hfile.c:1024)
==31065==    by 0x4C01EA: hts_open_format (hts.c:423)
==31065==    by 0x4D2EC0: bcf_sr_add_reader (synced_bcf_reader.c:202)
==31065==    by 0xAAB6AD1: dbsnp_init (fixref.c:291)
==31065==    by 0xAAB6AD1: process (fixref.c:405)
==31065==    by 0x47E293: main_plugin (vcfplugin.c:639)
==31065==    by 0x620AF44: (below main) (libc-start.c:287)
==31065==  If you believe this happened as a result of a stack
==31065==  overflow in your program's main thread (unlikely but
==31065==  possible), you can try to increase the size of the
==31065==  main thread stack using the --main-stacksize= flag.
==31065==  The main thread stack size used in this run was 8388608.
==31065== 
==31065== HEAP SUMMARY:
==31065==     in use at exit: 326,030 bytes in 708 blocks
==31065==   total heap usage: 1,051 allocs, 343 frees, 818,446 bytes allocated
==31065== 
==31065== LEAK SUMMARY:
==31065==    definitely lost: 0 bytes in 0 blocks
==31065==    indirectly lost: 0 bytes in 0 blocks
==31065==      possibly lost: 0 bytes in 0 blocks
==31065==    still reachable: 326,030 bytes in 708 blocks
==31065==         suppressed: 0 bytes in 0 blocks
  1. I can encounter a "Floating point exception" when using the "top" or "flip" mode when using a vcf without any genotype which needs to be flipped and/or swapped.

Command used:

bcftools +fixref test.vcf.gz -- -f human.fasta -m top

Working example with genotypes:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Ind1
1	881627	snp	T	C	.	.	.	GT	0/0

As a result I have the correct output:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Ind1
1	881627	snp	G	A	.	.	.	GT	1/1

But If I remove the genotypes:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
1	881627	snp	T	C	.	.	.

I obtain the floating point exception. Copy/paste of valgrind output:

==31300== Process terminating with default action of signal 8 (SIGFPE)
==31300==  Integer divide by zero at address 0x802FBFCED
==31300==    at 0xAAB63AE: set_ref_alt.part.3.constprop.8 (fixref.c:220)
==31300==    by 0xAAB69D8: set_ref_alt (fixref.c:215)
==31300==    by 0xAAB69D8: process (fixref.c:497)
==31300==    by 0x47E293: main_plugin (vcfplugin.c:639)
==31300==    by 0x620AF44: (below main) (libc-start.c:287)
==31300== 
==31300== HEAP SUMMARY:
==31300==     in use at exit: 324,543 bytes in 682 blocks
==31300==   total heap usage: 1,020 allocs, 338 frees, 816,939 bytes allocated
==31300== 
==31300== LEAK SUMMARY:
==31300==    definitely lost: 0 bytes in 0 blocks
==31300==    indirectly lost: 0 bytes in 0 blocks
==31300==      possibly lost: 0 bytes in 0 blocks
==31300==    still reachable: 324,543 bytes in 682 blocks
==31300==         suppressed: 0 bytes in 0 blocks
==31300== Rerun with --leak-check=full to see details of leaked memory

I think that not having genotype, which is a valid vcf, should not failed this way.

bgb-ipl avatar Aug 20 '19 16:08 bgb-ipl

Hi, apologies for the delay. I am happy to add a support for this. However, I'd need real world data for testing because it's tricky to get the ambiguous G/C and A/T combinations right. Any chance you could share a test file offline? In order to discuss this possibility, my email address is on the profile page.

pd3 avatar Sep 03 '19 10:09 pd3

Hi,

I don't really see what do you mean by "real world data" : is it related to the comment about integrating Illumina manifest information on TOP/BOT or about the 2 bugs ?

bgb-ipl avatar Sep 06 '19 11:09 bgb-ipl

I meant the manifest feature suggested by @maegsul.

pd3 avatar Sep 06 '19 11:09 pd3

I can't provide any real data but you can search https://my.pgp-hms.org/public_genetic_data for publicly available data and then use the Illumina manifest file of the link provided by @maegsul. I think 23andme uses this GSA chip.

bgb-ipl avatar Sep 06 '19 11:09 bgb-ipl

Indeed, I agree with @bgb-ipl. Publicly available data on https://opensnp.org/genotypes & https://www.personalgenomes.org/ could be used for real world data.

I have 23andme raw data myself, and I am sure that what they name as "v5" (you can see it on the file name) is Global Screening Array (GSA) chip from Illumina. Especially data submitted from 2018 onward should be on v5 (GSA) https://www.xcode.life/23andme/23andme-v5-chip-dna-raw-data-analysis/

I can convert this 23andme file format to VCF using BCFTools plugin https://samtools.github.io/bcftools/howtos/convert.html , however it could be that 23andme marker names are not matching with original marker names in the manifest file always...

maegsul avatar Sep 06 '19 12:09 maegsul

So I am very interested in knowing if this addition of manifest files was ever implemented? Thanks for any update

ESDeutekom avatar Sep 30 '24 13:09 ESDeutekom

No, it was not. A good test case would be required to even start thinking about it

pd3 avatar Oct 12 '24 15:10 pd3