
variant name in the output file

Open ArashBayatDev opened this issue 5 years ago • 1 comments

The "Biallelic" option in the current version allows for two different representations of variants in the output file.

  • CHR_POS
  • CHR_POS_REF_ALT

I was wondering whether this option could be extended to let the user choose which columns are used as the variable name in the output file. For example, the ID column (which contains the rsID in most VCF files), or a custom combination of columns (like the bcftools query -f command), for example "CHR_POS_ID".
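To make the request concrete, here is a minimal sketch of a user-supplied name template. It is not VariantSpark code: the function `variant_name`, the `{CHROM}`-style placeholder syntax, and the sample record are all hypothetical, and the field names follow the standard VCF column headers rather than the CHR shorthand used above.

```python
# Hypothetical sketch: build a variable name from the fixed VCF columns
# using a user-supplied template. Not part of VariantSpark.
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT")


def variant_name(vcf_line: str, template: str = "{CHROM}_{POS}_{REF}_{ALT}") -> str:
    """Format the first five tab-separated VCF columns with the template."""
    fields = vcf_line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:5]))
    # format_map raises KeyError if the template names an unknown column
    return template.format_map(record)


# Made-up record for illustration only
line = "1\t12345\trs123\tA\tG\t.\tPASS\t."
short_name = variant_name(line, "{CHROM}_{POS}")
id_name = variant_name(line, "{CHROM}_{POS}_{ID}")
default_name = variant_name(line)
```

With a template like this, "CHR_POS_ID" would be written as `{CHROM}_{POS}_{ID}`.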

Ultimately, it would be great if VariantSpark could output a VCF file in which the importance score is annotated in the INFO field: for example, VSIS=0.0042, and VSIS=NA for variants that are not selected in the tree. This annotation would make it easier to use VariantSpark in different pipelines.
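As a rough illustration of what that annotation could look like, the sketch below adds a VSIS entry to the INFO column of each record using plain text processing. Everything here is an assumption for illustration: the function `annotate_vcf`, the importance lookup keyed by (CHROM, POS, REF, ALT), and the header wording are hypothetical, and the tag is declared as Type=String only so that the literal NA remains a valid value.

```python
# Hypothetical sketch: annotate VCF lines with an importance score in INFO.
# Not VariantSpark's implementation; names and header text are assumptions.
def annotate_vcf(lines, importance, tag="VSIS", missing="NA"):
    """Return VCF lines with `tag=<score>` appended to each record's INFO."""
    header = (
        f'##INFO=<ID={tag},Number=1,Type=String,'
        f'Description="VariantSpark importance score">'
    )
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#CHROM"):
            # Declare the new tag just before the column header line
            out.append(header)
            out.append(line)
        elif line.startswith("#") or not line:
            out.append(line)
        else:
            f = line.split("\t")
            key = (f[0], int(f[1]), f[3], f[4])  # CHROM, POS, REF, ALT
            score = importance.get(key)
            value = missing if score is None else f"{score:g}"
            entry = f"{tag}={value}"
            # INFO is column 8; "." means no existing annotations
            f[7] = entry if f[7] in (".", "") else f"{f[7]};{entry}"
            out.append("\t".join(f))
    return out


# Made-up input for illustration only
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t.\tPASS\tDP=10",
    "1\t200\t.\tC\tT\t.\tPASS\t.",
]
annotated = annotate_vcf(vcf, {("1", 100, "A", "G"): 0.0042})
```

Some strict VCF tools may prefer omitting the tag or using "." for missing values instead of NA; the `missing` parameter leaves that choice open.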

ArashBayatDev avatar Jun 21 '19 01:06 ArashBayatDev

I think the flag (bi-allelic variants) was the result of my evolving (mis)understanding of how variants are represented in VCF files and, more precisely, of what constitutes a unique key that can be used as a variable name for random forest.

The correct answer is (as far as I understand now): CHR, POS, REF, ALT+ (that is, all ALT alleles).

There should not really be any difference between bi-allelic and multi-allelic VCFs in this regard.

In particular, even for multi-allelic VCFs it's possible to have multiple variants at the same locus (CHR, POS) with different REFs or ALTs, so it was never enough to use just CHR and POS as the variant key.

I am not entirely sure what the current implementation is, but ideally the variable name should be CHR_POS_REF_[ALTS] (with a different separator between the ALT alleles), optionally followed by the rsID (as per @Arash Bayat's enhancement).
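A small sketch of that key, under stated assumptions: `variant_key`, `duplicated`, the "|" separator between ALT alleles, and the sample records are all hypothetical and do not reflect the current VariantSpark implementation. It shows why CHR_POS alone collides while the full key stays unique.

```python
# Hypothetical sketch of a CHR_POS_REF_[ALTS] key with optional rsID.
from collections import Counter


def variant_key(chrom, pos, ref, alts, rsid=None, alt_sep="|"):
    """Join CHR, POS, REF and all ALT alleles; append rsID when present."""
    key = f"{chrom}_{pos}_{ref}_{alt_sep.join(alts)}"
    if rsid and rsid != ".":  # "." is the VCF placeholder for a missing ID
        key += f"_{rsid}"
    return key


def duplicated(keys):
    """Return the sorted keys that occur more than once."""
    return sorted(k for k, n in Counter(keys).items() if n > 1)


# Made-up records: three variants share the locus 1:100
records = [
    ("1", 100, "A", ["G"], "."),
    ("1", 100, "A", ["T"], "."),
    ("1", 100, "AT", ["A"], "rs1"),
    ("2", 200, "C", ["G", "T"], "."),
]
locus_only = [f"{chrom}_{pos}" for chrom, pos, *_ in records]
full_keys = [variant_key(*r) for r in records]
```

Here `duplicated(locus_only)` reports the shared locus, while `duplicated(full_keys)` is empty.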

As for other suggestions

annotating VCF files:

  • I think this should be possible with the HAIL integration API, and we should move in this direction (e.g. switch to Hail 2.0). It should not be added to the existing command line interface. Possibly we can add a tool (Python/script) or documentation on how to do this from the current CSV output with bcftools or Hail.
  • If needed as a command line tool, I think we should develop a Python-based one specifically for genomics, based on the HAIL API.

custom format for variable names

Possibly, but I think it's a nice-to-have, and any custom format would still need to generate unique variable names.

piotrszul avatar Jun 28 '19 07:06 piotrszul

No current need

rocreguant avatar Feb 13 '24 00:02 rocreguant