Variants from same position but on different strands

Open kokyriakidis opened this issue 3 years ago • 9 comments

Hi!

How do you manage variants on different strands? For example, A->G (+ strand) and T->C (- strand) at the same position. I want to merge some VCF files that show this kind of behavior, and I wonder if I need to convert everything to the + strand before using OpenCRAVAT.

kokyriakidis avatar Sep 01 '21 14:09 kokyriakidis

Hi @kokyriakidis, OpenCRAVAT expects VCF-format input files to follow the VCF format standard, in which all variants are written from the + strand's viewpoint.

With that said, do you mean that some of your VCF-format input files have "chr1 123456 A G" and others have "chr1 123456 T C"? Do they come from some pipeline?

rkimoakbioinformatics avatar Sep 01 '21 14:09 rkimoakbioinformatics

With that said, do you mean that some of your VCF-format input files have "chr1 123456 A G" and others have "chr1 123456 T C"? Do they come from some pipeline?

Yes, I mean exactly that. I use a pipeline to detect RNA-Editing events. It outputs a tab-separated file with RNA candidate positions. Some may be from the - strand, some from the + strand. I tried to convert them manually to VCF format (using a script I wrote) in order to be able to annotate them using open-cravat. But when I tried to merge the VCFs from all the samples I got a warning from bcftools that "The REF prefixes differ". This is expected because some samples have variation at the same position but on different strands (A->G(+), T->C(-)).

In strand specific protocols RNA Editing Events in humans should be A->G from the + strand and T->C from the - strand. So I guess I should convert everything from the - strand to the + strand.
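One way to locate the sites behind that bcftools warning is to group records by position and look for more than one REF allele. The following is only a minimal Python sketch: the function name `find_ref_conflicts` and the `(chrom, pos, ref, alt)` tuple layout are illustrative assumptions, not part of OpenCRAVAT or bcftools.

```python
from collections import defaultdict


def find_ref_conflicts(records):
    """Return {(chrom, pos): sorted REF alleles} for sites with more than one REF.

    `records` is an iterable of (chrom, pos, ref, alt) tuples (hypothetical layout).
    """
    refs_by_site = defaultdict(set)
    for chrom, pos, ref, _alt in records:
        refs_by_site[(chrom, pos)].add(ref.upper())
    # Keep only the sites where the samples disagree on the reference allele.
    return {site: sorted(refs) for site, refs in refs_by_site.items() if len(refs) > 1}


records = [
    ("chr1", 123456, "A", "G"),  # one sample, written from the + strand
    ("chr1", 123456, "T", "C"),  # another sample, same site written from the - strand
    ("chr1", 200000, "A", "G"),  # no disagreement here
]
conflicts = find_ref_conflicts(records)
print(conflicts)  # {('chr1', 123456): ['A', 'T']}
```

A site flagged this way is a candidate for strand conversion, but a REF mismatch alone does not prove that the strand is the cause.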

kokyriakidis avatar Sep 01 '21 14:09 kokyriakidis

Thanks for the information. Yes, at this moment, all variants in a VCF-format file should be based on the + strand.

I can think of adding a "correction" feature to OpenCRAVAT, so that it can check the reference base and convert if necessary, but I'd like to check how indels are written by your pipeline, especially the orientation of sequences. For example,

  1. Insertion of AG between chr1:10000T and chr1:10001C (TC > TAGC on the + strand).
     1.1 + strand notation: chr1 10000 T TAG
     1.2 - strand notation: chr1 10001 G GCT

  2. Deletion of AG between chr1:10000T and chr1:10003C (TAGC > TC on the + strand).
     2.1 + strand notation: chr1 10000 TAG T
     2.2 - strand notation: chr1 10003 GCT G

Are 1.2 and 2.2 what your pipeline produces? If 1.2 were written as chr1 10000 A ATC, for example, it would be misleading.
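To make the two notations concrete, here is a hedged Python sketch of how a - strand record of the kind described in 1.2 and 2.2 could be mapped back to the + strand. It assumes the - strand record is anchored at its stated position and that its sequences are read 5' to 3' along the - strand. The function `minus_to_plus` and the `base_at` lookup are hypothetical helpers, not an OpenCRAVAT feature, and the sketch does not left-align indels inside repeats.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse-complement a nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def minus_to_plus(pos, ref, alt, base_at):
    """Convert a - strand (pos, ref, alt) into left-anchored + strand form.

    base_at(p) must return the + strand reference base at 1-based position p.
    """
    plus_ref, plus_alt = revcomp(ref), revcomp(alt)
    plus_pos = pos - len(ref) + 1  # leftmost + strand position covered by REF
    if len(plus_ref) == len(plus_alt):
        return plus_pos, plus_ref, plus_alt  # substitution: no re-anchoring needed
    # The indel is now anchored on its last base; VCF anchors on the base before it.
    left = base_at(plus_pos - 1)
    plus_pos -= 1
    plus_ref, plus_alt = left + plus_ref, left + plus_alt
    # Trim the shared trailing bases that served as the - strand anchor.
    while len(plus_ref) > 1 and len(plus_alt) > 1 and plus_ref[-1] == plus_alt[-1]:
        plus_ref, plus_alt = plus_ref[:-1], plus_alt[:-1]
    return plus_pos, plus_ref, plus_alt


# Toy + strand references for the two examples (assumed, taken from the text).
insertion_ref = {10000: "T", 10001: "C"}
deletion_ref = {10000: "T", 10001: "A", 10002: "G", 10003: "C"}

# 1.2 -> 1.1
print(minus_to_plus(10001, "G", "GCT", insertion_ref.__getitem__))  # (10000, 'T', 'TAG')
# 2.2 -> 2.1, taking the - strand REF as GCT (the reverse complement of AGC)
print(minus_to_plus(10003, "GCT", "G", deletion_ref.__getitem__))   # (10000, 'TAG', 'T')
```

For a real genome, `base_at` would have to query the reference sequence, which is why indel conversion needs more than a base-by-base complement.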

rkimoakbioinformatics avatar Sep 01 '21 14:09 rkimoakbioinformatics

Hmm, RNA Editing should not cause any INS OR DEL so I guess I am not the right person to answer something like that. RNA-Editing pipelines output only SNVs.

In mammals, deamination of adenosine (A) to inosine (I) is the most common type of RNA Editing. Since inosine mimics the properties of guanosine (G), it is commonly recognised as G by transcription and translation machineries.

kokyriakidis avatar Sep 01 '21 14:09 kokyriakidis

I am not sure if this is the right place to ask but is it possible to include REDIportal 2 RNA-Editing database? http://srv00.recas.ba.infn.it/atlas/

kokyriakidis avatar Sep 01 '21 15:09 kokyriakidis

@kokyriakidis Yes. Thanks for letting us know about the database. We'll take a look and keep you updated.

Regarding the VCF format issue, in that case, automatic conversion may be possible. We'll keep you updated about it, too.

rkimoakbioinformatics avatar Sep 01 '21 19:09 rkimoakbioinformatics

automatic conversion may be possible

In the easy cases, yes. But positions with more than 1 alt, and other such nasties are gonna bite you.

--

Mike Cariaso http://www.cariaso.com

cariaso avatar Sep 01 '21 19:09 cariaso

@kokyriakidis An awk+sed command should be enough to convert your - strand VCF file into a + strand VCF file. Do you work with a mixed VCF file, or with one file that represents the + strand and another that represents the - strand? In the first case you should revise your workflow to handle both cases in parallel, convert the - strand file, and then merge into a single + strand file.
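The same conversion can be sketched in Python instead of awk+sed. This is only an illustration under stated assumptions: the input contains SNVs only (as RNA-Editing pipelines produce), every data line passed in is known to be written from the - strand, and the function name `complement_snv_line` is hypothetical. Comma-separated ALT alleles are complemented one by one so that their order, and any genotype indices that refer to them, stay unchanged.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def complement_snv_line(line):
    """Complement REF and ALT of one tab-separated VCF data line (SNVs only).

    Header lines (starting with '#') are returned unchanged.
    The trailing newline, if any, is stripped from data lines.
    """
    if line.startswith("#"):
        return line
    fields = line.rstrip("\n").split("\t")
    ref, alts = fields[3], fields[4].split(",")
    if len(ref) != 1 or any(len(a) != 1 for a in alts):
        # Indels need the reference genome to re-anchor; refuse rather than guess.
        raise ValueError(f"not a SNV record: {fields[0]}:{fields[1]}")
    fields[3] = ref.translate(COMPLEMENT)
    fields[4] = ",".join(a.translate(COMPLEMENT) for a in alts)
    return "\t".join(fields)


minus_line = "chr1\t123456\t.\tT\tC\t.\t.\t."
print(complement_snv_line(minus_line))  # REF/ALT become A and G

multi_alt = "chr1\t123456\t.\tT\tC,G\t.\t.\t."
print(complement_snv_line(multi_alt))   # REF becomes A, ALT becomes G,C
```

Rejecting non-SNV records keeps the sketch from silently producing misplaced indels.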

Juke34 avatar Sep 23 '21 12:09 Juke34

@Juke34 I work with a mixed VCF file and did what you proposed. I converted the - strand inputs to the + strand. A-to-G in the - strand was converted to T-to-C in the + strand.

kokyriakidis avatar Sep 25 '21 09:09 kokyriakidis