DWGSIM
DWGSIM copied to clipboard
how should I use DWGSIM to generate SNP in RNA data?
Hello, I'm writing to ask your opinion about generating SNP in RNA data. I already used DWGSIM for genomic data and finding the global position of the variants in the VCF was relatively straightforward, I would look for the specific gene in the GTF file and add the position given by DWGSIM to the start of the gene in case of genes on the positive strand or subtract the position given by DWGSIM from the end of the gene in case of genes on the negative strand.
Now, I am using DWGSIM to generate SNP on RNA sequences, therefore I have no introns in the fasta sequences. My question is, how does DWGSIM calculate the position of the variants? From what I understood DWGSIM returns in the mutation file a local position of the variant in the specific gene only by basing the number on the fasta sequence passed as input, therefore if it is an RNA sequence it doesn't count for introns. If this assumption is correct, how should I try to find the real global position of the variant in the VCF in the case of an RNA sequence?
I thought about a calculation that looks like this:
If the Gene is on the positive strand: 1 case: the variant is in the first exon -> I calculate the global position by doing: (the start of the gene) + (position given by DWGSIM) 2 case: the variant is NOT in the first exon, so I need to remove the length of the introns that precede the variant because their length is not taken into consideration by DWGSIM -> I calculate the global position by doing: (the start of the exon where the variant is located) + (position given by DWGSIM) - (the sum of the introns' length until that specific exon where the variant is located)
If the Gene is on the negative strand: 1 case: the variant is in the first exon -> I calculate the global position by doing: (the end of the gene) - (position given by DWGSIM) 2 case: the variant is NOT in the first exon -> I calculate the global position by doing: (the END of the exon where the variant is located) - (position given by DWGSIM) - (the sum of the introns' length until that specific exon where the variant is located)
Does this calculation make sense or perhaps there is a better way to do this?
Thank you so much for your support, Kind Regards Simone Rossi
The position of the variant is 1-based relative to the given contig, in this case transcript.
So for a given transcript, you should have (from the GTF/elswhere) the strand of the transcript, as well as the genomic start/end locations for each exon/feature in the transcript. So I'd create a file with the following columns and a row per exon:
-
tx_id
: the transcript ID (contig name in your input FASTA) -
tx_g_contig
: the contig in the genome (e.g. chr4) -
tx_g_strand
: the strand of the transcript in the genome (+ or -) -
tx_g_start
: the 1-based start position of the transcript in the genome -
tx_g_end
: the 1-based end position of the transcript in the genome (can make inclusive or exclusive) -
exon_id
: the exon id (optional) -
exon_tx_start
: the 1-based start position of the exon in the transcript -
exon_tx_end
: the 1-based end position of the exon in the transcript (can make inclusive or exclusive)
So for a given variant, the CHROM field in the output VCF from DWGSIM is the tx_id
and the POS field is the 1-based start position of the variant in that transcript (the end is the start plus the length of the reference allele minus one). So now you have the span of the variant in the transcript coordinates, and you can then use the above table to find the overlap with exon_tx_start/exon_tx_end
to lift over to the genome. Note: either the variant is contained within a single exon, or can span multiple exons (deletion/MNV/etc.). It's all book-keeping.