m6anet
m6anet copied to clipboard
Reproducing Fig 2d,e; request for transcriptome2genome.py for dummies
Hi @chrishendra93,
I'm trying to reproduce part of Fig. 2d and 2e using the HCT116 and VIRC models provided in Git to verify that my installation of m6anet is intact and that I can run it as intended. Can I ask if I'm on the right track?
I've downloaded BAM and BLOW5 files for SGNex_Hek293T_directRNA_replicate1_run1
and SGNex_Hct116_directRNA_replicate2_run1
to do this and plan to compare the m6anet inference output on these datasets to the results found in supplementary table 3 and 4 (test on HCT116 and HEK293T respectively), filtering for the 500 genes that have been marked "Test" under set_type.
So far, the different steps seem to have run fine without returning any error or warning messages. m6anet inference
on SGNex_Hct116_directRNA_replicate2_run1
returned 16,112 records in data.site_proba.csv
. Of these, 2,538 records corresponded to the 500 genes marked for testing. I'm not familiar with Python so I used the R code mentioned in #104 for transcriptome-to-genome (t2g) conversion. However, I found only four records that had the same genomic positions as that found in supplementary table 3 and I'm not sure what exactly is the problem (e.g. could be the R code, could be my installation of m6anet
, etc.). Do you have something like a transcriptome2genome.py
script for dummies (I imagine the input will be the data from m6anet inference and the FASTA and GTF files required) so I can rule out the t2g conversion? Or if you have other suggestions, please let me know!
Joel
To add, I tried the code you wrote in #76 but got a KeyError from the first term like so:
>>> fasta_dict = readFasta("/mnt/c/Users/USER/Desktop/m6anet_data/Homo_sapiens.GRCh38.cdna.ncrna.fa")
>>> gtf_dict, _ = readGTF("/mnt/c/Users/USER/Desktop/m6anet_data/Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf")
>>> all_transcripts = readalltranscript("/mnt/c/Users/USER/Desktop/m6anet_data/SGNex_Hct116_directRNA_replicate2_run1/data.site_proba.csv")
>>> t2g_dict = {tx: t2g(tx, fasta_dict, gtf_dict) for tx in all_transcripts}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <dictcomp>
File "<stdin>", line 4, in t2g
KeyError: 'ENST00000389680.2,103,81,0.4684189260005951,GAACA,0.3333333333333333'
This is the first line of all_transcripts:
>>> all_transcripts[0]
'ENST00000389680.2,103,81,0.4684189260005951,GAACA,0.3333333333333333'
I checked that the key can be found in fasta_dict and gtf_dict:
>>> fasta_dict['ENST00000389680']
'AATAGGTTTGGTCCTAGCCTTTCTATTAGCTCTTAGTAAGATTACACATGCAAGCATCCCCGTTCCAGTGAGTTCACCCTCTAAATCACCACGATCAAAAGGAACAAGCATCAAGCACGCAGCAATGCAGCTCAAAACGCTTAGCCTAGCCACACCCCCACGGGAAACAGCAGTGATTAACCTTTAGCAATAAACGAAAGTTTAACTAAGCTATACTAACCCCAGGGTTGGTCAATTTCGTGCCAGCCACCGCGGTCACACGATTAACCCAAGTCAATAGAAGCCGGCGTAAAGAGTGTTTTAGATCACCCCCTCCCCAATAAAGCTAAAACTCACCTGAGTTGTAAAAAACTCCAGTTGACACAAAATAGACTACGAAAGTGGCTTTAACATATCTGAACACACAATAGCTAAGACCCAAACTGGGATTAGATACCCCACTATGCTTAGCCCTAAACCTCAACAGTTAAATCAACAAAACTGCTCGCCAGAACACTACGAGCCACAGCTTAAAACTCAAAGGACCTGGCGGTGCTTCATATCCCTCTAGAGGAGCCTGTTCTGTAATCGATAAACCCCGATCAACCTCACCACCTCTTGCTCAGCCTATATACCGCCATCTTCAGCAAACCCTGATGAAGGCTACAAAGTAAGCGCAAGTACCCACGTAAAGACGTTAGGTCAAGGTGTAGCCCATGAGGTGGCAAGAAATGGGCTACATTTTCTACCCCAGAAAACTACGATAGCCCTTATGAAACTTAAGGGTCGAAGGTGGATTTAGCAGTAAACTAAGAGTAGAGTGCTTAGTTGAACAGGGCCCTGAAGCGCGTACACACCGCCCGTCACCCTCCTCAAGTATACTTCAAAGGACATTTAACTAAAACCCCTACGCATTTATATAGAGGAGACAAGTCGTAACATGGTAAGTGTACTGGAAAGTGCACTTGGACGAAC'
>>> gtf_dict['ENST00000389680']
{'chr': 'MT', 'g_id': 'ENSG00000211459', 'strand': '+', 'transcript': [(648, 1601)], 'exon': [(648, 1601)], 'tx_exon': [(0, 953)]}
Is there something missing/wrong in t2g? Below is what I copied. I suspect, but I'm not sure, that the problem lies in the tx_id.split lines:
def t2g(tx_id, fasta_dict, gtf_dict):
t2g_dict = {}
tx_seq = fasta_dict[tx_id.split(".")[0]]
tx_contig = gtf_dict[tx_id]['chr']
g_id = gtf_dict[tx_id]['g_id']
if tx_seq is None:
return [], []
for exon_num in range(len(gtf_dict[tx_id]['exon'])):
g_interval = gtf_dict[tx_id]['exon'][exon_num]
tx_interval = gtf_dict[tx_id]['tx_exon'][exon_num]
for g_pos in range(g_interval[0], g_interval[1] + 1): # Exclude the rims of exons.
dis_from_start = g_pos - g_interval[0]
if gtf_dict[tx_id]['strand'] == "+":
tx_pos = tx_interval[0] + dis_from_start
elif gtf_dict[tx_id]['strand'] == "-":
tx_pos = tx_interval[1] - dis_from_start
if (g_interval[0] <= g_pos < g_interval[0]+2) or (g_interval[1]-2 < g_pos <= g_interval[1]):
kmer = 'XXXXX'
else:
kmer = tx_seq[tx_pos-2:tx_pos+3]
t2g_dict[tx_pos] = g_pos # tx.contig is chromosome.
return t2g_dict
Hi @chrishendra93,
I modified the code you wrote in #76. I made one change to the t2g
function and its input to fix the KeyError I got earlier, added a function to append the genomic position to the original m6anet output, and saved the output in CSV. Can I trouble you to help verify if it looks correct?
def t2g(tx_id, fasta_dict, gtf_dict):
t2g_dict = {}
tx_seq = fasta_dict[tx_id]
tx_contig = gtf_dict[tx_id]['chr']
g_id = gtf_dict[tx_id]['g_id']
if tx_seq is None:
return [], []
for exon_num in range(len(gtf_dict[tx_id]['exon'])):
g_interval = gtf_dict[tx_id]['exon'][exon_num]
tx_interval = gtf_dict[tx_id]['tx_exon'][exon_num]
for g_pos in range(g_interval[0], g_interval[1] + 1): # Exclude the rims of exons.
dis_from_start = g_pos - g_interval[0]
if gtf_dict[tx_id]['strand'] == "+":
tx_pos = tx_interval[0] + dis_from_start
elif gtf_dict[tx_id]['strand'] == "-":
tx_pos = tx_interval[1] - dis_from_start
if (g_interval[0] <= g_pos < g_interval[0]+2) or (g_interval[1]-2 < g_pos <= g_interval[1]):
kmer = 'XXXXX'
else:
kmer = tx_seq[tx_pos-2:tx_pos+3]
t2g_dict[tx_pos] = g_pos # tx.contig is chromosome.
return t2g_dict
def appendt2g(all_transcripts, t2g_dict, gtf_dict):
all_transcripts_with_genomic = []
for transcript_entry in all_transcripts:
transcript_id, transcriptome_position, n_reads, probability_modified, kmer, mod_ratio = transcript_entry.split(',')
genomic_position = t2g_dict[transcript_entry][int(transcriptome_position) - 1]
g_id = gtf_dict[transcript_id.split(".")[0]]['g_id']
updated_entry = f"{transcript_id},{transcriptome_position},{n_reads},{probability_modified},{kmer},{mod_ratio},{genomic_position},{g_id}"
all_transcripts_with_genomic.append(updated_entry)
return all_transcripts_with_genomic
import csv
fasta_dict = readFasta("m6anet_data/Homo_sapiens.GRCh38.cdna.ncrna.fa")
gtf_dict, _ = readGTF("m6anet_data/Homo_sapiens.GRCh38.91.chr_patch_hapl_scaff.gtf")
all_transcripts = readalltranscript("m6anet_data/SGNex_Hct116_directRNA_replicate2_run1/data.site_proba.csv")
t2g_dict = {tx: t2g(tx.split(".")[0], fasta_dict, gtf_dict) for tx in all_transcripts}
all_transcripts_with_genomic = appendt2g(all_transcripts, t2g_dict, gtf_dict)
with open('m6anet_data/SGNex_Hct116_directRNA_replicate2_run1/m6anet_t2g.csv', 'w') as file:
writer = csv.writer(file)
for row in all_transcripts_with_genomic:
writer.writerow(row.split(','))