gffutils
gffutils copied to clipboard
Map transcript coordinates to genome coordinates
I have the coordinates of domains mapped to transcripts so that the coordinates are relative to transcripts. Now I would like to transfer these transcript-coordinates to genome-coordinates. I wonder if there is some magic in gffutils that makes this reasonably easy. Hopefully an example will clarify:
Say my genome gff file contains this transcript with 3 exons:
import gffutils
data = """\
chr1 . gene 11 30 . . . ID=g1
chr1 . mRNA 11 30 . . . ID=t1;Parent=g1
chr1 . exon 11 14 . . . ID=e1;Parent=t1
chr1 . exon 19 22 . . . ID=e2;Parent=t1
chr1 . exon 27 30 . . . ID=e3;Parent=t1"""
db = gffutils.create_db(data.replace(' ', '\t'), ":memory:", from_string=True)
Which would look like this as text ideogram:
GGGGGGGG_g1_GGGGGGGG
EEEE----EEEE----EEEE
1 11 21 31
I have a domain that in transcript coordinates start at position 7 and ends at position 10. So it would look like this:
GGGGGGGG_g1_GGGGGGGG
EEEE----EEEE----EEEE
||----||
1 11 21 31
I would like a function that given the transcript ID and the domain coordinates in transcript-space returns the coordinates in genome-space:
transcriptToGenome(db, txid='t1', tx_start=7, tx_end=10)
# returns something like:
chr1 . dom 21 28 . . . ID=d1;Parent=t1
chr1 . dom 21 22 . . . ID=d1.1;Parent=d1
chr1 . dom 27 28 . . . ID=d1.2;Parent=d1
Before I bang my head on it, I wonder if an easy solution already exists. Thanks!
Interesting, I agree this would be useful, but such magic does not currently exist in this package!
I think the first missing piece would be to handle splicing properly, I don't really have data structures in place for that so it wouldn't be trivial. If you do end up getting something working and can provide some example code, it would be great to include in gffutils.
Thanks for getting in touch so quickly. For now I have this horrible implementation:
def transcriptToGenome(db, txid, tx_start, tx_end, featuretype='exon', child_featuretype='.', parent_featuretype='.'):
features = list(db.children(txid, featuretype=featuretype, order_by='start'))
if len(features) == 0:
raise Exception(f'No feature of type "{featuretype}" found for ID "{txid}"')
tx = db[txid]
txToGenomeMap = {}
out = []
tx_pos = 1
for feature in features:
genome_pos = range(feature.start, feature.end + 1)
for i in genome_pos:
# We map tx coordinates to genome coordinates in a dictionary like:
# {1:11, 2:12, 3:100, 4:101, ...}
# That's horrible but easier later to query
txToGenomeMap[tx_pos] = i
tx_pos += 1
if tx_start < tx_pos:
if txToGenomeMap[tx_start] < feature.start:
gstart = feature.start
else:
gstart = txToGenomeMap[tx_start]
if tx_end >= tx_pos:
gend = feature.end
else:
gend = txToGenomeMap[tx_end]
out.append(gffutils.Feature(tx.chrom, '.', child_featuretype, gstart, gend))
if tx_end < tx_pos:
break
parent = gffutils.Feature(tx.chrom, '.', parent_featuretype, out[0].start, out[-1].end)
out.insert(0, parent)
return(out)
Using toy data from above:
out = transcriptToGenome(db, 't1', 7, 10, 'exon', child_featuretype='chunk', parent_featuretype='dom')
for f in out:
print(f)
chr1 . dom 21 28 . . .
chr1 . chunk 21 22 . . .
chr1 . chunk 27 28 . . .
One would need to fill in the source column, the attributes, strand, etc but that's relatively easy and it depends on context.
I'd be really glad to have a clean gffutils
based solution for this as well! A while back I tried to put together a utility to convert transcript coordinate SAM files to genomic coordinates (https://github.com/mdshw5/transcoorder) but it's not really correct and doesn't handle reads spanning exon-exon junctions.