gffutils icon indicating copy to clipboard operation
gffutils copied to clipboard

Map transcript coordinates to genome coordinates

Open dariober opened this issue 2 years ago • 3 comments

I have the coordinates of domains mapped to transcripts so that the coordinates are relative to transcripts. Now I would like to transfer these transcript-coordinates to genome-coordinates. I wonder if there is some magic in gffutils that makes this reasonably easy. Hopefully an example will clarify:

Say my genome gff file contains this transcript with 3 exons:

import gffutils

data = """\
chr1 . gene 11 30 . . . ID=g1
chr1 . mRNA 11 30 . . . ID=t1;Parent=g1
chr1 . exon 11 14 . . . ID=e1;Parent=t1
chr1 . exon 19 22 . . . ID=e2;Parent=t1
chr1 . exon 27 30 . . . ID=e3;Parent=t1"""

db = gffutils.create_db(data.replace(' ', '\t'), ":memory:", from_string=True)

Which would look like this as text ideogram:

          GGGGGGGG_g1_GGGGGGGG
          EEEE----EEEE----EEEE
1         11        21        31 

I have a domain that in transcript coordinates start at position 7 and ends at position 10. So it would look like this:

          GGGGGGGG_g1_GGGGGGGG
          EEEE----EEEE----EEEE
                    ||----||
1         11        21        31

I would like a function that given the transcript ID and the domain coordinates in transcript-space returns the coordinates in genome-space:

transcriptToGenome(db, txid='t1', tx_start=7, tx_end=10)
# returns something like:
chr1	.	dom	21	28	.	.	.	ID=d1;Parent=t1
chr1	.	dom	21	22	.	.	.	ID=d1.1;Parent=d1
chr1	.	dom	27	28	.	.	.	ID=d1.2;Parent=d1

Before I bang my head on it, I wonder if an easy solution already exists. Thanks!

dariober avatar Jan 05 '23 11:01 dariober

Interesting, I agree this would be useful, but such magic does not currently exist in this package!

I think the first missing piece would be to handle splicing properly, I don't really have data structures in place for that so it wouldn't be trivial. If you do end up getting something working and can provide some example code, it would be great to include in gffutils.

daler avatar Jan 05 '23 14:01 daler

Thanks for getting in touch so quickly. For now I have this horrible implementation:

def transcriptToGenome(db, txid, tx_start, tx_end, featuretype='exon', child_featuretype='.', parent_featuretype='.'):
    features = list(db.children(txid, featuretype=featuretype, order_by='start'))
    if len(features) == 0:
        raise Exception(f'No feature of type "{featuretype}" found for ID "{txid}"')

    tx = db[txid]
    txToGenomeMap = {}
    out = []
    tx_pos = 1
    for feature in features:
        genome_pos = range(feature.start, feature.end + 1)
        for i in genome_pos:
            # We map tx coordinates to genome coordinates in a dictionary like:
            # {1:11, 2:12, 3:100, 4:101, ...}
            # That's horrible but easier later to query
            txToGenomeMap[tx_pos] = i
            tx_pos += 1
        
        if tx_start < tx_pos:
            if txToGenomeMap[tx_start] < feature.start:
                gstart = feature.start
            else:
                gstart = txToGenomeMap[tx_start] 
            
            if tx_end >= tx_pos:
                gend = feature.end
            else:
                gend = txToGenomeMap[tx_end]
            out.append(gffutils.Feature(tx.chrom, '.', child_featuretype, gstart, gend))
            if tx_end < tx_pos:
                break
    parent = gffutils.Feature(tx.chrom, '.', parent_featuretype, out[0].start, out[-1].end)
    out.insert(0, parent)
    return(out)

Using toy data from above:

out = transcriptToGenome(db, 't1', 7, 10, 'exon', child_featuretype='chunk', parent_featuretype='dom')
for f in out:
    print(f)

chr1	.	dom	21	28	.	.	.	
chr1	.	chunk	21	22	.	.	.	
chr1	.	chunk	27	28	.	.	.	

One would need to fill in the source column, the attributes, strand, etc but that's relatively easy and it depends on context.

dariober avatar Jan 05 '23 15:01 dariober

I'd be really glad to have a clean gffutils based solution for this as well! A while back I tried to put together a utility to convert transcript coordinate SAM files to genomic coordinates (https://github.com/mdshw5/transcoorder) but it's not really correct and doesn't handle reads spanning exon-exon junctions.

mdshw5 avatar Jan 06 '23 01:01 mdshw5