sleuth

deal with suffix id in ensembl transcriptomes

Open pimentel opened this issue 9 years ago • 4 comments

It seems that Ensembl appends '.X' to the transcript ID to denote the version of that transcript.

http://sites.tufts.edu/cbi/files/2013/01/Introduction2ENSEMBL.pdf https://www.biostars.org/p/102769/

When there is a target_mapping provided and the intersection between the transcript ids and the mapping is empty, we should see if we can "fix it" by chopping off the '.X'

This was originally reported on the Google group:

https://groups.google.com/d/msg/kallisto-sleuth-users/NoT4ZD8SjEE/UZx9WBVWCQAJ
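
The fallback described above could be sketched roughly as follows. This is only an illustration, not sleuth's actual code: `fix_versioned_ids` is a hypothetical helper, and its two arguments (the transcript ids from the quantification and the `target_id` column of the `target_mapping`) are assumed to be plain character vectors.

```r
# Sketch only: `fix_versioned_ids` is a hypothetical helper, not part of sleuth.
# target_ids:  transcript ids as they appear in the quantification output
# mapping_ids: the target_id column of the user-supplied target_mapping
fix_versioned_ids <- function(target_ids, mapping_ids) {
  # Nothing to do if the ids already overlap with the mapping.
  if (length(intersect(target_ids, mapping_ids)) > 0) {
    return(target_ids)
  }
  # Empty intersection: strip a trailing '.X' version suffix and try again.
  trimmed <- sub("\\.[0-9]+$", "", target_ids)
  if (length(intersect(trimmed, mapping_ids)) > 0) {
    warning("No overlap with target_mapping; matched after removing version suffixes.")
    return(trimmed)
  }
  # Still no overlap: return the ids unchanged so the caller can report the mismatch.
  target_ids
}

ids <- c("ENST00000456328.2", "ENST00000450305.2")
t2g_ids <- c("ENST00000456328", "ENST00000450305")
fixed <- suppressWarnings(fix_versioned_ids(ids, t2g_ids))
```

Trimming only when the intersection is empty leaves correctly versioned mappings untouched, and the warning makes the silent rewrite visible to the user.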

pimentel avatar Jan 12 '16 01:01 pimentel

I have this problem too. Were you able to find a solution?

telia22 avatar Mar 03 '16 10:03 telia22

@telia22 you can trim the ensembl IDs yourself before passing them to sleuth. Something like this should work:

t2g$target_id <- gsub(t2g$target_id, pattern="\\.[0-9]+$", replacement="")

blahah avatar Mar 03 '16 12:03 blahah

If you would like to keep the versions, I have a quick and dirty workaround for the moment:

gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz

sed -E 's/^>(ENST[0-9]{2,}\.[0-9]{1,}).*(ENSG[0-9]{2,}\.[0-9]{1,}).*/\1,\2/' Homo_sapiens.GRCh38.cdna.all.fa | grep "^ENST" > tx2gene.csv

Once in R: t2g <- read.csv(file="tx2gene.csv", sep=",", header=FALSE, col.names=c("target_id", "gene_id"))

jmillar201 avatar Nov 09 '17 14:11 jmillar201

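One way is to use BioMart to retrieve the IDs with the version information included. Something like:

library(biomaRt)
# Assumes a human dataset; swap in the dataset that matches your transcriptome.
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
t2g <- getBM(attributes = c('ensembl_transcript_id_version', 'ensembl_gene_id_version', 'external_gene_name'), mart = ensembl)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id_version,
                     ens_gene = ensembl_gene_id_version, ext_gene = external_gene_name)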

tiagobrc avatar Nov 13 '18 17:11 tiagobrc