sleuth deal with suffix id in ensembl transcriptomes

Open pimentel opened this issue 9 years ago • 4 comments

It seems that Ensembl appends '.X' to the transcript id, where X denotes the version of that transcript.

http://sites.tufts.edu/cbi/files/2013/01/Introduction2ENSEMBL.pdf
https://www.biostars.org/p/102769/

When a target_mapping is provided and the intersection between the transcript ids and the mapping is empty, we should see if we can "fix it" by chopping off the '.X'.
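A rough sketch of what such a fallback could look like, assuming the target ids and the mapping ids are available as plain character vectors. `strip_version` and `match_target_ids` are hypothetical helper names used only for illustration; they are not sleuth functions, and the ids are made up:

```r
# Hypothetical helpers, not part of sleuth.
# Remove a trailing ".<digits>" version suffix from Ensembl-style ids.
strip_version <- function(ids) {
  sub("\\.[0-9]+$", "", ids)
}

# If no target id is found in the mapping, retry with the version suffix
# removed from both sides; otherwise return the ids unchanged.
match_target_ids <- function(target_ids, mapping_ids) {
  if (length(intersect(target_ids, mapping_ids)) > 0) {
    return(list(target_ids = target_ids, mapping_ids = mapping_ids, trimmed = FALSE))
  }
  trimmed_targets <- strip_version(target_ids)
  trimmed_mapping <- strip_version(mapping_ids)
  if (length(intersect(trimmed_targets, trimmed_mapping)) == 0) {
    stop("no overlap between target ids and target_mapping, even after trimming versions")
  }
  list(target_ids = trimmed_targets, mapping_ids = trimmed_mapping, trimmed = TRUE)
}

# Illustrative ids only: versioned targets, unversioned mapping.
res <- match_target_ids(
  c("ENST00000000001.1", "ENST00000000002.3"),
  c("ENST00000000001", "ENST00000000002")
)
```

Trimming only when the intersection is empty leaves already-matching inputs untouched, which is the behaviour described above.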

This was originally reported on the Google group:

https://groups.google.com/d/msg/kallisto-sleuth-users/NoT4ZD8SjEE/UZx9WBVWCQAJ

Jan 12 '16 01:01 pimentel

I have this problem, too. Could you find a solution for it?

Mar 03 '16 10:03 telia22

@telia22 you can trim the ensembl IDs yourself before passing them to sleuth. Something like this should work:

t2g$target_id <- gsub(t2g$target_id, pattern="\\.[0-9]+$", replacement="")

Mar 03 '16 12:03 blahah

If you would like to keep the versions, I have a quick and dirty workaround for the moment:

gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz

sed -E 's/^>(ENST[0-9]{2,}\.[0-9]{1,}).*(ENSG[0-9]{2,}\.[0-9]{1,}).*/\1,\2/' Homo_sapiens.GRCh38.cdna.all.fa | grep "^ENST" > tx2gene.csv

Once in R:

t2g <- read.csv(file="tx2gene.csv", sep=",", header=FALSE, col.names=c("target_id", "gene_id"))
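The same extraction can also be sketched directly in R, which avoids the intermediate CSV. This assumes the Ensembl cDNA header layout, where the versioned transcript id comes first and the versioned gene id follows a `gene:` tag; the header lines and ids below are made up for illustration:

```r
# Illustrative header lines in the assumed Ensembl cDNA FASTA layout (made-up ids).
headers <- c(
  ">ENST00000000001.2 cdna chromosome:GRCh38:1:100:200:1 gene:ENSG00000000001.5 gene_biotype:protein_coding",
  ">ENST00000000002.1 cdna chromosome:GRCh38:2:300:400:-1 gene:ENSG00000000002.10 gene_biotype:lncRNA"
)

# Capture the versioned transcript id (group 1) and versioned gene id (group 2).
pattern <- "^>(ENST[0-9]+\\.[0-9]+).*gene:(ENSG[0-9]+\\.[0-9]+).*$"

t2g <- data.frame(
  target_id = sub(pattern, "\\1", headers),
  gene_id   = sub(pattern, "\\2", headers),
  stringsAsFactors = FALSE
)
```

With a real file, `headers` could be built with `grep("^>", readLines("Homo_sapiens.GRCh38.cdna.all.fa"), value = TRUE)`; headers that do not fit the assumed layout would pass through `sub` unchanged, so it is worth checking the result before use.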

Nov 09 '17 14:11 jmillar201

One way is to use BioMart to retrieve the IDs with the version information included. Something like:


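library(biomaRt)
# Assumption: a human Ensembl mart; swap in the dataset for your organism.
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
t2g <- getBM(attributes = c('ensembl_transcript_id_version', 'ensembl_gene_id_version', 'external_gene_name'), mart = ensembl)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id_version,
                     ens_gene = ensembl_gene_id_version, ext_gene = external_gene_name)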

Nov 13 '18 17:11 tiagobrc
