deal with suffix id in ensembl transcriptomes
It seems that Ensembl appends '.X' to the transcript ID, where X denotes the version of that transcript.
http://sites.tufts.edu/cbi/files/2013/01/Introduction2ENSEMBL.pdf
https://www.biostars.org/p/102769/
When a target_mapping is provided and the intersection between the transcript IDs and the mapping is empty, we should see if we can "fix it" by chopping off the '.X'.
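A minimal sketch of that fallback in base R, to make the proposal concrete. This is not sleuth's implementation: the helper name `reconcile_ids` and the toy IDs are hypothetical, and it assumes the mapping is a data frame with a `target_id` column.

```r
# Hypothetical helper, not part of sleuth.
# target_ids: character vector of transcript IDs from the quantification.
# t2g:        data frame with a `target_id` column (the target mapping).
reconcile_ids <- function(target_ids, t2g) {
  # Nothing to fix if at least some IDs already match the mapping.
  if (length(intersect(target_ids, t2g$target_id)) > 0) {
    return(target_ids)
  }
  # Empty intersection: retry with the trailing '.X' version removed.
  trimmed <- sub("\\.[0-9]+$", "", target_ids)
  if (length(intersect(trimmed, t2g$target_id)) == 0) {
    stop("no overlap between transcript IDs and target_mapping, ",
         "even after trimming version suffixes")
  }
  warning("trimmed version suffixes from transcript IDs to match target_mapping")
  trimmed
}

# Toy IDs for illustration only.
ids <- c("ENST00000000001.2", "ENST00000000002.5")
t2g <- data.frame(target_id = c("ENST00000000001", "ENST00000000002"),
                  ens_gene = c("ENSG00000000001", "ENSG00000000002"),
                  stringsAsFactors = FALSE)
fixed <- suppressWarnings(reconcile_ids(ids, t2g))
```

Warning rather than trimming silently seems the safer choice here, since the user may not realise their IDs were changed.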
This was originally reported on the Google group:
https://groups.google.com/d/msg/kallisto-sleuth-users/NoT4ZD8SjEE/UZx9WBVWCQAJ
I have this problem, too. Could you find a solution for it?
@telia22 you can trim the ensembl IDs yourself before passing them to sleuth. Something like this should work:
t2g$target_id <- gsub(t2g$target_id, pattern="\\.[0-9]+$", replacement="")
If you would like to keep the versions, I have a quick and dirty workaround for the moment:
gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz
sed -E 's/^>(ENST[0-9]{2,}\.[0-9]{1,}).*(ENSG[0-9]{2,}\.[0-9]{1,}).*/\1,\2/' Homo_sapiens.GRCh38.cdna.all.fa | grep "^ENST" > tx2gene.csv
Once in R:
t2g <- read.csv(file="tx2gene.csv", sep=",", header=FALSE, col.names=c("target_id", "gene_id"))
Another way is to use BioMart to retrieve the IDs with the version information included. Something like:
library(biomaRt)
# Assumes `ensembl` is a Mart object, e.g. for human:
# ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
t2g <- getBM(attributes = c('ensembl_transcript_id_version', 'ensembl_gene_id_version', 'external_gene_name'), mart = ensembl)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id_version,
                     ens_gene = ensembl_gene_id_version, ext_gene = external_gene_name)
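Versions returned by BioMart come from whichever Ensembl release the mart points at, so they may differ from the versions in the transcriptome used for quantification. A rough check, sketched in base R with a hypothetical helper `mapping_coverage` and toy IDs, is to measure how many quantified IDs the mapping actually covers:

```r
# Hypothetical helper: fraction of quantified IDs present in the mapping.
mapping_coverage <- function(target_ids, t2g) {
  mean(target_ids %in% t2g$target_id)
}

# Toy IDs for illustration only.
ids <- c("ENST00000000001.2", "ENST00000000002.5", "ENST00000000003.1")
t2g <- data.frame(target_id = c("ENST00000000001.2", "ENST00000000002.5"),
                  ens_gene = c("ENSG00000000001.1", "ENSG00000000002.3"),
                  stringsAsFactors = FALSE)

coverage <- mapping_coverage(ids, t2g)      # share of IDs with a mapping
missing_ids <- setdiff(ids, t2g$target_id)  # IDs the mapping does not cover
```

If coverage is low, the mapping and the transcriptome probably come from different releases, and trimming the versions on both sides may be the simpler route.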