How to bring in custom annotations (BSgenome, TxDb)?

Open mirax87 opened this issue 4 years ago • 4 comments

Hi,

Thanks for this interesting tool. I am currently trying to get Ularcirc to run with some of my data.

Unfortunately, the reference genome used for my alignments doesn't match the UCSC chromosome naming conventions, so I thought of creating my own BSgenome and TxDb. I have already forged the BSgenome; the TxDb is yet to come.

For now, with the BSgenome loaded into the namespace, I tried to find it in the Shiny app under Setup configuration. My custom BSgenome was not listed. I imagine this is due to the missing TxDb (yet to be produced).

My question for you: Is it already possible to bring in a custom genome + annotation and, if so, how can I achieve that?

best, -Michael

mirax87 • Dec 10 '20 10:12

In theory it should be possible to bring in a custom genome + annotation. However, it requires that an annotation database is available: Ularcirc first searches for annotation database libraries that are named as follows:

org.<two letter Species code>.eg.db

so for humans this is

org.Hs.eg.db

The two-letter code is then used to identify matching genome and transcript databases.

If an annotation database library exists for your organism, then it sounds like you are very close to having all the required items.
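
To see which of these packages are already installed, a quick check along the following lines may help. It is only a sketch: `find_annotation_pkgs` is a hypothetical helper, and the idea that genome and transcript packages are matched by the prefixes `BSgenome.<code>` and `TxDb.<code>` is an assumption based on the usual Bioconductor naming (for example `BSgenome.Hsapiens.UCSC.hg38`), not something taken from the Ularcirc source.

```r
# Hypothetical helper: list installed packages that match a two-letter
# species code such as "Hs" or "Dm".
find_annotation_pkgs <- function(species_code,
                                 installed = rownames(installed.packages())) {
  # Assumption: matching is done on the package-name prefix.
  with_prefix <- function(prefix) {
    installed[startsWith(installed, paste0(prefix, ".", species_code))]
  }
  list(
    orgdb    = intersect(sprintf("org.%s.eg.db", species_code), installed),
    bsgenome = with_prefix("BSgenome"),
    txdb     = with_prefix("TxDb")
  )
}

# Example with a made-up package list instead of the real library
example_pkgs <- c("org.Hs.eg.db",
                  "BSgenome.Hsapiens.UCSC.hg38",
                  "TxDb.Hsapiens.UCSC.hg38.knownGene",
                  "BSgenome.dm6.ensembl")
str(find_annotation_pkgs("Hs", example_pkgs))
str(find_annotation_pkgs("Dm", example_pkgs))
```

Under this assumed rule, a custom package named `BSgenome.dm6.ensembl` would not be picked up for the code `Dm`, because the part after `BSgenome.` does not start with the species code.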

davhum • Dec 11 '20 02:12

What about the BSgenome and TxDb? They seem to be mandatory as well. Also, where does the annotation database need to be? It's checking somewhere online, right?

If a local installation of the database is possible, it would be great to have a wrapper where the user provides the genome FASTA, the genome annotation file (e.g. GTF), and whatever else might be necessary, to bring in custom annotations suitable for Ularcirc. Would that be feasible?

mirax87 • Feb 03 '21 11:02

Agreed, having a wrapper is a good idea, but I am unsure what is involved for some of those files. I have experience in making a TxDb from GTF, but have not generated a genome or annotation database. You mentioned you had generated the genome package; was that easy to do? I suspect the annotation database is the most involved.
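
For the TxDb part, one possible starting point is `makeTxDbFromGFF()`, which is provided by GenomicFeatures and, in recent Bioconductor releases, by txdbmaker. The sketch below is not a tested Ularcirc workflow: the wrapper name `make_txdb_from_gtf`, its arguments, and the file names in the usage comment are hypothetical, and whether Ularcirc can use a TxDb that is not installed as a package is not established in this thread.

```r
# Hypothetical wrapper: build a TxDb from a GTF file and optionally save it.
make_txdb_from_gtf <- function(gtf, organism = NA, out_sqlite = NULL) {
  stopifnot(file.exists(gtf))

  # Use whichever installed package provides makeTxDbFromGFF().
  provider <- NULL
  for (pkg in c("txdbmaker", "GenomicFeatures")) {
    if (requireNamespace(pkg, quietly = TRUE)) {
      provider <- pkg
      break
    }
  }
  if (is.null(provider)) {
    stop("Install txdbmaker or GenomicFeatures from Bioconductor first.")
  }

  make_fun <- getExportedValue(provider, "makeTxDbFromGFF")
  txdb <- make_fun(gtf, format = "gtf", organism = organism)

  # Saving as SQLite lets the TxDb be reloaded later with AnnotationDbi::loadDb().
  if (!is.null(out_sqlite)) {
    AnnotationDbi::saveDb(txdb, file = out_sqlite)
  }
  txdb
}

# Usage (file names are placeholders):
# txdb <- make_txdb_from_gtf("genes.gtf",
#                            organism = "Drosophila melanogaster",
#                            out_sqlite = "dm6_ensembl.txdb.sqlite")
```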

Perhaps another solution to your problem is to convert your alignment coordinates to UCSC coordinates. I could make a wrapper for that. If you could generate a small test dataset, I could write a simple method to convert it to a format that is compatible with existing databases.
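
As a rough illustration of what such a conversion could look like at the level of chromosome names, here is a base R sketch. It assumes the mismatch is limited to a missing `chr` prefix plus a few specially named sequences; `to_ucsc_names` and its `special` argument are hypothetical, and for real data `GenomeInfoDb::seqlevelsStyle()` is likely the more robust route.

```r
# Hypothetical helper: map Ensembl-style chromosome names to UCSC style.
# `special` holds names that need more than a "chr" prefix.
to_ucsc_names <- function(chrom, special = c(MT = "chrM")) {
  # Add the prefix unless the name already has it.
  out <- ifelse(startsWith(chrom, "chr"), chrom, paste0("chr", chrom))
  # Override the names listed in `special`.
  hit <- chrom %in% names(special)
  out[hit] <- special[chrom[hit]]
  unname(out)
}

print(to_ucsc_names(c("2L", "X", "MT", "chr3R")))

# Other special cases can be passed explicitly:
print(to_ucsc_names("mitochondrion_genome",
                    special = c(mitochondrion_genome = "chrM")))
```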

davhum • Feb 03 '21 11:02

I thought about converting the alignments, or even remapping, but the downstream effects of the conversion would be too costly for me, as I am using additional tools for circRNA prediction and quantification (mostly from the CIRI world). Thank you for the offer, though.

Regarding the BSgenome, I think it's not too tricky and believe it can be automated (in a wrapper). The BSgenome package has some documentation on how to forge a new one. In brief, you create a sort of dictionary (seed.dcf) with all relevant BSgenome information and compile it with BSgenome::forgeBSgenomeDataPkg. There are more forums and discussions around that can be of help. Here is the BSgenome documentation; look for 'How to forge a BSgenome data package'.

  • https://bioconductor.org/packages/release/bioc/html/BSgenome.html

This is what the seed.dcf file looks like in my case, but I cannot guarantee that these are the minimum specs:

Package: BSgenome.dm6.ensembl
Title: "dm6 from local repository"
Description: "compatible with snakePipes alignments"
Version: 0.999                                            # random number
organism: Drosophila_melanogaster
common_name: Fruitfly
provider: FlyBase
provider_version: dm6
release_name: dm6
release_date: 2018_03
source_url: <path to fasta directory>
organism_biocview: dm6_ensembl
BSgenomeObjname: dm6_ensembl
seqs_srcdir: <path to fasta directory>
seqfile_name: genome.2bit                                  # genome in 2bit
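
With a seed file like this in place, the forging step could be scripted roughly as follows. This is a sketch under assumptions: `check_seed` and `forge_from_seed` are hypothetical helpers, the three fields checked are only the ones this sketch relies on (not BSgenome's full list of required fields), and the trailing `#` notes in the listing are assumed to be remarks for the reader rather than part of the real file, since `read.dcf()` is not expected to strip inline comments. In recent Bioconductor releases `forgeBSgenomeDataPkg()` is provided by BSgenomeForge, so the code tries that package first.

```r
# Hypothetical pre-flight check: read the seed and confirm that the
# sequence file it points to actually exists.
check_seed <- function(seed_file) {
  seed <- read.dcf(seed_file)
  needed <- c("Package", "seqs_srcdir", "seqfile_name")
  absent <- setdiff(needed, colnames(seed))
  if (length(absent) > 0) {
    return(paste("missing field:", absent))
  }
  seq_path <- file.path(seed[1, "seqs_srcdir"], seed[1, "seqfile_name"])
  if (!file.exists(seq_path)) {
    return(paste("sequence file not found:", seq_path))
  }
  character(0)
}

# Hypothetical wrapper around forgeBSgenomeDataPkg().
forge_from_seed <- function(seed_file, destdir = ".") {
  problems <- check_seed(seed_file)
  if (length(problems) > 0) stop(paste(problems, collapse = "; "))

  provider <- if (requireNamespace("BSgenomeForge", quietly = TRUE)) {
    "BSgenomeForge"
  } else {
    "BSgenome"
  }
  forge <- getExportedValue(provider, "forgeBSgenomeDataPkg")
  forge(seed_file, destdir = destdir)
  # The generated source package still has to be built and installed,
  # e.g. with `R CMD build` and `R CMD INSTALL` on the new directory.
}

# Usage (path is a placeholder):
# forge_from_seed("seed.dcf", destdir = ".")
```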

Genome fasta to 2bit conversion

  • https://genome.ucsc.edu/goldenPath/help/twoBit.html
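
The page linked above describes the UCSC command-line route. Staying within R, the conversion might be sketched with Biostrings and rtracklayer as below. The function name `fasta_to_2bit` and the file names in the usage comment are hypothetical, and trimming the sequence names to the first word of each FASTA header is an assumption about what the downstream chromosome names should be.

```r
# Hypothetical helper: convert a genome FASTA file to 2bit format.
fasta_to_2bit <- function(fasta, twobit) {
  stopifnot(file.exists(fasta))
  for (pkg in c("Biostrings", "rtracklayer")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Install ", pkg, " from Bioconductor first.")
    }
  }

  seqs <- Biostrings::readDNAStringSet(fasta)
  # Keep only the first word of each header so that names such as
  # "2L dna:chromosome ..." become "2L".
  names(seqs) <- sub("\\s.*$", "", names(seqs))

  rtracklayer::export.2bit(seqs, twobit)
  invisible(twobit)
}

# Usage (file names are placeholders):
# fasta_to_2bit("genome.fa", "genome.2bit")
```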

mirax87 • Feb 03 '21 11:02