circDeep

End-to-End learning framework for circular RNA classification from other long non-coding RNA using multimodal deep learning

circDeep: Deep learning approach for circular RNA classification from other long non-coding RNA

circDeep fuses a Reverse Complement Matching (RCM) descriptor, an Asymmetric Convolutional Neural Network combined with Bidirectional Long Short-Term Memory (ACNN-BLSTM) sequence descriptor, and a conservation descriptor into high-level abstraction descriptors, in which the shared representations across the different modalities are integrated. Experiments show that circDeep is not only faster than existing tools but also more accurate, achieving an increase in accuracy of more than 12 percent over them.

Prerequisites

We recommend using the Anaconda 3 platform.

Installation

Download circDeep by cloning the repository:

git clone https://github.com/UofLBioinformatics/circDeep

Installation has been tested on the Anaconda platform (Linux/Windows) with Python 3.

Usage

usage: circDeep.py [-h] --data_dir DATA_DIR --train TRAIN --genome GENOME
               --gtf GTF --bigwig BIGWIG
               [--seq SEQ] [--rcm RCM] [--cons CONS] [--predict PREDICT]
               [--out_file OUT_FILE] [--model_dir MODEL_DIR]
               [--positive_bed POSITIVE_BED] [--negative_bed NEGATIVE_BED]
               [--testing_bed TESTING_BED]

circular RNA classification from other long non-coding RNA using multimodal deep learning

Required arguments:
===================
  --data_dir <data_directory>
                        Directory containing the descriptor files used for
                        training, the label file, the genome sequence file,
                        the GTF annotation file and the bigWig file
  --train TRAIN         Set to True to train the model, False to test it
  --genome GENOME       Genome sequence, e.g., hg38.fa
  --gtf GTF             The GTF annotation file, e.g., hg38.gtf
  --bigwig BIGWIG       Conservation scores in bigWig file format
                        
Optional arguments:
===================

  -h, --help            Show this help message and exit
  --seq SEQ             Use the ACNN-BLSTM sequence modality
  --rcm RCM             Use the RCM modality
  --cons CONS           Use the conservation modality
  --predict PREDICT     Predict circular RNAs. If --train is used, this is
                        set to False
  --out_file OUT_FILE   The output file used to store the prediction
                        probability of the testing data
  --model_dir MODEL_DIR
                        The directory in which to save the trained models for
                        future prediction
  --positive_bed POSITIVE_BED
                        BED input file of circular RNAs for training; each
                        line should look like: chromosome start end gene
  --negative_bed NEGATIVE_BED
                        BED input file of other long non-coding RNAs for
                        training; each line should look like: chromosome
                        start end gene
  --testing_bed TESTING_BED
                        BED input file of testing data; each line should
                        look like: chromosome start end gene
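
The option list above passes booleans as the strings True and False. The following is a minimal sketch of how such an interface could be declared with Python's argparse module. It is not the circDeep source: the str2bool helper, the parse_cli function and every default value are assumptions made for illustration.

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert 'True'/'False' style strings (as used in the examples) to bool."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "1"}:
        return True
    if lowered in {"false", "f", "no", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the listed options; defaults are assumptions.
    parser = argparse.ArgumentParser(
        prog="circDeep.py",
        description="circular RNA classification from other long non-coding "
                    "RNA using multimodal deep learning",
    )
    parser.add_argument("--data_dir", required=True)
    parser.add_argument("--train", type=str2bool, required=True)
    parser.add_argument("--genome", required=True)
    parser.add_argument("--gtf", required=True)
    parser.add_argument("--bigwig", required=True)
    for name in ("seq", "rcm", "cons"):
        # One on/off switch per modality.
        parser.add_argument(f"--{name}", type=str2bool, default=True)
    parser.add_argument("--predict", type=str2bool, default=False)
    parser.add_argument("--out_file", default="prediction.txt")
    parser.add_argument("--model_dir", default="models/")
    parser.add_argument("--positive_bed")
    parser.add_argument("--negative_bed")
    parser.add_argument("--testing_bed")
    return parser


def parse_cli(argv):
    args = build_parser().parse_args(argv)
    # When training, prediction is switched off, as the option list states.
    if args.train:
        args.predict = False
    return args
```

Parsing the strings explicitly avoids a common pitfall: with `type=bool`, argparse would treat the non-empty string "False" as true.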

Example

Train the model:

In our experiments, we used circular RNAs from circRNADb and a negative dataset from GENCODE. The original coordinates of our datasets were in the hg19 genome, and we converted them to hg38 using the liftOver tool provided by the UCSC Genome Browser. You also need to download all necessary files and put them in the data directory.
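
The coordinate conversion mentioned above can be scripted. The sketch below wraps the UCSC liftOver command-line tool, whose positional arguments are the input file, the chain file, the output file and a file for unmapped regions. It assumes liftOver is installed and on PATH and that a hg19-to-hg38 chain file has been downloaded from UCSC. The function names and all file names are placeholders, and this is not necessarily the exact procedure the authors followed.

```python
import shutil
import subprocess
from pathlib import Path


def liftover_command(bed_in, chain, bed_out, unmapped, exe="liftOver"):
    """Build the argument list: liftOver oldFile map.chain newFile unMapped."""
    return [exe, str(bed_in), str(chain), str(bed_out), str(unmapped)]


def run_liftover(bed_in, chain, bed_out, unmapped):
    """Convert coordinates with liftOver and return the path of the new file."""
    cmd = liftover_command(bed_in, chain, bed_out, unmapped)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("liftOver executable not found on PATH")
    # check=True raises CalledProcessError if liftOver exits with an error.
    subprocess.run(cmd, check=True)
    return Path(bed_out)


if __name__ == "__main__":
    # Placeholder file names; replace them with your own paths.
    run_liftover(
        "data/circRNA_dataset.hg19.bed",
        "data/hg19ToHg38.over.chain.gz",
        "data/circRNA_dataset.bed",
        "data/circRNA_dataset.unmapped.bed",
    )
```

Regions that cannot be mapped are written to the unmapped file, so it is worth checking that file before training.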

python3 circDeep.py --data_dir 'data/' --train True --model_dir 'models/' --seq True --rcm True --cons True --genome 'data/hg38.fasta' --gtf 'data/Homo_sapiens.Ensembl.GRCh38.82.gtf' --bigwig 'data/hg38.phastCons20way.bw' --positive_bed 'data/circRNA_dataset.bed' --negative_bed 'data/negative_dataset.bed'

Test the model:

python3 circDeep.py --data_dir 'data/' --train False --model_dir 'models/' --seq True --rcm True --cons True --genome 'data/hg38.fasta' --gtf 'data/Homo_sapiens.Ensembl.GRCh38.82.gtf' --bigwig 'data/hg38.phastCons20way.bw' --testing_bed 'data/test.bed'

Note:

Input data files for training and testing should be in BED format:

chr17 17507350 17508308 + gene1
chr11 48014405 48015855 - gene2
chr17 77469161 77472770 - gene3
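
Before training or testing, it can help to check that an input file matches the layout of the example lines above. The sketch below assumes five whitespace-separated columns (chromosome, start, end, strand, gene name), which is inferred from those example lines only. BedRecord, parse_bed_line and read_bed are hypothetical helpers, not part of circDeep.

```python
from typing import Iterable, List, NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str
    name: str


def parse_bed_line(line: str) -> BedRecord:
    """Parse one line such as 'chr17 17507350 17508308 + gene1'."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    chrom, start, end, strand, name = fields
    if strand not in {"+", "-"}:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    start_i, end_i = int(start), int(end)
    if start_i >= end_i:
        raise ValueError(f"start must be smaller than end: {line!r}")
    return BedRecord(chrom, start_i, end_i, strand, name)


def read_bed(lines: Iterable[str]) -> List[BedRecord]:
    """Parse all non-empty, non-comment lines."""
    return [
        parse_bed_line(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]


if __name__ == "__main__":
    example = [
        "chr17 17507350 17508308 + gene1",
        "chr11 48014405 48015855 - gene2",
        "chr17 77469161 77472770 - gene3",
    ]
    for record in read_bed(example):
        print(record.name, record.chrom, record.end - record.start)
```

To check a real file, pass an open file handle, for example `read_bed(open("data/test.bed"))`.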

License

Copyright (C) 2017 . See the LICENSE file for license rights and limitations (MIT).

Citation

Mohamed Chaabane, Robert M Williams, Austin T Stephens, Juw Won Park, circDeep: deep learning approach for circular RNA classification from other long non-coding RNA, Bioinformatics, Volume 36, Issue 1, 1 January 2020, Pages 73–80, https://doi.org/10.1093/bioinformatics/btz537