bed_annotation
Annotate based on refGene.*.txt.gz
Currently, annotation data has to be initialised in 2 steps:
- Make RefSeq_knownGene.*.txt using the UCSC browser,
- Run generate_refseq_data.py, which reads RefSeq_knownGene.*.txt and generates all_features.*.bed.
After that, annotate_bed.py can use all_features.*.bed to annotate BED files on request.
I want to avoid that initialization step and instead store the original RefSeq file, which can be downloaded directly from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz, and the same for hg38). This file is easier to support and update (RefSeq has a new release every day), and I can integrate the annotation into BCBio and use their cool system to update reference data.
So I want annotate_bed.py to be able to work directly from refGene.*.txt.gz. The files are already downloaded, and I created the function get_refseq_gene(genome) in __init__.py that returns the path.
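A minimal sketch of what reading refGene.*.txt.gz directly could look like. It assumes the standard UCSC refGene column order (a leading bin column followed by the genePredExt fields) and 0-based, half-open coordinates. The names Transcript, parse_refgene_line and iter_refgene are hypothetical and are not part of the existing code.

```python
import gzip
from collections import namedtuple

# Assumed column order of the UCSC refGene table (genePredExt plus a leading "bin").
REFGENE_COLUMNS = (
    "bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd",
    "exonCount", "exonStarts", "exonEnds", "score", "name2",
    "cdsStartStat", "cdsEndStat", "exonFrames",
)

# Hypothetical record type, used only for this illustration.
Transcript = namedtuple(
    "Transcript", "chrom start end strand transcript_id gene exons"
)


def parse_refgene_line(line):
    """Turn one tab-separated refGene row into a Transcript."""
    row = dict(zip(REFGENE_COLUMNS, line.rstrip("\n").split("\t")))
    # exonStarts/exonEnds are comma-separated lists with a trailing comma.
    starts = [int(x) for x in row["exonStarts"].split(",") if x]
    ends = [int(x) for x in row["exonEnds"].split(",") if x]
    return Transcript(
        chrom=row["chrom"],
        start=int(row["txStart"]),
        end=int(row["txEnd"]),
        strand=row["strand"],
        transcript_id=row["name"],
        gene=row["name2"],
        exons=list(zip(starts, ends)),
    )


def iter_refgene(path):
    """Yield Transcript records from a gzipped refGene file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():  # skip empty lines
                yield parse_refgene_line(line)
```

With this in place, annotate_bed.py could loop over iter_refgene(get_refseq_gene(genome)) instead of reading a pre-generated all_features.*.bed.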
Since the file is gzipped, it can probably be tabixed and used more efficiently for annotation. Currently annotate_bed.py uses all_features.*.bed, which is sorted so that bedtools intersect can work fast, but it won't be as easy with refGene.*.txt.gz.
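One possible way around the sorting problem, sketched under assumptions: tabix needs a bgzip-compressed, position-sorted file, and the UCSC download is, as far as I know, plain gzip and not guaranteed to be coordinate-sorted, so an intermediate conversion is probably still needed. The sketch below converts refGene rows into exon-level BED6 lines sorted by chromosome and start. The function names and the gene|transcript naming scheme are hypothetical, and the column positions assume the standard refGene layout.

```python
import gzip


def refgene_to_sorted_bed(lines):
    """Convert refGene rows into exon-level BED6 lines sorted by (chrom, start, end)."""
    bed_rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue  # skip blank or malformed rows
        transcript_id, chrom, strand = fields[1], fields[2], fields[3]
        gene = fields[12]
        # exonStarts/exonEnds are comma-separated with a trailing comma.
        starts = [int(x) for x in fields[9].split(",") if x]
        ends = [int(x) for x in fields[10].split(",") if x]
        for start, end in zip(starts, ends):
            # BED6 columns: chrom, start, end, name, score, strand
            bed_rows.append((chrom, start, end, f"{gene}|{transcript_id}", "0", strand))
    # Lexicographic chromosome order plus numeric start, like `sort -k1,1 -k2,2n`.
    bed_rows.sort(key=lambda row: (row[0], row[1], row[2]))
    return ["\t".join(str(value) for value in row) for row in bed_rows]


def write_sorted_bed(refgene_path, bed_path):
    """Read a gzipped refGene file and write a sorted, uncompressed BED file."""
    with gzip.open(refgene_path, "rt") as handle:
        bed_lines = refgene_to_sorted_bed(handle)
    with open(bed_path, "w") as out:
        for bed_line in bed_lines:
            out.write(bed_line + "\n")
```

The resulting file could be passed to bedtools intersect as before, or compressed with bgzip and indexed with tabix -p bed for region queries. The conversion is cheap enough to run on demand, which would still remove the manual initialization step.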
Results differ a lot when using UCSC refFlat to annotate. Can you add such a function? Thanks a lot.