bed_annotation

Annotate based on refGene.*.txt.gz

Open vladsavelyev opened this issue 8 years ago • 1 comment

Currently, annotation data has to be initialised in 2 steps:

  1. Make RefSeq_knownGene.*.txt using UCSC browser,
  2. Run generate_refseq_data.py, which reads RefSeq_knownGene.*.txt and generates all_features.*.bed. After that, annotate_bed.py can use all_features.*.bed to annotate BED files on request.

I want to avoid that initialization step and instead store the original RefSeq file, which can be downloaded directly from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz, and the same for hg38). This file is easier to maintain and update (RefSeq has a new release every day), and it would let me integrate annotation into BCBio and use their cool system to update reference data.

So I want annotate_bed.py to work directly from refGene.*.txt.gz. The files are already downloaded, and I added a function get_refseq_gene(genome) in __init__.py that returns the path.
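To make the idea concrete, here is a minimal sketch of reading refGene.*.txt.gz directly and turning each row into BED-like transcript and exon features. It assumes the standard UCSC refGene column layout (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, score, name2, ...). The names Feature and iter_refgene_features are hypothetical and not part of bed_annotation; the path would come from get_refseq_gene(genome).

```python
import gzip
from typing import Iterator, NamedTuple


class Feature(NamedTuple):
    """One BED-like record (hypothetical structure, for illustration only)."""
    chrom: str
    start: int  # 0-based, as stored in refGene
    end: int    # end-exclusive
    gene: str
    transcript: str
    strand: str
    kind: str   # "Transcript" or "Exon"


def iter_refgene_features(path: str) -> Iterator[Feature]:
    """Yield transcript and exon features from a gzipped refGene table.

    Assumes the UCSC refGene layout: column 2 is the transcript ID,
    3 the chromosome, 4 the strand, 5-6 txStart/txEnd, 10-11 the
    comma-separated exon starts/ends, 13 the gene symbol (name2).
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            transcript, chrom, strand = fields[1], fields[2], fields[3]
            tx_start, tx_end = int(fields[4]), int(fields[5])
            gene = fields[12]
            yield Feature(chrom, tx_start, tx_end, gene, transcript, strand, "Transcript")
            # Exon coordinate lists end with a trailing comma, so skip empty items.
            starts = [int(x) for x in fields[9].split(",") if x]
            ends = [int(x) for x in fields[10].split(",") if x]
            for exon_start, exon_end in zip(starts, ends):
                yield Feature(chrom, exon_start, exon_end, gene, transcript, strand, "Exon")
```

Something like `for feat in iter_refgene_features(get_refseq_gene('hg19'))` would then replace reading all_features.*.bed, at the cost of parsing the table on every run.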

Since the file is gzipped, it can probably be tabix-indexed and queried more efficiently for annotation. Currently annotate_bed.py uses all_features.*.bed, which is sorted so that bedtools intersect can work fast, but that won't be as easy with refGene.*.txt.gz.
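One caveat on the tabix idea: tabix expects a coordinate-sorted, bgzip-compressed file, and the UCSC download is plain gzip with no guaranteed order, so a re-sort would probably be needed first. A minimal sketch of that sorting step, assuming the standard refGene layout (chrom in column 3, txStart and txEnd in columns 5 and 6); sort_refgene_lines is a hypothetical helper, not something in the repository:

```python
from typing import Iterable, List, Tuple


def sort_refgene_lines(lines: Iterable[str]) -> List[str]:
    """Return refGene rows sorted by chromosome, then txStart, then txEnd.

    Chromosomes are ordered lexically, which should be enough for an index
    that only needs each chromosome's rows to be contiguous and sorted by
    position.
    """
    def sort_key(line: str) -> Tuple[str, int, int]:
        fields = line.split("\t")
        return fields[2], int(fields[4]), int(fields[5])

    # Blank lines are dropped; everything else is kept verbatim.
    return sorted((line for line in lines if line.strip()), key=sort_key)
```

The sorted rows could then be written out, compressed with bgzip, and indexed with something like `tabix -0 -s 3 -b 5 -e 6`, though I have not tried that on this file.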

vladsavelyev avatar Jun 30 '16 22:06 vladsavelyev

Results differ a lot when using UCSC refFlat to annotate. Can you add this function? Thanks a lot.

worker000000 avatar Sep 14 '21 05:09 worker000000