Creating annotation from a gtf? (+ALE/AFE/TandemUTRs?)

Open olgabot opened this issue 11 years ago • 5 comments

Are there plans to create an annotate_events.py type script that could take a gtf (such as a gencode GTF) and create an annotation? I'd like to be able to rerun annotation creation with each new gencode release, rather than waiting for UCSC to update their tables. I just spent some time reading the code last night so I might have missed this option.

Also, the first annotation versions have Alternative First Exon (AFE), Alt Last Exon (ALE) and TandemUTR annotations, but I didn't see these as options in gff_annotate_events.py. Are there plans to add these?

Thanks! Olga

olgabot avatar Feb 20 '14 17:02 olgabot

Hi Olga,

You can use the existing code to create events from a GTF by first converting the GTF of interest (e.g. Gencode) to genePred format and then feeding it to annotate_events.py. The information is the same in both formats, a listing of which exons go in which transcript; only the way it is specified differs.
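To illustrate why the two formats carry the same information, here is a minimal sketch that groups GTF exon records by transcript and writes genePred-style rows. It is not part of rnaseqlib; the function name `gtf_exons_to_genepred` is hypothetical, and it assumes the basic ten-column genePred layout (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds). It does not parse CDS records, so the CDS columns are only a placeholder. In practice a dedicated converter such as UCSC's gtfToGenePred is the safer choice.

```python
import re
from collections import defaultdict


def gtf_exons_to_genepred(gtf_lines):
    """Group GTF exon records by transcript_id and return genePred-style rows.

    Hypothetical helper for illustration only; CDS features are ignored.
    """
    transcripts = defaultdict(lambda: {"chrom": None, "strand": None, "exons": []})
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        tx = transcripts[match.group(1)]
        tx["chrom"], tx["strand"] = fields[0], fields[6]
        # GTF coordinates are 1-based inclusive; genePred is 0-based half-open.
        tx["exons"].append((int(fields[3]) - 1, int(fields[4])))

    rows = []
    for name, tx in transcripts.items():
        exons = sorted(tx["exons"])
        tx_start = exons[0][0]
        tx_end = max(end for _, end in exons)
        exon_starts = "".join("%d," % start for start, _ in exons)
        exon_ends = "".join("%d," % end for _, end in exons)
        # No CDS parsed: cdsStart and cdsEnd are both set to txEnd as a placeholder.
        rows.append("\t".join([
            name, tx["chrom"], tx["strand"],
            str(tx_start), str(tx_end), str(tx_end), str(tx_end),
            str(len(exons)), exon_starts, exon_ends,
        ]))
    return rows
```

Each row lists one transcript with its exon starts and ends, which is exactly the "which exons go in which transcript" listing described above.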

AFE/ALE and TandemUTR are on my list for gff_annotate_events.py, though I predict it won't happen for a few weeks. AFE/ALE is straightforward, but for TandemUTR we'll need a good source of TandemUTRs. UCSC altEvents is probably not the best choice, so we've been using polyA db. If you have thoughts on the best input annotation for this, let me know.
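As a rough sketch of why AFE/ALE can be derived from transcript structure alone, the snippet below flags genes whose transcripts disagree on their first or last exon. This is an assumption about how such events could be defined, not rnaseqlib's method: the function names are hypothetical, and a real definition would likely add constraints (for example, requiring the alternative exons not to overlap).

```python
from collections import defaultdict


def terminal_exons(exons, strand):
    """Return (first_exon, last_exon) in transcription order for one transcript."""
    ordered = sorted(exons)
    if strand == "-":
        # On the minus strand, transcription runs from high to low coordinates.
        ordered.reverse()
    return ordered[0], ordered[-1]


def find_afe_ale(transcripts):
    """Find genes with more than one distinct first (AFE) or last (ALE) exon.

    transcripts: iterable of (gene, strand, [(start, end), ...]) tuples.
    Returns two dicts mapping gene -> sorted list of distinct terminal exons.
    """
    firsts = defaultdict(set)
    lasts = defaultdict(set)
    for gene, strand, exons in transcripts:
        first, last = terminal_exons(exons, strand)
        firsts[gene].add(first)
        lasts[gene].add(last)
    afe = {gene: sorted(ex) for gene, ex in firsts.items() if len(ex) > 1}
    ale = {gene: sorted(ex) for gene, ex in lasts.items() if len(ex) > 1}
    return afe, ale
```

TandemUTRs are not covered here, since, as noted above, they need an external source of polyadenylation sites.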

Best, --Yarden

yarden avatar Feb 20 '14 19:02 yarden

Hmm, annotate_events.py seems to be missing the function make_annotation. It does exist in gff_annotate_events.py, but that version seems to be incompatible with the args of annotate_events.py. Suggestions?

olgabot avatar Feb 24 '14 18:02 olgabot

What operation on GFFs are you trying to do? Are you making gene annotations for a GFF?

I think annotate_events.py is deprecated. I just checked in code to remove that dangling function (see clip branch).

The correct script is gff_annotate_events, which after installation gets installed as an executable script (note the lack of a .py extension). Same for gff_make_annotation. This is how it runs for me:

$ gff_make_annotation --help
/home/yarden/jaen/.local/bin/gff_make_annotation:5: UserWarning: Module mpl_toolkits was already imported from None, but /usr/local/lib/python2.7/dist-packages/matplotlib-1.2.0-py2.7-linux-x86_64.egg is being added to sys.path
  from pkg_resources import load_entry_point
usage: gff_make_annotation [-h] [--flanking-rule FLANKING_RULE] [--multi-iso]
                           [--genome-label GENOME_LABEL] [--sanitize]
                           tables_dir output_dir

positional arguments:
  tables_dir            Directory where UCSC tables are. These are used in
                        making the annotation.
  output_dir            Output directory.

optional arguments:
  -h, --help            show this help message and exit
  --flanking-rule FLANKING_RULE
                        Rule to use when defining exon trios. E.g.
                        'commonshortest' to use the most common and shortest
                        regions are flanking exons to an alternative trio.
  --multi-iso           If passed, generates multi-isoform annotations. Off by
                        default.
  --genome-label GENOME_LABEL
                        If given, used as label for genome in output files.
  --sanitize            If passed, sanitize the annotation. Off by default.

and:

$ gff_annotate_events --help
/home/yarden/jaen/.local/bin/gff_annotate_events:5: UserWarning: Module mpl_toolkits was already imported from None, but /usr/local/lib/python2.7/dist-packages/matplotlib-1.2.0-py2.7-linux-x86_64.egg is being added to sys.path
  from pkg_resources import load_entry_point
usage: gff_annotate_events [-h] [--in-place] gff_filename table_filename

positional arguments:
  gff_filename    GFF filename to annotate with gene information.
  table_filename  Table contains txStart/txEnd sites and the gene fields.

optional arguments:
  -h, --help      show this help message and exit
  --in-place      If passed, outputs annotation in place (i.e. overwriting the
                  passed in file.) Also sanitizes the GFF.

yarden avatar Feb 24 '14 20:02 yarden

Yes, I'm trying to make an annotation starting from a gencode gtf. I have the genePred version of the file (which I'm guessing is a "table"? but this is unclear). What goes in the "tables_dir" for gff_make_annotation? And what is the order of operations? 1. gff_make_annotation and then 2. gff_annotate_events ?


Olga Botvinnik PhD Program in Bioinformatics and Systems Biology Gene Yeo Laboratory http://yeolab.ucsd.edu/yeolab/Home.html | Sanford Consortium for Regenerative Medicine University of California, San Diego www http://olgabotvinnik.com | blog http://blog.olgabotvinnik.com/ | github http://github.com/olgabot | twitter http://twitter.com/olgabot | linkedin http://www.linkedin.com/in/olgabotvinnik

olgabot avatar Feb 24 '14 20:02 olgabot

First run gff_make_annotation, then annotate the resulting GFF with gene information using gff_annotate_events. The table argument to gff_make_annotation is a genePred UCSC table.
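The order of operations can be sketched as a small Python wrapper around the two commands. It uses only the arguments shown in the --help output earlier in this thread. Every path is a hypothetical placeholder; in particular, the name of the events GFF that gff_make_annotation writes is not documented here, so the caller has to supply it. The function names are likewise made up for illustration.

```python
import subprocess


def build_pipeline(tables_dir, output_dir, events_gff, gene_table, genome_label=None):
    """Build the two command lines in the order they should run.

    All paths are placeholders chosen by the caller. events_gff must point
    at a GFF produced by step 1; its actual filename is an assumption.
    """
    make_cmd = ["gff_make_annotation"]
    if genome_label:
        make_cmd += ["--genome-label", genome_label]
    make_cmd += [tables_dir, output_dir]
    # Step 2 annotates a GFF from step 1 with gene information from a table.
    annotate_cmd = ["gff_annotate_events", events_gff, gene_table]
    return [make_cmd, annotate_cmd]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.check_call(cmd)
```

Building the argument lists separately from running them makes it easy to print and check the commands before anything touches the filesystem.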

Keep in mind rnaseqlib is unpublished code, basically personal scripts, so much of it is undocumented.

yarden avatar Feb 24 '14 20:02 yarden