
Create tool for producing genomic regions (as a BED file)

Open LeeTL1220 opened this issue 4 years ago • 1 comments

Feature request

Tool(s) or class(es) involved

This is a request for a new tool, GencodeRegionsAsBED.

Description

Given a GENCODE GTF, create a BED file containing the genomic region of each gene. Each row is a gene.

Suggestion: This can be implemented as a FeatureWalker<GencodeGtfFeature>

Requirements

  • [P0] Union all basic, coding transcripts of a gene to determine its region. "basic" is a tag, defined by GENCODE, that appears on transcripts in the GTF.
  • [P0] Include an option to emit one row per transcript instead of one row per gene. Please include both the gene and the transcript ID in the output BED. Transcript entries should be sorted in natural order (in this case, natural order and alphabetical order will be the same).
  • [P0] Must support GENCODE v35 and above (through the latest release at the time of implementation)
  • [P0] Supports hg38 (note that this is implicit in the GENCODE version)
  • [P2] Include option that will create the BED file based on both basic and non-basic transcripts
  • [P2] Include option that will create the BED file based on both coding and non-coding transcripts
  • [P2] Include option to break out exon vs intron vs UTR, etc.
  • [P2] Support hg19/b37, which means supporting earlier versions of GENCODE.
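To make the first P0 requirement concrete, here is a minimal Python sketch of the transcript selection step. It is an illustration only, not the proposed GATK implementation. It assumes that "coding" means `transcript_type "protein_coding"` and that gene names come from the `gene_name` attribute; both are assumptions, as is the function name `select_basic_coding`. GTF coordinates are 1-based and inclusive while BED is 0-based and half-open, so the sketch subtracts 1 from the start.

```python
import re

# Matches quoted GTF attributes such as: gene_name "MAPK1"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Split a GTF attribute column into (first value per key, set of tags)."""
    attrs, tags = {}, set()
    for key, value in ATTR_RE.findall(field):
        if key == "tag":
            tags.add(value)  # "tag" may repeat, e.g. basic, CCDS
        else:
            attrs.setdefault(key, value)
    return attrs, tags


def select_basic_coding(gtf_lines):
    """Yield (chrom, start0, end, gene_name, transcript_id) for each
    transcript record that is tagged "basic" and is protein coding."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue  # skip gene, exon, CDS, UTR, ... records
        attrs, tags = parse_attributes(cols[8])
        if "basic" not in tags:
            continue
        # Assumption: "coding" is read from transcript_type.
        if attrs.get("transcript_type") != "protein_coding":
            continue
        start0 = int(cols[3]) - 1  # GTF 1-based inclusive -> BED 0-based
        end = int(cols[4])         # BED end is exclusive, so unchanged
        yield cols[0], start0, end, attrs["gene_name"], attrs["transcript_id"]
```

The P2 options for non-basic or non-coding transcripts would amount to relaxing the two filters above.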

[P0] = "Must have. Cannot close this issue without this feature or without filing another issue. This tool is not considered complete without this feature."

[P2] = "Not required. This tool can be considered complete without this feature. No need to ask permission to drop it. If it is NOT delivered, please mention what P2's were not delivered in the closing comment of this issue."

Example output

BED is tab-delimited...

...
chr22	21759657	21867680	MAPK1
...

With transcript option:

...
chr22	21759657	21867645	MAPK1,ENST00000215832.11
chr22	21769040	21867680	MAPK1,ENST00000398822.7
chr22	21769204	21867440	MAPK1,ENST00000544786.1
...

Note: The union of the transcript regions is reported when the transcript option is not present.
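The relationship between the two output modes can be sketched in Python as follows. This is a hedged illustration rather than the tool's implementation: the function name `bed_rows` and the tuple layout are hypothetical, the union is taken as the span from the smallest start to the largest end (which matches the MAPK1 example above), and sorting chromosomes as plain strings is a placeholder since the issue does not specify a chromosome order.

```python
def bed_rows(transcripts, per_transcript=False):
    """Build tab-delimited BED rows.

    transcripts: iterable of (chrom, start, end, gene_name, transcript_id).
    """
    transcripts = list(transcripts)
    if per_transcript:
        # One row per transcript, sorted by gene and then transcript ID.
        ordered = sorted(transcripts, key=lambda t: (t[0], t[3], t[4]))
        return [f"{c}\t{s}\t{e}\t{g},{tid}" for c, s, e, g, tid in ordered]

    # One row per gene: span from the smallest start to the largest end.
    spans = {}
    for c, s, e, g, _tid in transcripts:
        lo, hi = spans.get((c, g), (s, e))
        spans[(c, g)] = (min(lo, s), max(hi, e))
    return [f"{c}\t{lo}\t{hi}\t{g}" for (c, g), (lo, hi) in sorted(spans.items())]


# The three MAPK1 transcripts from the example output above.
mapk1 = [
    ("chr22", 21769040, 21867680, "MAPK1", "ENST00000398822.7"),
    ("chr22", 21759657, 21867645, "MAPK1", "ENST00000215832.11"),
    ("chr22", 21769204, 21867440, "MAPK1", "ENST00000544786.1"),
]
print("\n".join(bed_rows(mapk1)))
print("\n".join(bed_rows(mapk1, per_transcript=True)))
```

With the example coordinates, the gene-level call collapses the three transcripts into the single `chr22 21759657 21867680 MAPK1` row shown earlier.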

LeeTL1220 avatar Mar 24 '21 15:03 LeeTL1220