BED/GFF headers

Open fgvieira opened this issue 12 years ago • 5 comments

Right now header lines have to start with '#' (comment), and they are generally discarded from the output (e.g. by subtractBed).

It would be nice if headers could be properly handled and printed to the output. Maybe add an option that would not parse the first line and just print it accordingly.

fgvieira avatar May 08 '13 21:05 fgvieira

:+1: In particular, the GFF version 3 header ##gff-version 3 must be maintained.

sjackman avatar May 21 '14 18:05 sjackman

Is the -header option not working for you? The example below is from the bedtools2 repository (please file issues there):

curl -s ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz | gzcat | head -20 > test.gtf

bedtools --version
bedtools v2.19.1

bedtools intersect -header -a test.gtf -b test.gtf | head
##description: evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74)
##provider: GENCODE
##contact: [email protected]
##format: gtf
##date: 2013-12-05
chr1    HAVANA  gene    11869   14412   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  gene    11869   12227   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  gene    12613   12721   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  gene    13221   14409   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

arq5x avatar May 21 '14 18:05 arq5x

Ah, I see now that you are referring to the fact that some of the tools don't support this functionality. In bedtools2, we are slowly working through standardizing the API for all of the tools. Once done, the result will be that all of the tools (when relevant) will support the -header option.
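For tools that do not yet accept -header, one possible workaround is to carry the '#' lines around the tool by hand. The sketch below is only an illustration: `with_header` is a hypothetical helper, not part of bedtools, and plain `sort` on a made-up two-record GFF file stands in for a tool that would otherwise drop the header. Note that it moves every line starting with '#' to the top, not just the leading ones.

```shell
# Hypothetical helper (not part of bedtools): print every line that starts
# with '#' first, then feed the remaining records to the given command on stdin.
with_header() {
    input=$1
    shift
    grep '^#' "$input" || true      # header/comment lines; tolerate files with none
    grep -v '^#' "$input" | "$@"    # data records go to the wrapped command
}

# Demo with a tiny made-up GFF file; `sort` stands in for a header-dropping tool.
demo=$(mktemp)
printf '##gff-version 3\nchr1\tsrc\tgene\t200\t300\t.\t+\t.\tID=b\nchr1\tsrc\tgene\t100\t150\t.\t+\t.\tID=a\n' > "$demo"
with_header "$demo" sort -k4,4n
```

With bedtools itself the call might look like `with_header a.gff bedtools subtract -a stdin -b b.gff`, assuming the tool in question reads its -a input from stdin.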

arq5x avatar May 21 '14 18:05 arq5x

bedtools sort -header works perfectly! Thanks, Aaron. I was reading this documentation, which doesn't show the -header option.

I expected -header to be the default behaviour. Perhaps instead a -noheader option?

sjackman avatar May 21 '14 18:05 sjackman

I see your point, Shaun. The problem, however, is that such a change could break many existing pipelines that are built on the assumption that headers are not emitted by default. I think once we standardize the API, this would be worth revisiting with users on the mailing list to get feedback about the impact.
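To illustrate the concern, a downstream step can be made indifferent to the default by skipping '#' lines explicitly. This is a minimal sketch with a hypothetical `count_records` helper and made-up input, not something taken from an actual pipeline:

```shell
# Hypothetical downstream step: count data records, ignoring any '#' lines,
# so the result is the same whether or not the upstream tool emitted a header.
count_records() {
    grep -vc '^#'
}

# Same single record, once with a header line and once without.
printf '##gff-version 3\nchr1\t.\tgene\t1\t10\t.\t+\t.\tID=a\n' | count_records
printf 'chr1\t.\tgene\t1\t10\t.\t+\t.\tID=a\n' | count_records
```

Both calls report one record. A step that counted or parsed every line instead would give different results for the two inputs, which is the kind of breakage a changed default could cause.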

arq5x avatar May 21 '14 18:05 arq5x