bioawk
bioawk copied to clipboard
Make array of VCF info and GFF group field.
GTF file:
$ zcat ./test.gtf.gz
chr21 hg19_knownGene exon 33026871 33027740 0.000000 - . gene_id "uc002yoz.1"; transcript_id "uc002yoz.1";
chr21 hg19_knownGene exon 33030247 33030540 0.000000 - . gene_id "uc002yoz.1"; transcript_id "uc002yoz.1";
chr21 hg19_knownGene exon 33031710 33031813 0.000000 - . gene_id "uc002yoz.1"; transcript_id "uc002yoz.1";
chr21 hg19_knownGene start_codon 33032083 33032085 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene CDS 33032083 33032154 0.000000 + 0 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene exon 33031935 33032154 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene CDS 33036103 33036199 0.000000 + 0 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene exon 33036103 33036199 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene CDS 33038762 33038831 0.000000 + 2 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene exon 33038762 33038831 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene CDS 33039571 33039688 0.000000 + 1 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene exon 33039571 33039688 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene CDS 33040784 33040888 0.000000 + 0 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene stop_codon 33040889 33040891 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
chr21 hg19_knownGene exon 33040784 33041243 0.000000 + . gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
List some fields of GTF with bioawk:
$ ./bioawk -c gff '{ print $feature,$start,$group }' ./test.gtf.gz
exon 33026871 gene_id "uc002yoz.1"; transcript_id "uc002yoz.1";
exon 33030247 gene_id "uc002yoz.1"; transcript_id "uc002yoz.1";
exon 33031710 gene_id "uc002yoz.1"; transcript_id "uc002yoz.1";
start_codon 33032083 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
CDS 33032083 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
exon 33031935 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
CDS 33036103 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
exon 33036103 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
CDS 33038762 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
exon 33038762 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
CDS 33039571 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
exon 33039571 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
CDS 33040784 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
stop_codon 33040889 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
exon 33040784 gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";
It would be nice if for the GFF group field and the VCF info field, $group and $info, is an array which used each subfeature as key:
$ ./bioawk -c gff '{ print $feature,$start,$group[gene_id],$group[transcript_id] }' ./test.gtf.gz
exon 33026871 uc002yoz.1 uc002yoz.1
exon 33030247 uc002yoz.1 uc002yoz.1
exon 33031710 uc002yoz.1 uc002yoz.1
start_codon 33032083 uc002ypa.3 uc002ypa.3
CDS 33032083 uc002ypa.3 uc002ypa.3
exon 33031935 uc002ypa.3 uc002ypa.3
CDS 33036103 uc002ypa.3 uc002ypa.3
exon 33036103 uc002ypa.3 uc002ypa.3
CDS 33038762 uc002ypa.3 uc002ypa.3
exon 33038762 uc002ypa.3 uc002ypa.3
CDS 33039571 uc002ypa.3 uc002ypa.3
exon 33039571 uc002ypa.3 uc002ypa.3
CDS 33040784 uc002ypa.3 uc002ypa.3
stop_codon 33040889 uc002ypa.3 uc002ypa.3
exon 33040784 uc002ypa.3 uc002ypa.3
At the moment it is not supported and I get the following error:
./bioawk: can't assign to group; it's an array name.
source line number 1
This:
$ ./bioawk -c gff '{ print $feature,$start,$group[gene_id],$group[transcript_id] }' ./test.gtf.gz
should be:
$ ./bioawk -c gff '{ print $feature,$start,$group["gene_id"],$group["transcript_id"] }' ./test.gtf.gz
of course.