seqkit subseq enhancement - fasta ID and comment for bed, gtf and gff

Hi,

thanks so much for seqkit - it's one of the tools I use almost on a daily basis and I love it!

There is one feature though, that I feel is lacking a bit. I regularly use seqkit subseq to extract features based on annotations files. Ideally, I would like to have more control over the ID that is given to the extracted fasta sequences and possibly the comment as well. Usually, my annotations come in gff3 (which is not supported), so I convert to bed, and then I use a perl-one-liner to turn the fasta comment into the fasta ID. It works, but since I use it so often, and I guess others might too, it feels like it would be very useful if seqkit natively provided that functionality.

This issue has been raised before: #154, #89

# foo.fna
>foo
GTTTTGTTGACCAGACGATA

# foo.gff
foo     .       .       3       10      .       +       .       ID=foo_1;Name=gene-foo;product="bla bla bla";

# This is how most my calls look - it works but it feels a bit clunky :)
seqkit subseq --bed <(gff2bed foo.gff) foo.fna | perl -pe 's/>\S+\s/>/' 
[INFO] read BED file ...
Processing foo.gff
[INFO] 1 BED features loaded
[INFO] create FASTA index for foo.fna
>foo_1
TTTGTTGA

So what I would which for in terms of enhancements for seqkit would be:

gff3 support: gff3 seems so similar to gtf that it feels like adding support for it would not be a big issue, the parsing of tags would have to be modified slightly. But I think it would be very useful to many users.
ID tag: A flag like --id-tag could be added, similar to --gtf-tag. If set for gtf (/gff) it would tell which tag to use as ID instead of the default <seq_id:from-to> pattern. For bed, it could just accept a number indicating the column that should be used for IDs...
In a perfect world there would also be an option to add additional tags/columns into the comment section of the fasta header, i.e. something multiple --gtf-tags gene_id,transcript_id,product.

Obviously, those aren't crucial issues, but I think they would help to further improve seqkit API. Cheers Thomas

Jan 22 '21 09:01 thackl

Thank you Thomas!

Hmm, seqkit does not mean to implement comprehensive gene feature files manipulations that bedtools and bedops have already provided.

gff3 support: gff3 seems so similar to gtf that it feels like adding support for it would not be a big issue, the parsing of tags would have to be modified slightly. But I think it would be very useful to many users.

GFF3 is much more complicated, I think.

ID tag: A flag like --id-tag could be added, similar to --gtf-tag. If set for gtf (/gff) it would tell which tag to use as ID instead of the default <seq_id:from-to> pattern. For bed, it could just accept a number indicating the column that should be used for IDs...

Good point.

In a perfect world there would also be an option to add additional tags/columns into the comment section of the fasta header, i.e. something multiple --gtf-tags gene_id,transcript_id,product.

I don't know, may be try later.

Sorry, I'm a little bit busy these days. I may handle this when I'm free.

Jan 22 '21 11:01 shenwei356

Sorry, I'm a little bit busy these days. I may handle this when I'm free.

Sure, no worries! This is really just cherries on the cake. But just in case you pick it up at some point:

GFF3 is much more complicated [than GTF], I think.

Actually not at all. GFF3 derives from GTF and can store more complicated features, but when it comes to the 9th columns with the attributes, they are almost the same from a parsing perspective (they have different reserved keys but we don't care about that here)

# attributes in gtf (9th field)
gene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_source "havana";...
# Same attributes in gff3 (9th field)
gene_id=ENSG00000223972;gene_name=DDX11L1;gene_source=havana;...

The differences in gff3: key and value are separated by "=" instead of a space; values are not enclosed in "". And no extra space before ";". In Perl, for example, they could be read with one and the same regexp

perl -MData::Dumper -ne 'chomp(); 
  # captures:        (key)      (value)
  %tags = m/(?:^|;) ?(\w+)[ =]"?([^;"]+)/g;
print Dumper(\%tags)' gtf-gff3-tags.txt
$VAR1 = {
          'gene_source' => 'havana',
          'gene_name' => 'DDX11L1',
          'gene_id' => 'ENSG00000223972'
        };
$VAR1 = {
          'gene_source' => 'havana',
          'gene_name' => 'DDX11L1',
          'gene_id' => 'ENSG00000223972'
        };

Jan 23 '21 14:01 thackl

seqkit seqkit copied to clipboard

subseq enhancement - fasta ID and comment for bed, gtf and gff

seqkit
seqkit copied to clipboard