genometools
extractfeat: -matchdescstart should probably be set by default
From #592
extractfeat tries to pick reference sequences using the MD5 tag by default. To look up by id, the -matchdescstart option must be used.
I am not sure if it's common to refer to ref seqs by their MD5 in GFF files. I think mapping by seqid is more common. And the -matchdescstart option is not intuitive: even after reading the docs it's not clear that this is the option one is looking for. I resorted to trial and error to pick between -matchdesc, -usedesc, and -matchdescstart.
I suggest making -matchdescstart the default, or removing the option altogether if gt can intelligently decide whether to use MD5 or seqid (so as to not break the current behavior).
No, it's not common to use MD5 IDs in GFF3 files. But I think the tight coupling between annotations and sequences this produces prevents many errors further down the line. When I was using the loose coupling (the different methods to match annotations with sequences), it often led to very hard-to-find problems later on, because you have to use the correct option for every tool in the pipeline. If you miss it once, the results are wrong in a way that is very hard to debug. With MD5 IDs that's not possible anymore.
Therefore I think the best approach to bind annotations to sequences is to use gt id_to_md5 consciously once (with the correct options); from there on it's error-proof.
To integrate such GFF3 annotations with other tools one can convert them back with gt md5_to_id.
I agree that the options are somewhat confusing and might be improved. The reason it ended up that way is that GFF3 lacks a tight coupling between sequences and annotations, and I introduced different options to do the matching as they appeared in the wild. In my opinion there is no good way to do that short of extending GFF3 to introduce tight coupling (as I did with the MD5 sequence IDs).
What could be possible is to always look for MD5 hashes, regardless of which option (-matchdesc, -usedesc, or -matchdescstart) was chosen. This way one can pick a sensible default without impacting users with a pure MD5 workflow. Comments?
Like, first check whether the GFF seqid is the MD5 hash of one of the sequences in the FASTA file, and if so use that (default). If not, then check whether the GFF seqid exactly matches a seqid in the FASTA file (-matchdescstart). Right?
What if the GFF file has some seqids as MD5 and others normal? I can't imagine why anyone would do that - just for discussion. GT could decide between MD5 and -matchdescstart based on the first entry and apply that to all, so such a GFF would fail. Alternatively, the MD5 or -matchdescstart check can be done for each entry. What did you have in mind @satta?
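To make the per-entry variant concrete, here is a rough Python sketch of the idea, not gt's actual implementation. The `md5:<digest>[:<original id>]` tag layout, hashing the sequence string exactly as given, and the helper names `build_indexes` and `resolve` are all assumptions made for illustration.

```python
import hashlib

# Assumed tag layout: md5:<32 hex digits>[:<original id>]
MD5_PREFIX = "md5:"


def md5_of(seq):
    """Hex MD5 digest of a sequence string (hashed exactly as given)."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()


def build_indexes(fasta_records):
    """Build two lookup tables from (description, sequence) pairs."""
    by_md5 = {}
    by_id = {}
    for desc, seq in fasta_records:
        by_md5[md5_of(seq)] = seq
        # Key on the description up to the first whitespace.
        key = desc.split(None, 1)[0] if desc.strip() else ""
        by_id[key] = seq
    return by_md5, by_id


def resolve(seqid, by_md5, by_id):
    """Resolve one GFF seqid: try the MD5 tag first, then the plain id."""
    if seqid.startswith(MD5_PREFIX):
        digest = seqid[len(MD5_PREFIX):].split(":", 1)[0]
        if digest in by_md5:
            return by_md5[digest]
    # Fall back to an exact match on the start of the description.
    return by_id.get(seqid)
```

Because the decision is made inside `resolve` for every seqid, a file mixing both kinds of seqid would not need special handling; deciding once from the first entry would instead fix the strategy before the loop.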
That sounds like a good approach to me. I would prefer allowing both in one file, if that is feasible. Otherwise deciding in the beginning should be good enough.
Let's make sure that an MD5 tag is always tried first for each individual case. This basically incorporates the current default into all the other matching strategies, and we can make -matchdescstart the default. It is by far the fastest of all methods (as it uses a hash table to look up the sequence for a feature) and it makes sure there is a 1:1 mapping between seqid and sequence (unlike -matchdesc).
This approach would allow mixing MD5-tagged and non-MD5-tagged seqids in one file.
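The speed and 1:1 points can be illustrated with a small Python sketch. It is only a model of the two strategies, not gt code: the function names are made up, and rejecting duplicate keys is an assumption about how a 1:1 mapping could be enforced.

```python
def index_by_desc_start(descriptions):
    """Map each description's first word to its position (hash-table style)."""
    index = {}
    for i, desc in enumerate(descriptions):
        key = desc.split(None, 1)[0] if desc.strip() else ""
        if key in index:
            # Assumed policy: a repeated id would break the 1:1 mapping.
            raise ValueError(f"duplicate sequence id {key!r}")
        index[key] = i
    return index


def substring_matches(descriptions, seqid):
    """Model of a search anywhere in the description (linear scan)."""
    return [i for i, desc in enumerate(descriptions) if seqid in desc]
```

With descriptions such as "chr1 assembly v2" and "chr10 assembly v2", the dictionary lookup for "chr1" returns exactly one entry in constant time, while the substring scan visits every description and finds "chr1" inside both of them.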
Sounds reasonable. Who wants to tackle it?