readytowear
Clarification: GTDB mixed orientation warning only applies to full length refs?
Howdy!
I noticed the disclaimer on the GTDB data page:
Warning: files in this directory are experimental. Many of the reference sequences appear to be in mixed orientations, which currently are not handled well by q2-feature-classifier and may yield misleading results. Use at your own risk.
Which is of course a valid concern, but this only applies to the full-length refs, right? Since the V4 ones initially go through the extract-reads process, which corrects these mixed orientations (as per @nbokulich's note here)?
Or... does extract-reads' --p-read-orientation both only apply to the primer search, without actually correcting the orientation of the reads in the output?
Thanks!
howdy!
yes that's correct — extract reads will orient the sequences (as long as the primers hit the F/RC sequences).
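To make the "orient as long as the primers hit" point concrete, here is a toy sketch of the idea. It is not the q2-feature-classifier implementation: the function names are made up for illustration, and it assumes exact primer matches, whereas the real plugin matches primers approximately. The sequence is searched as given and as its reverse complement, and whichever orientation contains the primers is the one that gets written out.

```python
# Toy illustration only; extract_oriented is a hypothetical helper,
# not part of QIIME 2. Primers are matched exactly here for simplicity.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_oriented(seq, f_primer, r_primer):
    """Return the region between the primers in forward orientation.

    Tries the sequence as given, then its reverse complement.
    Returns None when the primers are found in neither orientation.
    """
    # On the forward strand the reverse primer appears reverse-complemented.
    r_primer_rc = reverse_complement(r_primer)
    for candidate in (seq, reverse_complement(seq)):
        start = candidate.find(f_primer)
        if start == -1:
            continue
        amplicon_start = start + len(f_primer)
        end = candidate.find(r_primer_rc, amplicon_start)
        if end == -1:
            continue
        return candidate[amplicon_start:end]
    return None


if __name__ == "__main__":
    f_primer = "ACGTAC"
    r_primer = "TGTCAA"  # reverse complement of TTGACA
    ref = "TTT" + f_primer + "GGGCCCAAAG" + "TTGACA" + "CC"
    # Both the reference and its flipped copy yield the same oriented read.
    print(extract_oriented(ref, f_primer, r_primer))
    print(extract_oriented(reverse_complement(ref), f_primer, r_primer))
```

So in this sketch a reference stored in the "wrong" orientation comes out identical to one stored in the "right" orientation, and a reference where the primers never hit is simply dropped.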
would you like to modify this warning to clarify? Personally I don't see much harm in keeping the "experimental" label (since to my knowledge we have not really tested the GTDB bespoke weights extensively), but it would be good to clarify.
Another future option (for the FL seqs) would be to use RESCRIPt to re-orient the reads.
Sounds good, added an extra line to this warning in a PR.
As for using RESCRIPt to fix the full-length reads, would that need another set of reference reads to align against? In that case, that may need some benchmarking to fine-tune the alignment parameters, right?
Another alternative approach that Ben suggested some time ago was to create a new database with all reads in both orientations. It would take twice as long but wouldn't need benchmarking and fine-tuning. Unless reads can be in reverse, reverse-complement, or some other combination.
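For anyone following along, a small sketch of what "both orientations" would mean in practice. This is only an illustration: the dict-of-sequences input and the "_rc" ID suffix are assumptions made for the example, not an existing convention. It also lists the four possible strand transformations, since the doubled database only covers forward and reverse complement.

```python
# Illustrative sketch; the "_rc" suffix and dict layout are hypothetical.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def orientations(seq):
    """Return the four possible strand transformations of a sequence."""
    complement = seq.translate(_COMPLEMENT)
    return {
        "forward": seq,
        "reverse": seq[::-1],
        "complement": complement,
        "reverse_complement": complement[::-1],
    }


def double_database(records):
    """Add a reverse-complemented copy of every record.

    records maps sequence ID to sequence. Plain reverse or plain
    complement entries would NOT be rescued by this approach.
    """
    doubled = {}
    for seq_id, seq in records.items():
        doubled[seq_id] = seq
        doubled[seq_id + "_rc"] = orientations(seq)["reverse_complement"]
    return doubled


if __name__ == "__main__":
    print(orientations("AACG"))
    print(double_database({"a": "AACG"}))
```

Note that the taxonomy file would need matching entries for the extra IDs, which this sketch does not handle.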
I agree, orienting in the same direction might need a little bit of testing to establish a working protocol, but one could use fairly loose %id and coverage settings to re-align against a small reference db of sequences in a known orientation. I am not sure that I would call it benchmarking per se.
A database in both orientations might actually require more benchmarking in my opinion than attempting to orient all in the same direction, since this could lead to changes in classifier performance.
Gotcha! I didn't realize that would change classifier performance. What do you reckon would be a good starting database and %id? Something like 65% GG at 65% coverage?
yeah that sounds reasonable... I think that %id is approx what deblur uses for pre-filtering reads, so maybe we can use that as precedent.
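As a rough sketch of the re-orientation idea discussed above: compare each sequence and its reverse complement against a small set of references in a known orientation, keep whichever orientation matches better, and set aside anything that matches neither. To keep the example self-contained it uses shared k-mer fraction as a stand-in for a real alignment, so the min_shared threshold is NOT the same thing as %id or coverage, and all names here are hypothetical. A real protocol would use an alignment-based search instead.

```python
# Sketch under stated assumptions: k-mer sharing replaces alignment,
# and orient/min_shared are hypothetical names, not a RESCRIPt API.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient(query, references, k=4, min_shared=0.65):
    """Orient query to match references of known orientation.

    Returns (sequence, status) where status is "forward",
    "reoriented", or "unmatched" (sequence is None in that case).
    """
    ref_kmers = [kmers(ref, k) for ref in references]

    def best_score(seq):
        # Fraction of the query's k-mers found in the best reference.
        q = kmers(seq, k)
        if not q or not ref_kmers:
            return 0.0
        return max(len(q & r) / len(q) for r in ref_kmers)

    flipped = reverse_complement(query)
    fwd_score = best_score(query)
    rev_score = best_score(flipped)
    if max(fwd_score, rev_score) < min_shared:
        return None, "unmatched"
    if fwd_score >= rev_score:
        return query, "forward"
    return flipped, "reoriented"


if __name__ == "__main__":
    refs = ["ACGGTTCAGCTAGGATCCATTGCA"]
    read = refs[0][2:20]
    print(orient(read, refs))
    print(orient(reverse_complement(read), refs))
    print(orient("TTTTTTTTTTTT", refs))
```

The loose threshold is the point here: the search only has to decide between two orientations, not assign taxonomy, so it can afford to be permissive.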