
fgrep (grep -F) option

Open eschen42 opened this issue 3 years ago • 1 comments

fgrep functionality (available with grep -F) allows searching for m fixed strings among n sequences in roughly O(n) time rather than O(n*m), because the Aho-Corasick algorithm matches all of the strings in a single pass over the input. For a concrete example, I have a fasta_to_tabular result (20,000 lines) that I want to search for many accession IDs (8,000); I might just as easily wish to search for a large number of arbitrary peptide sequences.
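As a minimal sketch of the use case described above, the following searches a small tab-separated file for a list of fixed strings read from a second file. The file names and their contents are hypothetical stand-ins, not the actual datasets; only the standard grep options -F (fixed strings) and -f (read patterns from a file) are used.

```shell
#!/bin/sh
# Hypothetical file names and contents, for illustration only.
workdir=$(mktemp -d)

# A tiny stand-in for a fasta_to_tabular result: ID <TAB> sequence.
printf 'P12345\tMKTAYIAK\nQ67890\tGAVLIPFW\nA11111\tMSTNPKPQ\n' > "$workdir/proteins.tabular"

# Fixed strings to search for, one per line.
printf 'P12345\nA11111\n' > "$workdir/accessions.txt"

# -F: treat every pattern as a fixed string, not a regular expression.
# -f: read the patterns from a file, one per line.
matches=$(grep -F -f "$workdir/accessions.txt" "$workdir/proteins.tabular")
printf '%s\n' "$matches"

rm -rf "$workdir"
```

Note that grep -F reports a line when a pattern occurs anywhere on it, so an accession ID would also match if the same characters happened to appear in the sequence column.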

So, my issue (or question) is the approach to take:

  • If it's not good to modify the "Search in textfiles (grep)" tool, is there another tool that is a good fit?
    • Historically, fgrep functionality was merged into grep;
    • this may make sense to the standards developers, but bioinformaticians may not immediately assume that a tool labeled "grep" can be used with fixed strings, even though a fixed string is technically a regular expression that matches exactly one sequence.
  • If it's good to modify the "Search in textfiles (grep)" tool, the change that seems logical to me is:
    • Add a fourth option to Type of regex, e.g., "list of fixed strings (fgrep)";
    • and, when that option is chosen, enable an input field for a file of fixed strings, e.g., "File of fixed strings (one per line)".
      • When a dataset collection or multiple datasets are specified, they would be concatenated into a single file of substrings before invoking grep -F.
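The concatenation step proposed above could be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the file names are hypothetical, and using awk rather than plain cat is a choice made here so that a pattern file lacking a trailing newline does not fuse its last line with the first line of the next file, and so that empty lines (which would match every line under grep -F) are dropped.

```shell
#!/bin/sh
# Hypothetical file names and contents, for illustration only.
workdir=$(mktemp -d)

# The dataset to search: ID <TAB> sequence.
printf 'P12345\tMKTAYIAK\nQ67890\tGAVLIPFW\nA11111\tMSTNPKPQ\n' > "$workdir/target.tabular"

# Two pattern datasets; the first deliberately lacks a trailing newline
# and the second contains an empty line.
printf 'P12345' > "$workdir/ids_a.txt"
printf 'PFW\n\n' > "$workdir/ids_b.txt"

# Merge into one pattern file: awk re-emits every record with a newline,
# and 'length > 0' skips empty lines.
awk 'length > 0' "$workdir/ids_a.txt" "$workdir/ids_b.txt" > "$workdir/patterns.txt"

# Search once for all fixed strings.
hits=$(grep -F -f "$workdir/patterns.txt" "$workdir/target.tabular")
printf '%s\n' "$hits"

rm -rf "$workdir"
```

Here the accession ID from the first file and the peptide substring from the second are found in a single grep invocation.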

@bgruening Would you suggest that I submit a PR for the "Search in textfiles (grep)" tool?

eschen42 · Jan 19 '22 13:01

@bgruening Would you suggest that I submit a PR for the "Search in textfiles (grep)" tool?

Yes, I think so :)

Thanks and sorry for my late reply.

bgruening · Feb 01 '22 12:02