galaxytools fgrep (grep -F) option

Open eschen42 opened this issue 3 years ago • 1 comments

fgrep functionality (available with grep -F) allows searching for m fixed strings among n sequences in roughly O(n) time, rather than the O(n*m) of searching for each string separately, by leveraging the Aho-Corasick algorithm. For a concrete example, I have a fasta_to_tabular result (20,000 lines) that I want to search for many accession IDs (8,000); I might just as easily wish to search for a large number of arbitrary peptide sequences.
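
As a minimal sketch of what this looks like on the command line, assuming a standard grep is installed: the file names (`table.tsv`, `ids.txt`), the accession IDs, and the sequences below are made up for illustration and are not taken from the real data described above.

```shell
# Hypothetical example data: a tiny stand-in for a fasta_to_tabular result
# (header <TAB> sequence) and a list of accession IDs, one per line.
workdir=$(mktemp -d)
printf 'sp|P12345|PROT_A\tMKTAYIAKQR\nsp|Q67890|PROT_B\tGSHMLEDPVA\nsp|A0A000|PROT_C\tMSDNELKQRA\n' > "$workdir/table.tsv"
printf 'P12345\nA0A000\n' > "$workdir/ids.txt"

# -F treats every pattern as a fixed string rather than a regular expression;
# -f reads the patterns from a file, one per line.
matches=$(grep -F -f "$workdir/ids.txt" "$workdir/table.tsv")
printf '%s\n' "$matches"
```

All patterns are searched for in a single pass over `table.tsv`, so there is no need to loop over the IDs and run grep once per ID.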

So, my issue (or question) is the approach to take:

If it's not good to modify the "Search in textfiles (grep)" tool, is there another tool that is a good fit?
- Historically, fgrep functionality was merged into grep;
- this may make sense to the standards developers, but bioinformaticians may not immediately assume that a tool labeled "grep" can be used with fixed strings, even though fixed strings are technically regular expressions that each match exactly one sequence.

If it's good to modify the "Search in textfiles (grep)" tool, the change that seems logical to me is:
- Add a fourth option to Type of regex, e.g., "list of fixed strings (fgrep)";
- and, when that option is chosen, enable an input field for a file of fixed strings, e.g., "File of fixed strings (one per line)".
  - When a dataset collection or multiple datasets are specified, they would be concatenated into a single file of substrings before invoking grep -F.
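
The concatenation step in the last bullet could be sketched roughly as follows. This is only an illustration under stated assumptions, not the tool's actual command: the file names and IDs are hypothetical, each pattern file is assumed to end with a newline, and dropping empty lines and duplicates is an extra precaution that the proposal above does not mention.

```shell
# Hypothetical inputs: two pattern files (as if from a dataset collection)
# and a small table to search. All names and IDs are made up.
workdir=$(mktemp -d)
printf 'P12345\nQ67890\n' > "$workdir/ids_part1.txt"
printf 'A0A000\n\n' > "$workdir/ids_part2.txt"   # note the trailing blank line
printf 'P12345\tMKTAYIAKQR\nQ67890\tGSHMLEDPVA\nA0A000\tMSDNELKQRA\nX99999\tMAAAKLLPRT\n' > "$workdir/table.tsv"

# Concatenate all pattern files into one. Empty lines are removed because an
# empty fixed string matches every line; sort -u removes duplicate patterns.
cat "$workdir"/ids_part*.txt | grep -v '^$' | sort -u > "$workdir/patterns.txt"

# Single grep -F invocation over the combined pattern list.
hits=$(grep -F -f "$workdir/patterns.txt" "$workdir/table.tsv")
printf '%s\n' "$hits"
```

Without the empty-line filter, the blank line in `ids_part2.txt` would cause every row of the table to be reported, so a wrapper would probably want some guard of this kind.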

@bgruening Would you suggest that I submit a PR for the "Search in textfiles (grep)" tool?

Jan 19 '22 13:01 eschen42

@bgruening Would you suggest that I submit a PR for the "Search in textfiles (grep)" tool?

Yes, I think so :)

Thanks and sorry for my late reply.

Feb 01 '22 12:02 bgruening
