galaxytools
fgrep (grep -F) option
fgrep functionality (available with grep -F) allows searching for multiple (m) fixed strings among n sequences in O(n) time rather than O(n*m) by leveraging the Aho-Corasick algorithm. For a concrete example, I have a fasta_to_tabular result (20,000 lines) that I want to search for many accession IDs (8,000); or, I might just as easily wish to search for a large number of arbitrary peptide sequences.
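To make the use case concrete, here is a minimal sketch of the command-line equivalent. The file names, accession IDs, and sequences are made up for illustration; only the grep options -F (treat patterns as fixed strings) and -f (read patterns from a file, one per line) are standard.

```shell
# Hypothetical scratch directory and file names, for illustration only.
workdir=$(mktemp -d)

# Stand-in for a fasta_to_tabular result: identifier <TAB> sequence.
printf 'ACC0001\tMKTAYIAKQRQI\n'  > "$workdir/proteins.tabular"
printf 'ACC0002\tMSDNGELEDKPP\n' >> "$workdir/proteins.tabular"
printf 'ACC0003\tMALLHSARVLSG\n' >> "$workdir/proteins.tabular"

# The fixed strings to search for, one per line.
printf 'ACC0001\nACC0003\n' > "$workdir/accessions.txt"

# -F: patterns are literal strings, not regular expressions.
# -f: read the patterns from a file instead of the command line.
matches=$(grep -F -f "$workdir/accessions.txt" "$workdir/proteins.tabular")
printf '%s\n' "$matches"
```

Note that this is substring matching, so a short ID would also match inside a longer one; the same command works unchanged if the pattern file holds peptide sequences instead of accession IDs.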
So, my issue (or question) is the approach to take:
- If it's not good to modify the "Search in textfiles (grep)" tool, is there another tool that is a good fit?
  - Historically, fgrep functionality was merged into grep; this may make sense to the standards developers, but bioinformaticians may not immediately assume that a tool labeled "grep" might be used with fixed strings, even though they are technically regular expressions matching one sequence.
- If it's good to modify the "Search in textfiles (grep)" tool, the change that seems logical to me is:
  - Add a fourth option to Type of regex, e.g., "list of fixed strings (fgrep)";
  - and, when that option is chosen, enable an input field for a file of fixed strings, e.g., "File of fixed strings (one per line)".
  - When a dataset collection or multiple datasets are specified, they would be concatenated into a single file of substrings before invoking grep -F.
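
A rough sketch of the concatenation step described above, as it might look on the command line. This is not the tool's actual wrapper code; the file names and data are hypothetical, and filtering out empty lines is my own assumption (an empty pattern would match every input line).

```shell
# Hypothetical scratch directory and file names, for illustration only.
workdir=$(mktemp -d)

# Two separate datasets of fixed strings, e.g. members of a collection.
printf 'ACC0001\n' > "$workdir/ids_part1.txt"
printf 'ACC0003\n' > "$workdir/ids_part2.txt"

# The text file to be searched.
printf 'ACC0001\tMKTAYIAK\nACC0002\tMSDNGELE\nACC0003\tMALLHSAR\n' \
  > "$workdir/input.tabular"

# Concatenate the pattern datasets into a single file, dropping empty
# lines (assumption: an empty pattern would match every line).
cat "$workdir"/ids_part*.txt | grep -v '^$' > "$workdir/all_patterns.txt"

# Search once with the combined list of fixed strings.
hits=$(grep -F -f "$workdir/all_patterns.txt" "$workdir/input.tabular")
printf '%s\n' "$hits"
```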
@bgruening Would you suggest that I submit a PR for the "Search in textfiles (grep)" tool?
Yes, I think so :)
Thanks and sorry for my late reply.