spectrum_utils
Speedup of peptide parsing and annotation
This PR implements 4 main things, all aimed at speeding up spectrum annotation workflows.
- A fast-pass for unmodified peptides during the parsing.
- The option for a simpler parsing grammar.
- LRU caching of the parser (read once per session, not once per parse of a ProForma sequence).
- The option to annotate spectra passing a list of proteoforms directly (instead of a sequence)
- This feature is critical for me, since I have a workflow that uses both the proteoforms directly and the annotated spectra. By itself it makes my workflow 2x faster.
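To illustrate the first and third points, here is a minimal sketch of the idea, not the actual code in this PR. `SlowParser`, `get_parser` and `parse` are hypothetical names, and the bracket-stripping logic is only a toy stand-in for the real ProForma grammar (it ignores terminal modifications, nested brackets, and so on). The point is just that the expensive parser gets built once and that unmodified peptides never touch it.

```python
import re
from functools import lru_cache

# Plain runs of the 20 standard amino acids: nothing to parse.
_UNMODIFIED = re.compile(r"[ACDEFGHIKLMNPQRSTVWY]+")


class SlowParser:
    """Hypothetical stand-in for a parser that is expensive to construct."""

    instances = 0  # counts constructions, to show the effect of caching

    def __init__(self):
        SlowParser.instances += 1

    def parse(self, proforma):
        # Toy logic: strip "[...]" groups and record which residue
        # (1-based) each group was attached to.
        residues, mods = [], []
        i = 0
        while i < len(proforma):
            if proforma[i] == "[":
                end = proforma.index("]", i)
                mods.append((len(residues), proforma[i + 1:end]))
                i = end + 1
            else:
                residues.append(proforma[i])
                i += 1
        return "".join(residues), mods


@lru_cache(maxsize=1)
def get_parser():
    # Built on first use, then reused for the rest of the session.
    return SlowParser()


def parse(proforma):
    # Fast path: an unmodified peptide is returned as-is, no parser needed.
    if _UNMODIFIED.fullmatch(proforma):
        return proforma, []
    return get_parser().parse(proforma)
```

With this layout, `parse("PEPTIDE")` returns immediately, and repeated calls on modified sequences share a single `SlowParser` instance.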
Benchmarks
Using some dummy peptide examples, the speedup I see in the parsing is:
With mods
- 29.51it/s -> (baseline), greedy loading, no fastpass
- 137.54it/s -> + unmod fastpass, cached full parser (4x improve)
- 168.48it/s -> + simple parser (1.22x improve, ~6x from baseline)
Without mods
- 34.18it/s -> (baseline), greedy loading, no fastpass
- 995089.92it/s -> + unmod fastpass, cached full parser (~30000x improve)
- 1081006.19it/s -> + simple parser (equivalent for practical purposes)
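For reference, the speedup factors can be recomputed from the it/s figures quoted above. This is just arithmetic on those numbers; the dictionary keys are labels made up for this snippet.

```python
# Throughput in iterations per second, copied from the benchmark figures.
rates = {
    "with_mods": {"baseline": 29.51, "fastpass_cached": 137.54, "simple": 168.48},
    "without_mods": {"baseline": 34.18, "fastpass_cached": 995089.92, "simple": 1081006.19},
}


def speedups(r):
    """Return step-by-step and overall speedup ratios for one benchmark."""
    return {
        "fastpass_cached_vs_baseline": r["fastpass_cached"] / r["baseline"],
        "simple_vs_fastpass_cached": r["simple"] / r["fastpass_cached"],
        "simple_vs_baseline": r["simple"] / r["baseline"],
    }


for name, r in rates.items():
    for step, ratio in speedups(r).items():
        print(f"{name:>12} {step:<28} {ratio:,.2f}x")
```

The ratios come out close to the rounded values quoted in the lists (the first step with mods is nearer 4.7x than 4x).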
On a heavy annotation workflow I have, these changes dropped the run time from 45 mins to 2.20 :P
LMK what you think! Best
btw the tests that involve reading from USI are also breaking on master on my local system.
@bittremieux added the suggestions, LMK what you think!