Speedup of peptide parsing and annotation

Open jspaezp opened this issue 10 months ago • 1 comments

This PR implements four main changes, all aimed at speeding up spectrum annotation workflows:

- A fast-pass for unmodified peptides during parsing.
- The option to use a simpler parsing grammar.
- LRU caching of the parser (read once per session, not once per parse of a ProForma sequence).
- The option to annotate spectra by passing a list of proteoforms directly (instead of a sequence).
  - This feature is critical for me, since I have a workflow that uses both the proteoforms directly and the annotated spectra. By itself, it makes my workflow 2x faster.

Benchmarks

Using some dummy peptide examples, the speedup I see in parsing is:

With mods

- 29.51it/s -> (baseline), greedy loading, no fastpass
- 137.54it/s -> + unmod fastpass, cached full parser (4x improve)
- 168.48it/s -> + simple parser (1.22x improve, ~6x from baseline)

Without mods

- 34.18it/s -> (baseline), greedy loading, no fastpass
- 995089.92it/s -> + unmod fastpass, cached full parser (~30000x improve)
- 1081006.19it/s -> + simple parser (equivalent for practical purposes)
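For anyone who wants to check the ratios, a small sketch that recomputes the step-by-step and cumulative speedups from the rates quoted above. The `speedups` helper and the `BENCHMARKS` dictionary are just illustrative names; only the it/s figures come from the benchmark.

```python
# Throughput in iterations per second, in the order the changes were applied.
BENCHMARKS = {
    "with mods": [29.51, 137.54, 168.48],
    "without mods": [34.18, 995089.92, 1081006.19],
}


def speedups(rates):
    """Return (step, cumulative) speedup for each rate after the baseline."""
    baseline = rates[0]
    return [(curr / prev, curr / baseline) for prev, curr in zip(rates, rates[1:])]


for label, rates in BENCHMARKS.items():
    for step, total in speedups(rates):
        print(f"{label}: {step:.2f}x vs previous step, {total:.2f}x vs baseline")
```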

On a heavy annotation workflow I have, these changes dropped the run time from 45 mins to 2.20 :P

LMK what you think! Best

Apr 02 '24 06:04 jspaezp

btw the tests that involve reading from USI are also breaking on master on my local system.

Apr 02 '24 06:04 jspaezp

@bittremieux added the suggestions, LMK what you think!

Apr 16 '24 20:04 jspaezp

spectrum_utils