searchgui Integration with Sage

Hi,

Is there any interest in integrating additional search engines? Sage is an open source (MIT licensed) search engine that utilizes the ion-indexing approach to run fast narrow & open searches (>5x faster than MSFragger). I would be happy to assist with getting it integrated.

Oct 22 '22 16:10 lazear

Hi Michael,

Yes! In fact, we have already talked about the option of adding Sage!

I will try to get the time to dig a bit deeper next week and get back to you when I have gotten an overview of how easy it would be to include Sage in SearchGUI (and PeptideShaker). But as far as I remember there did not seem to be any real issues.

Best regards, Harald

Oct 23 '22 08:10 hbarsnes

Hi again,

I had a closer look at the possibility of adding Sage to SearchGUI and as far as I can see there should not be any major showstoppers. I did however come across a couple of things that ought to be addressed before we can continue.

The first thing I noticed when trying to run Sage was that it seemed to lack any indication of progress? My search failed quite quickly (more on that below), but for longer searches it would be very nice to have some sort of indication of progress? Nothing fancy, just some text showing what the tool is currently doing and ideally a percent indicator. (It could of course be that such information is already displayed after the step where my test came to an end, but would still be great to have some indication as soon as the tool starts as well?)

When going through the json parameters I could not find any way of specifying the enzyme? Seems like trypsin is assumed? Would it be possible to add support for additional enzymes?

Similarly, it would be great to be able to define terminal modifications at specific amino acids? Maybe this is already possible? I did not find any examples in the json comments though?

For the decoy_prefix option, would it be possible to also support suffix annotation? As our headers generally look like this: target: >sp|Q66K14|TBC9B_HUMAN TBC1 domain family member 9B OS=Homo sapiens OX=9606 GN=TBC1D9B PE=1 SV=3 decoy: >sp|Q66K14_REVERSED|TBC9B_HUMAN TBC1 domain family member 9B OS=Homo sapiens OX=9606 GN=TBC1D9B PE=1 SV=3_REVERSED

You write that "MS2 search results will be stored as a Percolator-compatible (<mzML path>.sage.pin) file". Could you share an example of a results file from Sage such that we can verify whether it contains the information we need to process it in PeptideShaker?

Finally, below is the error I got when trying to run the following command line: sage -f uniprot-human-reviewed-trypsin-june-2021_concatenated_target_decoy.fasta -o output config-simple.json qExactive01819.mzML

[2022-10-26T19:20:43Z INFO  sage] generated 248225993 fragments in 44516ms
thread 'thread 'thread '<unnamed><unnamed><unnamed>' panicked at '' panicked at '' panicked at 'missing precursor information for MS2 scan, please check input files!missing precursor information for MS2 scan, please check input files!missing precursor information for MS2 scan, please check input files!', ', ', src\spectrum.rssrc\spectrum.rssrc\spectrum.rs:::220220220::1414:

I used a standard FASTA file from UniProt and the mzML file is based on a conversion from our standard example raw file. You can find all of the input files here: https://www.dropbox.com/s/9qg5dhsxdf0pqrz/Sage%20test%20files.zip?dl=0

Hopefully all of these points should not be too hard to fix/implement?

Best regards, Harald

Oct 26 '22 19:10 hbarsnes

Hi Harald,

Thanks for the comments.

A progress indicator could be added... I had one initially, but using a progress bar actually imposes significant overhead on search times. setting the SAGE_LOG=trace environment variable will increase the verbosity of outputs
Trypsin is currently assumed... Rust's regex library does not support PCRE (ie lookaround, backtracking, since these can lead to denial-of-service). I need to figure out a way to specify additional enzymes, likely hard coding them.
N-terminal modifications are currently supported, but not for specific amino acids. I think this should be straightforward
I need to update the decoy_prefix option. Right now, I am using this option to ignore decoy entries from the fasta file. Internal decoy generation is performed by reversing the internal amino acids for each peptide. This enables the picked-peptide approach described in https://www.biorxiv.org/content/10.1101/2022.05.11.491571v2 I will eventually re-enable user-provided decoy's, but this will require disabling Sage's built-in peptide FDR (I perform PSM rescoring using a linear discriminant machine learning model)
I have provided a snippet from an output file: https://gist.github.com/lazear/b52a53c720b425022554bbc490f8c1ee - columns can be added to/changed as needed.
Thanks for providing the files, I will check them out. I have been using both MSConvert & ThermoRawFileParser without issue so far.

I am definitely interested in getting integrated with PeptideShaker - more so than SearchGUI. Seems like less blocking issues too, while I figure out how/if to tackle some of the points raised above.

Best, Mike

Oct 26 '22 21:10 lazear

Hi Mike,

A progress indicator could be added... I had one initially, but using a progress bar actually imposes significant overhead on search times. setting the SAGE_LOG=trace environment variable will increase the verbosity of outputs

If a progress indicator takes up too much resources, maybe just adding short messages indicating which step that is currently being executed?

I need to update the decoy_prefix option. Right now, I am using this option to ignore decoy entries from the fasta file. Internal decoy generation is performed by reversing the internal amino acids for each peptide.

This is not really a showstopper from our end as we have to recalculate the FDR in PeptideShaker anyway after merging the results from all of the individual search engines. So the most important for us here is actually being able to tell Sage to treat all of the provided sequences in the same way, i.e. disregarding the decoy tags. But I guess that is what happens if simply leaving out the decoy_prefix option?

I have provided a snippet from an output file: https://gist.github.com/lazear/b52a53c720b425022554bbc490f8c1ee - columns can be added to/changed as needed.

Thanks! Seems to have most of what we need. The only thing I could spot missing after a quick check is a way to link back to the original spectrum. For this we need the name of the spectrum file and ideally the spectrum title or index. Perhaps the scan number can be used for the latter, but I would need to test.

We also prefer having more than one hit per spectrum. I see that you do not recommend this in the comments for your json example file, but if we do increase the report_psms option to a number higher than one, how would the ranks be shown in the output file?

Finally, it would be great to have a way to indicate that this file came from Sage and ideally from which version as well?

Best regards, Harald

Oct 27 '22 10:10 hbarsnes

Sounds good - just to clarify, these are the things that need to be resolved for PeptideShaker integration:

Add in option to override Sage's decoy generation/turn off PSM-rescoring
Add spectrum file name & index or scan number (I think scan number should work. I would prefer to avoid having to keep track of a spectrum title string for perf/memory usage reasons)
Add a column indicating rank of PSM for that scan
Indication of Sage version - Do you suggest just an additional column, or maybe have it encoded in the file name? I am also open to producing other output file formats

Anything else I missed?

Oct 27 '22 15:10 lazear

Yes, I think that should cover it. Not sure about the scan number. I guess we'll find out when trying to implement it. But maybe you can also keep the index from the mzML headers (instead of the titles/ids)?

To be honest I'm not really sure what the scan number represents, as it does not seem to be a separate tag and rather part of the spectrum id, i.e. title, tag?

Example: <spectrum id="controllerType=0 controllerNumber=1 scan=9155" index="9154" defaultArrayLength="85">

Regarding how to indicate the Sage version I'm not sure what the best option would be. Is there a way of adding comments at the start of a pin file, i.e. before the actual data starts? Or would this result in the pin file not being readable by other tools? Adding a new column may be the easiest option as it can simply be ignored by other tools. In any case, it is more important to be able to tell that the pin file came from Sage than the exact version number. At the moment there is nothing in the current file that can be used for this I think? Would it make sense to add Sage to any of the current column headers? If this information cannot easily be added, we should still be able to parse the file, we just would not be able to tell which algorithm that was used to generate it and will simply have to call it "generic pin file format" or something like that in PeptideShaker.

Oct 28 '22 11:10 hbarsnes

I pulled out the scan number because it was required for percolator/mokapot input ("ScanNr"), and is also reported by Proteome Discoverer, MSFragger, Comet, etc - this enables easy comparison of results from multiple search engines (using Pandas, etc). I do extract it from the spectrum id field - it seems like generally spectrum index is the 0-index based scan #, but I'm sure this can't be relied upon 😄

I can change the discriminant_score column to something like sage_discriminant_score

Oct 28 '22 23:10 lazear

Here are the definitions from the mzML specification document:

id - xs:string (pattern: \S+=\S+( \S+=\S+)*) - The native identifier for a spectrum. For unmerged native spectra or spectra from older open file formats, the format of the identifier is defined in the PSI-MS CV and referred to in the mzML header. External documents may use this identifier together with the mzML filename or accession to reference a particular spectrum.

index - xs:nonNegativeInteger - The zero-based, consecutive index of the spectrum in the SpectrumList.

To me it seems like the index ought to be included in the pin file as it is the only guaranteed way of referring back to a given spectrum in the mzML file. Actually, it would also be great to have the name of the mzML file in a separate column so that we do not have to guess this from the name of the pin file?

I can change the discriminant_score column to something like sage_discriminant_score

That would work. :)

Oct 29 '22 10:10 hbarsnes

That is a fair point - clearly, though, scan numbers can be used to uniquely refer to spectra, given that they are the preferred indexType in the USI format (https://raw.githubusercontent.com/HUPO-PSI/usi/master/Specification/USI_SpecDoc_1.0.draft10_2021-05-20.pdf). They do also note this complication with some instruments (appears to be just AB SCIEX?) not reporting scan numbers.

Currently, Sage will default to using the spectrum index + 1 as the internal "scan number" if the scan field isn't available. This doesn't seem like correct behavior, but acts as a work around if someone uses a SCIEX instrument, I suppose (which I don't think has happened yet...).

I will look into the performance cost of keeping track of the entire spectrum ID string, if it's acceptable, then I will implement that!

Oct 29 '22 14:10 lazear

[X] Use spectrum title/ID instead of extracing scan number
[X] Rename discriminant_score to sage_discriminant_score
[X] Add a column indicating the rank of a PSM for a given spectrum

Here are a few lines from an example file - let me know if this works! https://gist.github.com/lazear/822a25d612960efa772db402364ffdd4

Oct 31 '22 00:10 lazear

Also, regarding the above error:

Finally, below is the error I got when trying to run the following command line: sage -f uniprot-human-reviewed-trypsin-june-2021_concatenated_target_decoy.fasta -o output config-simple.json qExactive01819.mzML [2022-10-26T19:20:43Z INFO sage] generated 248225993 fragments in 44516ms thread 'thread 'thread '' panicked at '' panicked at '' panicked at 'missing precursor information for MS2 scan, please check input files!missing precursor information for MS2 scan, please check input files!missing precursor information for MS2 scan, please check input files!', ', ', src\spectrum.rssrc\spectrum.rssrc\spectrum.rs:::220220220::1414: I used a standard FASTA file from UniProt and the mzML file is based on a conversion from our standard example raw file. You can find all of the input files here: https://www.dropbox.com/s/9qg5dhsxdf0pqrz/Sage%20test%20files.zip?dl=0

This is occurring because there is no <cvParam cvRef="MS" accession="MS:1000041" name="charge state" value="3"/> term under the <selectedIon> parent term... Not sure how best to handle precursors without charge state information, I suppose search charge states 2-5 and merge the PSMs lists.

Oct 31 '22 00:10 lazear

Here are a few lines from an example file - let me know if this works!

Thanks! I will look at it as soon as possible. I'll keep you posted.

Not sure how best to handle precursors without charge state information, I suppose search charge states 2-5 and merge the PSMs lists.

Yes, I think that makes sense and is in line with what other search engines seem to do.

Oct 31 '22 09:10 hbarsnes

Here are a few lines from an example file - let me know if this works!

Yes, it does! I've now implemented a parser for Percolator pin files in PeptideShaker and it is able to load your short example file!

I've deployed a beta version that you can test here: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip

I have only tested it on the small example file yet though, so there are probably still undetected issues. Note also that the pin file parser currently only supports your specific pin file format and requires that the following column headers are used: peptide, charge, posterior_error, rank, scannr and filename (in any order). (A column named sage_discriminant_score is also needed to detect that the file was generated by Sage, but this is not mandatory.)

Please try the beta version on a complete Sage results file and let me know how it goes?

Oct 31 '22 23:10 hbarsnes

I got it to load! Couple issues/comment:

I don't see any peptides with C[+57.0214] modifications - setting it as a variable mod in PeptideShaker seems to fix this though
~~Decoys aren't loading (I think we knew this would happen though, since Sage is currently digesting/reversing in memory).~~
Sage can also ID peptides with a -1 isotope error, this doesn't seem supported
Is it possible to display a discriminant_score column or otherwise? Each engine reports some score, but I don't see it (I might just be missing it though)

N-term mods

Looks like there is a failure with N-terminally modified peptides (out of bounds) here: https://github.com/compomics/compomics-utilities/blob/0a75023c9ba7b4bb6676460b70005ec0ccee88a8/src/main/java/com/compomics/util/experiment/io/identification/idfilereaders/PercolatorInputfileReader.java#L259

I'm using ProForma suggested notation for modifications now, so will have something like:

[+229.1629]-VC[+57.0215]PAPC[+57.0215]EGAC[+57.0215]TLGIIEDPVGIK[+229.1629]

java.lang.StringIndexOutOfBoundsException: Index -1 out of bounds for length 35
	at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:55)
	at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:52)
	at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:213)
	at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:210)
	at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:98)
	at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
	at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
	at java.base/java.lang.String.checkIndex(String.java:4557)
	at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:46)
	at java.base/java.lang.String.charAt(String.java:1515)
	at com.compomics.util.experiment.io.identification.idfilereaders.PercolatorInputfileReader.getAllSpectrumMatches(PercolatorInputfileReader.java:259)
	at eu.isas.peptideshaker.fileimport.FileImporter.importPsms(FileImporter.java:466)
	at eu.isas.peptideshaker.fileimport.FileImporter.importFiles(FileImporter.java:277)
	at eu.isas.peptideshaker.PeptideShaker.importFiles(PeptideShaker.java:219)
	at eu.isas.peptideshaker.gui.NewDialog$20.run(NewDialog.java:736)
	at java.base/java.lang.Thread.run(Thread.java:833)`
``

Nov 01 '22 00:11 lazear

I've added support for PeptideShaker/SearchGUI style database, and an option to disable internal decoy generation so we are good on that front - looks absolutely great, numbers are matching what I expect, etc!

One additional bug/quirk: I noticed that when reporting multiple PSMs for the same spectrum (e.g. report_psms: 2), PeptideShaker reports about half as many passing peptides, proteins, and PSMs. (20k PSMs -> 9k PSMs) compared to reporting a single PSM per spectra. I assumed the rank column is 1 = best, 2 = second best, etc?

Nov 01 '22 03:11 lazear

I don't see any peptides with C[+57.0214] modifications - setting it as a variable mod in PeptideShaker seems to fix this though

By default fixed modification are not shown. You can change this in the View > Fixed Modifications menu option.

Sage can also ID peptides with a -1 isotope error, this doesn't seem supported

You would have to set this in the search settings before loading the data. You'll find it in the same dialog has where you set the PTMs. See the Isotopes option in the lower right-hand corner. The default is 0-1.

Is it possible to display a discriminant_score column or otherwise? Each engine reports some score, but I don't see it (I might just be missing it though)

The only place that you will see the search engine scores will be in the Spectrum IDs tab and only if also checking the View > Scores menu option. However, at the moment I'm using the posterior_error value and not the sage_discriminant_score. Do you want me to change this?

Looks like there is a failure with N-terminally modified peptides

Indeed, as I did not have any such PTMs in the short example file. Should not be too hard to add the required support though. I will look into it a deploy a new version.

One additional bug/quirk: I noticed that when reporting multiple PSMs for the same spectrum (e.g. report_psms: 2), PeptideShaker reports about half as many passing peptides, proteins, and PSMs. (20k PSMs -> 9k PSMs) compared to reporting a single PSM per spectra. I assumed the rank column is 1 = best, 2 = second best, etc?

I will need to look deeper into this later. But yes, rank 1 is best, rank 2 is second best etc. Note that we only retain the best one though (except for in the Spectrum IDs tab). The additional hits are mainly there for better comparison between search engines, or so that we can overrule the search engine if we don't agree with the best hit or or if more than one hit have the same score.

Nov 01 '22 13:11 hbarsnes

I've just redeployed the beta version which should now also support terminal PTMs: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip

Nov 01 '22 13:11 hbarsnes

For cysteine mod, the peptides aren't making it through the import filter:

Tue Nov 01 07:28:03 PDT 2022 Importing PSMs from search.pin Tue Nov 01 07:28:05 PDT 2022 1118 identified spectra (2.7%) did not present a valid peptide. Tue Nov 01 07:28:05 PDT 2022 9916 of the best scoring peptides were excluded by the import filters: Tue Nov 01 07:28:05 PDT 2022 - 11.3% peptide mapping to both target and decoy. Tue Nov 01 07:28:05 PDT 2022 - 88.7% unrecognized modifications.

I suppose PS is looking for peptides of the form "PEPCA" not "PEPC[+57.0214]A" with fixed modifications?

posterior_error

No, this should be better/more stable across runs, thanks!

I'll test out with N-term mods in a bit.

Nov 01 '22 14:11 lazear

I suppose PS is looking for peptides of the form "PEPCA" not "PEPC[+57.0214]A" with fixed modifications?

We're just parsing all of the modifications provided in the peptide sequences. These should later be mapped to the correct modifications, i.e. comparing to the modifications set in the search settings. But perhaps something is going wrong there. Can you share the pin file so that I can test?

No, this should be better/more stable across runs, thanks!

So stick with posterior_error?

Nov 01 '22 14:11 hbarsnes

We're just parsing all of the modifications provided in the peptide sequences.

Seems like we currently treat all modifications in pin files as variable. Hence, I need to detect the fixed ones and treat them differently. I'll see what I can do and deploy a new version.

Nov 01 '22 14:11 hbarsnes

Here is a link to the PeptideShaker project zip file - can use this mzml file for additional testing too: https://www.dropbox.com/s/kp057sc4n9t92mj/b.psdb?dl=0

And yes, I should've been more clear - stick with posterior_error!

Nov 01 '22 14:11 lazear

Here is a link to the PeptideShaker project zip file

Seems to only contain the psdb file? Better if you could share the pin file and the search settings used (I already have the mzml file)? That way I can try to load the data myself and hopefully fix the fixed modification issue.

Nov 01 '22 15:11 hbarsnes

I've just redeployed the beta version which should now also support fixed PTMs: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip

Nov 01 '22 15:11 hbarsnes

Here is a link to the PeptideShaker project zip file

Seems to only contain the psdb file? Better if you could share the pin file and the search settings used (I already have the mzml file)? That way I can try to load the data myself and hopefully fix the fixed modification issue.

Whoops... copied the wrong file. Here is a link for the zipped results files/parameters: https://www.dropbox.com/s/9lyf2qoobwhe7t3/Archive.zip?dl=0

Nov 01 '22 16:11 lazear

Whoops... copied the wrong file. Here is a link for the zipped results files/parameters: https://www.dropbox.com/s/9lyf2qoobwhe7t3/Archive.zip?dl=0

Thanks! As far as I can tell it all seems to be working fine now? At least I'm able to load this particular file without any issues. :)

Nov 01 '22 17:11 hbarsnes

It seems to be working for me as well! 😄 I've been able to load TMT-modified, as well as unmodified datasets, and also load in Sage results with other engine results produced by SearchGUI.

Thanks for getting this done so rapidly!

I wonder if it would be better for me to produce something like a ".sage.tsv" file instead of a generic percolator PIN file, given that most search engines won't be writing a "posterior_error"/PEP column until after percolator has been run (I don't think I'm strictly adhering to the PIN format either... it's kind of morphed over time). Not a big issue either way. I think the ability to load in these simply formatted files will be beneficial for enabling comparisons with other engines as well, like Proteome Discoverer.

One more thing: I noticed that the Score column always displays "100" regardless of the underlying raw score, perhaps this is a consequence of the posterior_error value always being negative?

Nov 01 '22 19:11 lazear

It seems to be working for me as well! 😄 I've been able to load TMT-modified, as well as unmodified datasets, and also load in Sage results with other engine results produced by SearchGUI.

Great! Then all that remains is to also add Sage to SearchGUI? :)

I wonder if it would be better for me to produce something like a ".sage.tsv" file instead of a generic percolator PIN file

Probably, as our current parser is rather Sage-specific anyway and not really a generic pin file parser.

One more thing: I noticed that the Score column always displays "100" regardless of the underlying raw score, perhaps this is a consequence of the posterior_error value always being negative?

Not sure, could be. If you look at the validation plots in the Validation tab there seems to be a very clear separation between the good scoring target hits and the bad scoring decoy hits. At least at the protein and peptide level. The PSM level is however quite strange with only hits scoring 100 for both target and decoy. Which is very different from the PeptideShaker example dataset. Not really sure why? However, the Score/Confidence column is often, but not always, displaying 100 for the PeptideShaker example dataset as well. I think I will have to forward this question to my co-developer Marc Vaudel as this is more his field of expertise. @mvaudel Any comments?

Nov 01 '22 20:11 hbarsnes

Great! Then all that remains is to also add Sage to SearchGUI? :)

I think so - I will get started on the remaining issues. I think additional enzyme support is probably the biggest blocker?

Probably, as our current parser is rather Sage-specific anyway and not really a generic pin file parser.

I will do this then. I noticed this line of code: https://github.com/compomics/compomics-utilities/blob/d5e88a13d298d9b159b37a93ba1e04124693c730/src/main/java/com/compomics/util/experiment/io/identification/idfilereaders/PercolatorInputfileReader.java#L280

This is, I believe, incorrect behavior. There should be a check to see if charAt(i-2) is a dash (indicating a C-terminal mod... not supported yet). For instance, the peptide [+304.2071]-PEPTIDMK[+304.2071] would parse as M having the +304.2071 mod (assuming that was a variable mod). This should be distinct from [+304.2071]-PEPTIDMK[+304.2071]-[+304.2071] (C-terminal modification)

This hasn't bitten me yet, and I don't have any test files with such cases yet, but I imagine it will happen at some point.

Not sure, could be. If you look at the validation plots in the Validation tab there seems to be a very clear separation between the good scoring target hits and the bad scoring decoy hits. At least at the protein and peptide level. The PSM level is however quite strange with only hits scoring 100 for both target and decoy.

Yeah, that's exactly what I noticed as well

Nov 01 '22 21:11 lazear

Wow this is all very impressive, both in quality and speed.

@mvaudel Any comments?

If my memory is correct (this is 10+ years old code) the score column on the GUI requires the score to distribute like a PEP and displays 100*(1-PEP), i.e. a good score should be close to zero and then the column displays 100, and a bad score should be close to 1 and then the column displays 0.

Nov 01 '22 21:11 mvaudel

I think additional enzyme support is probably the biggest blocker?

Yes, I think so. Would also be great to get the precursors without charge state information issue addressed? (As this would allow us to also search our (old) standard example dataset.)

I'm now fully booked with other things until the end of this week. I'll move on to the SearchGUI support over the weekend.

There should be a check to see if charAt(i-2) is a dash (indicating a C-terminal mod... not supported yet).

Yes, you are correct. This was indeed a bug. I made a slightly different fix that seems to work. Redeployed at the same link: https://genesis.ugent.be/maven2/eu/isas/peptideshaker/PeptideShaker/2.2.18-beta/PeptideShaker-2.2.18-beta.zip.

Nov 01 '22 23:11 hbarsnes

searchgui searchgui copied to clipboard

Integration with Sage

N-term mods

searchgui
searchgui copied to clipboard