FragPipe
Philosopher 4.1.1 generates empty files due to the incompatibility of the fasta file
Dear developer team, I am trying to use the fasta database generated in this paper: https://www.nature.com/articles/s41587-021-01021-3. In brief, the new fasta file has an additional ~320K proteins derived from different RNA sequencing data. v.16 of MSFragger handled these data very nicely. Unfortunately, I can't replicate the analysis in v.17.1, and I don't know why. The job finished without any errors, but the peptide/protein lists are empty. For validation I am using PeptideProphet (for an unspecific search). I have put the output of the analysis here: https://www.dropbox.com/sh/ciq36i6shg79d6z/AAA1l-4QX1t5ZjJjXtbPmp91a?dl=0 Thank you again, as always, for your great program and support.
Everything looks good except that there are no entries in the tsv files.
Felipe @prvst , can you take a look? They said that FragPipe 16, which implies Philosopher 4.0.0, worked well.
Thanks,
Fengchao
@fazeliniah are you running Philosopher v4.1.1?
I am using Philosopher version 4.1.0
I just tried the v.4.1.1 and got the same issue.
@fazeliniah your issue is somewhat related to a different situation reported a few weeks ago by a different person. Because someone was searching a database containing the same protein with slightly different headers, I had to include the protein description in the method that fetches information from the database annotation. The reason you see a problem with your search is that PeptideProphet also parses the protein description and replaces some characters with spaces. I included the same rule in Philosopher, and the update will be available in the upcoming release that I'm planning for next week.
INFO[16:00:38] 1+ Charge profile decoy=29 target=265
INFO[16:00:38] 2+ Charge profile decoy=146 target=3776
INFO[16:00:38] 3+ Charge profile decoy=114 target=3031
INFO[16:00:38] 4+ Charge profile decoy=25 target=628
INFO[16:00:38] 5+ Charge profile decoy=0 target=0
INFO[16:00:38] 6+ Charge profile decoy=0 target=0
INFO[16:00:38] Database search results ions=6159 peptides=5252 psms=8014
INFO[16:00:38] Converged to 1.00 % FDR with 6546 PSMs decoy=66 threshold=0.7128 total=6612
INFO[16:00:38] Converged to 1.00 % FDR with 4081 Peptides decoy=41 threshold=0.8093 total=4122
INFO[16:00:38] Converged to 0.99 % FDR with 4927 Ions decoy=49 threshold=0.7409 total=4976
INFO[16:00:39] Protein inference results decoy=297 target=3657
INFO[16:00:39] Converged to 1.04 % FDR with 1819 Proteins decoy=19 threshold=0.9706 total=1838
INFO[16:00:39] Applying sequential FDR estimation ions=4733 peptides=3975 psms=6298
INFO[16:00:39] Converged to 0.39 % FDR with 6273 PSMs decoy=25 threshold=0.7132 total=6298
INFO[16:00:39] Converged to 0.48 % FDR with 3956 Peptides decoy=19 threshold=0.7132 total=3975
INFO[16:00:39] Converged to 0.40 % FDR with 4714 Ions decoy=19 threshold=0.7132 total=4733
INFO[16:00:40] Post processing identifications
INFO[16:00:43] Assigning protein identifications to layers
INFO[16:00:46] Processing protein inference
INFO[16:02:26] Synchronizing PSMs and proteins
INFO[16:02:26] Total report numbers after FDR filtering, and post-processing ions=4613 peptides=3869 proteins=1742 psms=6150
Thanks for reporting this.
Added to v4.1.2
Hi Felipe @prvst ,
The interact-*.pep.xml from Percolator won't have such a replacement. Will your changes break the Percolator-related workflows?
BTW, what kinds of characters does PeptideProphet replace?
Best,
Fengchao
Sorry, I forgot about Percolator. If the parsing rules are different, then yes, it will break the logic. PeptideProphet replaces the pipe character ( | ) with a space. This is the only one I'm aware of at this moment; I don't know if the same thing happens with other special characters.
Thanks for the info. I think we can do the same for Percolator. Let me see if I can find the code in PeptideProphet to get all of the characters to be replaced.
Best,
Fengchao
Hi Fengchao and team, Hope you all had a wonderful holiday. I am just wondering if there is an update for this issue. Thanks
I guess you need to check with Felipe @prvst about the fixed Philosopher.
Best,
Fengchao
Please refer to my previous reply. PeptideProphet is replacing some special characters, like the pipe, with a space. You might want to avoid them, or use a standard format.
@fcyu You mentioned above that you would look through the PeptideProphet source code for the special characters that are replaced. Did you make any progress on that?
Hi Felipe @prvst ,
Yes, please check the code here: https://sourceforge.net/p/sashimi/code/HEAD/tree/trunk/trans_proteomic_pipeline/src/Common/util.cpp#l533. The XMLEscape(const string& s) function is used by RefreshParser.cpp: https://sourceforge.net/p/sashimi/code/HEAD/tree/trunk/trans_proteomic_pipeline/src/Parsers/RefreshParser/RefreshParser.cpp#l1660
However, I could not find any code replacing | with a space. Can you confirm that it is replaced?
BTW, I think it might not be a good idea using the protein description as part of the ID. There are tools modifying or truncating the protein description in different ways in writing the result. You will not be able to map proteins back to the fasta file.
Best,
Fengchao
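To illustrate the point about not using the description as part of the ID, here is a minimal Python sketch that keys the lookup on the accession only (the header text before the first whitespace). The headers and the helper names `protein_id` and `index_fasta_headers` are hypothetical and are not taken from Philosopher or MSFragger.

```python
def protein_id(header: str) -> str:
    """Return the accession: the header text before the first whitespace, without '>'."""
    return header.lstrip(">").split(None, 1)[0]


def index_fasta_headers(headers):
    """Map accession -> full original header line."""
    return {protein_id(h): h for h in headers}


# Hypothetical FASTA header with a pipe inside the description.
fasta_headers = [">PROT0001 hypothetical protein|isoform 1"]
index = index_fasta_headers(fasta_headers)

# A downstream tool rewrote the description (pipe replaced by a space),
# but the accession is untouched, so the lookup still succeeds.
reported = "PROT0001 hypothetical protein isoform 1"
print(protein_id(reported) in index)
```

The trade-off is that entries sharing one accession but differing in their descriptions would collide under this scheme, which is the situation that motivated including the description in the first place.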
However, I could not find any code replacing | with space, can you confirm that it is replaced?
Yes, the description is modified
OK, actually, PeptideProphet does not replace |, but ProteinProphet does.
I will read the ProteinProphet code then.
Best,
Fengchao
It is here https://sourceforge.net/p/sashimi/code/HEAD/tree/trunk/trans_proteomic_pipeline/src/Validation/ProteinProphet/ProteinProphet.cpp#l7256
Thanks @guoci , it does have more rules than replacing | with a space. I can add them to MSFragger so that downstream tools will not need to make any changes.
Best,
Fengchao
Hi @fazeliniah ,
Can you re-analyze your data using this MSFragger (https://www.dropbox.com/s/xggvogvbqq7nmhf/MSFragger-3.5-rc8.zip?dl=0)? It cleans the protein description according to the rules used by ProteinProphet, which will avoid triggering the Philosopher bug.
Best,
Fengchao
Sorry that I forgot one more thing.
With this change in MSFragger, we don't need to change Percolator and other tools, because the protein descriptions have already been cleaned up at the very beginning (ProteinProphet won't change the protein descriptions anymore).
But Philosopher still needs to apply the same clean-up rules when loading the fasta file; otherwise, it will not be able to map the proteins in the pep.xml back to the fasta file.
Felipe @prvst , can you make the changes according to the cleanUpProteinDescription function pointed out by Guo Ci, and send the fixed Philosopher?
Thanks,
Fengchao
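As a rough illustration of why both sides need the same clean-up, the Python sketch below applies one shared rule before building the lookup key. The `clean_description` function here is a hypothetical stand-in that only replaces the pipe with a space; the actual ProteinProphet rules contain more than this and are not reproduced here. The headers are made up.

```python
def clean_description(description: str) -> str:
    """Hypothetical stand-in for the clean-up rules: replace '|' with a space.

    Assumption: the real ProteinProphet rules include more than this one replacement.
    """
    return description.replace("|", " ")


def split_header(header: str):
    """Split a FASTA header into (accession, description)."""
    parts = header.lstrip(">").split(None, 1)
    accession = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    return accession, description


def lookup_key(header: str) -> str:
    """Build an annotation key from the accession plus the cleaned description."""
    accession, description = split_header(header)
    return f"{accession} {clean_description(description)}".strip()


fasta_header = ">PROT0001 hypothetical protein|isoform 1"
# Protein string as it might appear in a pep.xml after upstream cleaning.
pepxml_protein = "PROT0001 hypothetical protein isoform 1"

# Comparing the raw header text fails because only one side was cleaned.
print(fasta_header.lstrip(">") == pepxml_protein)
# Applying the same rule to both sides makes the keys agree.
print(lookup_key(fasta_header) == lookup_key(pepxml_protein))
```

Cleaning is idempotent in this sketch, so it does not matter whether the incoming string was already cleaned by an earlier tool.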
Hi Fengchao, I tested MSFragger 3.4 and 3.5, and they both work nicely with our HLA peptidome project. The issue was related to our RNA-derived fasta database: the presence of new characters in the header (e.g. +, -, *, ~) and some duplicate sequences were the main problems. Thank you again for all your help.
I observed the same issue with a standard search using the GenCode database. This needs to be fixed for the next release.
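For anyone who wants to check a custom database for the two problems mentioned above before searching, here is a small Python sketch. It flags headers containing the characters named in this thread and groups headers that share an identical sequence. The character set is illustrative rather than a complete list of what any tool rewrites, and the example entries are made up.

```python
from collections import defaultdict

# Characters named in this thread; illustrative, not an exhaustive rule set.
FLAGGED = set("+-*~")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def audit(lines):
    """Return (headers with flagged characters, groups of headers sharing a sequence)."""
    flagged, by_sequence = [], defaultdict(list)
    for header, sequence in read_fasta(lines):
        if FLAGGED & set(header):
            flagged.append(header)
        by_sequence[sequence].append(header)
    duplicates = [hs for hs in by_sequence.values() if len(hs) > 1]
    return flagged, duplicates


# Hypothetical entries for demonstration only.
example = """>PROT0001 variant+1
PEPTIDEK
>PROT0002 plain header
PEPTIDEK
>PROT0003 other
MKLV
""".splitlines()

flagged, duplicates = audit(example)
print(flagged)
print(duplicates)
```

To run it on a real file, pass an open file handle to `audit` instead of the example list.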