Philosopher 4.1.1 generates empty files due to the incompatibility of the fasta file

Open fazeliniah opened this issue 2 years ago • 20 comments

Dear developer team, I am trying to use the FASTA database generated in this paper: https://www.nature.com/articles/s41587-021-01021-3. In brief, the new FASTA file has an additional ~320K proteins derived from different RNA sequencing data. v.16 of MSFragger handled these data very nicely. Unfortunately, I can't replicate the analysis in v.17.1 and I don't know why. The job finished without any errors, but the peptide and protein lists are empty. For validation I am using PeptideProphet (for a nonspecific search). I have put the output of the analysis here: https://www.dropbox.com/sh/ciq36i6shg79d6z/AAA1l-4QX1t5ZjJjXtbPmp91a?dl=0 Thank you again, as always, for your great program and support.

fazeliniah avatar Nov 23 '21 17:11 fazeliniah

Everything looks good except that there are no entries in the tsv files.

Felipe @prvst , can you take a look? They said that FragPipe 16, which implies Philosopher 4.0.0, worked well.

Thanks,

Fengchao

fcyu avatar Nov 23 '21 17:11 fcyu

@fazeliniah are you running Philosopher v4.1.1?

prvst avatar Nov 25 '21 21:11 prvst

I am using Philosopher version 4.1.0

fazeliniah avatar Nov 29 '21 16:11 fazeliniah

I just tried the v.4.1.1 and got the same issue.

fazeliniah avatar Nov 29 '21 19:11 fazeliniah

@fazeliniah your issue is somewhat related to a different situation reported a few weeks ago by a different person. Because someone was searching a database containing the same protein with slightly different headers, I had to include the protein description in the method that fetches information from the database annotation. The reason you see a problem with your search is that PeptideProphet also parses the protein description and replaces some characters with spaces. I included the same rule in Philosopher, and the update will be available in the upcoming release that I'm planning for next week.

INFO[16:00:38] 1+ Charge profile                             decoy=29 target=265
INFO[16:00:38] 2+ Charge profile                             decoy=146 target=3776
INFO[16:00:38] 3+ Charge profile                             decoy=114 target=3031
INFO[16:00:38] 4+ Charge profile                             decoy=25 target=628
INFO[16:00:38] 5+ Charge profile                             decoy=0 target=0
INFO[16:00:38] 6+ Charge profile                             decoy=0 target=0
INFO[16:00:38] Database search results                       ions=6159 peptides=5252 psms=8014
INFO[16:00:38] Converged to 1.00 % FDR with 6546 PSMs        decoy=66 threshold=0.7128 total=6612
INFO[16:00:38] Converged to 1.00 % FDR with 4081 Peptides    decoy=41 threshold=0.8093 total=4122
INFO[16:00:38] Converged to 0.99 % FDR with 4927 Ions        decoy=49 threshold=0.7409 total=4976
INFO[16:00:39] Protein inference results                     decoy=297 target=3657
INFO[16:00:39] Converged to 1.04 % FDR with 1819 Proteins    decoy=19 threshold=0.9706 total=1838
INFO[16:00:39] Applying sequential FDR estimation            ions=4733 peptides=3975 psms=6298
INFO[16:00:39] Converged to 0.39 % FDR with 6273 PSMs        decoy=25 threshold=0.7132 total=6298
INFO[16:00:39] Converged to 0.48 % FDR with 3956 Peptides    decoy=19 threshold=0.7132 total=3975
INFO[16:00:39] Converged to 0.40 % FDR with 4714 Ions        decoy=19 threshold=0.7132 total=4733
INFO[16:00:40] Post processing identifications              
INFO[16:00:43] Assigning protein identifications to layers  
INFO[16:00:46] Processing protein inference                 
INFO[16:02:26] Synchronizing PSMs and proteins              
INFO[16:02:26] Total report numbers after FDR filtering, and post-processing  ions=4613 peptides=3869 proteins=1742 psms=6150

Thanks for reporting this.

Added to v4.1.2

prvst avatar Dec 03 '21 21:12 prvst

Hi Felipe @prvst ,

The interact-*.pep.xml from Percolator won't have such replacement. Will your changes break the Percolator related workflows?

BTW, what kind of characters does PeptideProphet replace?

Best,

Fengchao

fcyu avatar Dec 03 '21 21:12 fcyu

Sorry, I forgot about Percolator. If the parsing rules are different, then yes, it will break the logic. PeptideProphet replaces the pipe character ( | ) with a space. This is the only one I'm aware of at the moment; I don't know if the same thing happens with other special characters.

prvst avatar Dec 03 '21 21:12 prvst
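
To make the mismatch described above concrete, here is a minimal Python sketch. It is not Philosopher's code: it assumes only the one rule mentioned so far in this thread (a pipe replaced with a space), and the helper `normalize_description`, the accession, and the description text are all hypothetical.

```python
def normalize_description(description: str) -> str:
    """Apply the single rule mentioned in this thread: '|' becomes a space.

    Hypothetical helper; the real tools may apply further rules.
    """
    return description.replace("|", " ")


# Hypothetical FASTA entry: (accession, description) -> sequence.
fasta_entries = {("ENSP0001", "novel transcript|frame 2"): "MKTAYIAK"}

# What a results file would report if the pipe was replaced on write.
reported_key = ("ENSP0001", "novel transcript frame 2")

# An exact lookup on (accession, description) fails because the text differs.
exact_hit = reported_key in fasta_entries

# Applying the same rule to the FASTA side restores the match.
normalized_entries = {
    (acc, normalize_description(desc)): seq
    for (acc, desc), seq in fasta_entries.items()
}
normalized_hit = reported_key in normalized_entries

print(exact_hit, normalized_hit)  # prints: False True
```

The point of the sketch is only that both sides of the lookup need the same normalization; if one tool rewrites the description and the other does not, every lookup can miss and the report ends up empty even though the run finishes without errors.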

Thanks for the info. I think we can do the same for Percolator. Let me see if I can find the code in PeptideProphet to get all of the characters to be replaced.

Best,

Fengchao

fcyu avatar Dec 03 '21 21:12 fcyu

Hi Fengchao and team, Hope you all had a wonderful holiday. I am just wondering if there is an update for this issue. Thanks

fazeliniah avatar Jan 03 '22 17:01 fazeliniah

I guess you need to check with Felipe @prvst about the fixed Philosopher.

Best,

Fengchao

fcyu avatar Jan 03 '22 21:01 fcyu

Please refer to my previous reply. PeptideProphet replaces some special characters, like the pipe, with a space. You might want to avoid them, or use a standard format.

@fcyu You mentioned above that you would look at the PeptideProphet source code to find the special characters that are replaced. Did you make any progress on that?

prvst avatar Jan 03 '22 21:01 prvst

Hi Felipe @prvst ,

Yes, please check the code here https://sourceforge.net/p/sashimi/code/HEAD/tree/trunk/trans_proteomic_pipeline/src/Common/util.cpp#l533. The XMLEscape(const string& s) function is used by the RefreshParser.cpp: https://sourceforge.net/p/sashimi/code/HEAD/tree/trunk/trans_proteomic_pipeline/src/Parsers/RefreshParser/RefreshParser.cpp#l1660

However, I could not find any code replacing | with space, can you confirm that it is replaced?

BTW, I think it might not be a good idea to use the protein description as part of the ID. There are tools that modify or truncate the protein description in different ways when writing the results, so you will not be able to map proteins back to the fasta file.

Best,

Fengchao

fcyu avatar Jan 03 '22 21:01 fcyu
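
As a rough illustration of the point about not using the description as part of the ID, the Python sketch below keys a lookup on the accession only, taken here as the first whitespace-delimited token of the header line. This is a common FASTA convention, not a statement about how Philosopher or MSFragger parse headers; the function name `accession` and both sample headers are hypothetical.

```python
def accession(header: str) -> str:
    """Return the first whitespace-delimited token of a FASTA header line."""
    parts = header.lstrip(">").split(maxsplit=1)
    return parts[0] if parts else ""


# Hypothetical header as it appears in the FASTA file.
original = ">sp|P12345|TEST_HUMAN Example protein OS=Homo sapiens"

# The same entry after some tool has truncated the description.
truncated = ">sp|P12345|TEST_HUMAN Example prot"

# The full header text no longer matches, but the accession still does.
print(original == truncated)                        # prints: False
print(accession(original) == accession(truncated))  # prints: True
```

This only helps when accessions are unique in the database and are themselves left untouched by the tools that write the results; the case mentioned earlier in the thread, where the same protein appears under slightly different headers, would need separate handling.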

However, I could not find any code replacing | with space, can you confirm that it is replaced?

Yes, the description is modified

prvst avatar Jan 03 '22 21:01 prvst

OK, actually, PeptideProphet does not replace |:

image

But ProteinProphet does:

image

I will read the ProteinProphet code then.

Best,

Fengchao

fcyu avatar Jan 03 '22 21:01 fcyu

It is here https://sourceforge.net/p/sashimi/code/HEAD/tree/trunk/trans_proteomic_pipeline/src/Validation/ProteinProphet/ProteinProphet.cpp#l7256

guoci avatar Jan 03 '22 21:01 guoci

Thanks @guoci , it does have more rules than replacing | with a space. I can add them to MSFragger so that downstream tools will not need to make any changes.

Best,

Fengchao

fcyu avatar Jan 03 '22 22:01 fcyu

Hi @fazeliniah ,

Can you re-analyze your data using this MSFragger (https://www.dropbox.com/s/xggvogvbqq7nmhf/MSFragger-3.5-rc8.zip?dl=0)? It will clean the protein description according to the rules used by ProteinProphet, which should avoid triggering Philosopher's bug.

Best,

Fengchao

fcyu avatar Jan 03 '22 22:01 fcyu

Sorry that I forgot one more thing.

With this change in MSFragger, we don't need to change Percolator and other tools because the protein descriptions have already been cleaned up at the very beginning (ProteinProphet won't change the protein descriptions anymore).

But Philosopher still needs to apply the same cleanup rules when loading the fasta file; otherwise, it will not be able to map the proteins in pep.xml back to the fasta file.

Felipe @prvst , can you make the changes according to the cleanUpProteinDescription function pointed out by Guo Ci, and send the fixed Philosopher?

Thanks,

Fengchao

fcyu avatar Jan 03 '22 23:01 fcyu

Hi Fengchao, I tested MSFragger 3.4 and 3.5 and they both work nicely with our HLA peptidome project. The issue was related to our RNA-derived fasta database. The presence of new characters in the header (e.g. +, -, *, ~) and some duplicate sequences were the main issues. Thank you again for all your help.

fazeliniah avatar Jan 21 '22 14:01 fazeliniah
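
For anyone who wants to check a custom database for the two things reported above before searching, here is a small Python sketch. It is an illustrative pre-check, not part of FragPipe: the flagged character set is just the four examples given in the comment above (+, -, *, ~), and the names `FLAGGED_CHARS`, `read_fasta`, `audit_fasta` and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: flag only the characters listed in the comment above.
FLAGGED_CHARS = set("+-*~")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def audit_fasta(
    lines: Iterable[str],
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Return (headers with flagged characters, sequences shared by several headers)."""
    flagged: Dict[str, List[str]] = {}
    by_sequence: Dict[str, List[str]] = defaultdict(list)
    for header, sequence in read_fasta(lines):
        found = sorted(set(header) & FLAGGED_CHARS)
        if found:
            flagged[header] = found
        by_sequence[sequence].append(header)
    duplicates = {s: h for s, h in by_sequence.items() if len(h) > 1}
    return flagged, duplicates


# Hypothetical records standing in for an RNA-derived database.
sample = """>tx1 gene+strand
MKTAYIAK
>tx2 plain header
MKTAYIAK
>tx3 variant~1
MSTN
PQR
""".splitlines()

flagged, duplicates = audit_fasta(sample)
print(flagged)
print(duplicates)
```

For a real file, the lines could be supplied by opening the FASTA file and passing the file object to `audit_fasta`. Whether a flagged header actually causes trouble depends on the tool versions in use, so treat the output as a list of entries to inspect, not as a list of errors.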

I observed the same issue with a standard search using the GENCODE database. We need to fix this for the next release.

anesvi avatar Jan 24 '22 19:01 anesvi