FragPipe icon indicating copy to clipboard operation
FragPipe copied to clipboard

Update percolator to pepxml rewriting

Open chhh opened this issue 2 years ago • 8 comments

I would greatly appreciate if you merged this change, it would make maintaining my fork a lot easier.

  • The main thing it does is add some structure to the spaghetti code in PercolatorOutputToPepXML.
  • Adds stox pepxml parsing and a test for what the rewriting actually does.

chhh avatar Dec 07 '22 23:12 chhh

@chhh , there are a lot of changes. I need more time to review and test them. Will revisit this pull request when I have time later.

Thanks,

Fengchao

fcyu avatar Dec 08 '22 01:12 fcyu

@fcyu sure, the main changes are in PercolatorOutputToPepXML.percolatorToPepXML() method. The logic is kept the same, but everything is split into smaller functions.

There's a "test" PercolatorOutputToPepXMLTest with which you can run the function to try it out. Comment out Files.deleteIfExists(path) and you will get an interact- file in resources/percolator-to-pepxml/fragpipe-search-16_PXD022287_msfragger/interact-Human-Protein-Training_Trypsin.pep.xml

chhh avatar Dec 08 '22 02:12 chhh

@guoci @fcyu Is there any hope that you will merge this as a Christmas present?

chhh avatar Dec 20 '22 18:12 chhh

Hi @chhh ,

I started to review the code but got interrupted by other things. I could merge it before next year, but I think we can do it better since you are almost re-writing the whole module.

I always think reading the pepXML file to a class and writing the modified file using SAX is better than manipulating the strings in an ad-hoc way. I left some comments in https://github.com/Nesvilab/FragPipe/blob/3e0189ecb8478ed8ce614bfd8bbcdee98ea05b5c/MSFragger-GUI/src/com/dmtavt/fragpipe/tools/percolator/PercolatorOutputToPepXML.java#L146 and https://github.com/Nesvilab/FragPipe/blob/3e0189ecb8478ed8ce614bfd8bbcdee98ea05b5c/MSFragger-GUI/src/com/dmtavt/fragpipe/tools/percolator/PercolatorOutputToPepXML.java#L218

Batmass-io has the module to read pepXML file, so I think we could use that. I discussed with Guo Ci but no one had the time to do it. Do you think it would be a better idea, especially when you want to make the pepxml rewriting robust and support other flavor?

Best,

Fengchao

fcyu avatar Dec 20 '22 18:12 fcyu

  1. Goal number one is to make it at least a little more readable and less error prone. Look at current percolatorToPepxml() vs the proposed percolatorToPepxml() Instead of one huge inline function 230 lines long which is impossible to really understand what it's doing the new version is only 70 lines despite being able to handle 2 different search engines.

This conversion clearly splits this process into smaller function each related to reading information from files or writing to files. I suggest to first verify that the results of this conversion are correct. And merge it as is.

  1. I wanted to remove all the string matching for xml stuff, but this was too much to do in one go. I did start though, some parameters from pepxml files are now being parsed using a stox parser, look at StoxParserPepxml (should have been Stax, it's a typo) and its usage in PercolatorOutputToPepXML.

I wouldn't use JAXB here because it has to parse the whole file into memory and is relatively slow. Also very memory intensive, the in-memory representation generated by JAXB parser is often larger than the original file on disk (because of all the lists it creates internally). And combined interact files can actually be really huge in some cases, so I'd suggest to only resort to stax parsing.

  1. In the new code the whole "rewriting" part is contained in a single function createInteractFile(), actually just in one 80-line block inside it. So all of that "use sax/stax/jaxb etc to rewrite file" is just about updating these 80 lines to use stax instead of reading files line by line. But it still requires step 1 to be done.

chhh avatar Dec 20 '22 19:12 chhh

Hi @chhh ,

Thank you for the explanation and effort. I have briefly reviewed the code. I think the changes can be classified into:

  1. Variable renaming and minor refactor
  2. Adding Comet support
  3. Code refactor to make it more robust (not so sure) and easy to maintain.

I think for change type 1, need to undo them to make the other changes easy to track. For change type 2, also need to undo since FragPipe don't support Comet. Leave those code in FragPipe would confuse others, including me ;). I also don't think I have the bandwidth to support both search engines in the future.

After reverting type 1 and 2, I will review the code and merge them if they pass the tests.

Merry Christmas,

Fengchao

fcyu avatar Dec 24 '22 04:12 fcyu

I recently got an error that seems related to this:

Cannot find output_report_topN parameter from .... pepXML Process 'Percolator: Convert to pepxml' finished, exit code: 1 Process returned non-zero exit code, stopping.

This was after many previous files successfully ran through percolator, making me think it is a bug.

log file attached- log_2024-10-14_15-08-30.txt

asalt avatar Oct 14 '24 21:10 asalt

Hi @asalt, I don't think your error is related to this pull request because it hasn't been merged.

Please do not send questions or issues to pull requests. Please re-submit it to https://github.com/Nesvilab/FragPipe/issues

Best,

Fengchao

fcyu avatar Oct 14 '24 23:10 fcyu