FlashLFQ icon indicating copy to clipboard operation
FlashLFQ copied to clipboard

Support for Percolator output files

Open wsnoble opened this issue 3 years ago • 77 comments

It would be great if Percolator PSM-level tab-delimited output files could be supported by FlashLFQ. I looked into this, and there are two major and several minor challenges associated with making this happen.

The first major issue is that Percolator does not include the mzML file name, but only an integer file index, in the output file. This is obviously problematic, and it's something we can look into fixing on the Percolator end of things.

The second major issue is that Percolator does not include retention time. This is harder for us to fix, because this information is also not included in the outputs of many common search engines. It seems like, if you have the scan numbers in the Percolator output, it should be feasible for FlashLFQ to grab the RT from the mzML file. Is this doable?

The other minor issues are that Percolator does not have a "Base Sequence" column and that Percolator uses comma-delimited protein ID lists, rather than semi-colon delimited lists. These latter ones, along with differences in column naming, should be easy to handle.

Here is a tiny sample Percolator output file. sample.txt

wsnoble avatar Dec 22 '21 23:12 wsnoble

Thanks for the request Bill. Students are evaporating inverse to the proximity of the holiday. But we'll engage on this and see how we can accommodate you.

trishorts avatar Dec 23 '21 15:12 trishorts

Ah, I thought "Scan" and "Retention Time" were two different column IDs, but it turns out it's a single column called "Scan Retention Time." Hence, my suggestion above does not make sense. Is there any way you could have an option to provide scan numbers instead of RTs?

wsnoble avatar Dec 27 '21 19:12 wsnoble

Sorry for the slow response. Just back from holiday.

I'd love to say that we'd accommodate you but I don't see an easy way to do it. All of the peak finding, etc. is based on time and time windows. With scan number, there is no straightforward way to convert. The time between MS1 scans is variable based on the number of MSn scans between, and that changes throughout the run. So, there is not a good way to look w/in a scan range. And the time for MS1s and MSns also varies, so no simply calculation can be performed. We'd need a time for each kind of scan. The easiest fix for us would be for you to add an additional column of retention time in minutes that percolator otherwise ignores. I'm happy to entertain other options if you can suggest some.

trishorts avatar Jan 03 '22 15:01 trishorts

I think my suggestion perhaps was not clear. I totally agree that you need to get the RT associated with each scan. My suggestion was to enable a mode in which the input file specifies scan numbers and then the RT is read from the given mzML file.

For example, if you are using pyteomics to parse an mzml file, you can get the RT for each scan as follows:

from pyteomics.mzml import MzML

def get_rts(mzml_filename):
    "Extract retention times from an mzml file."

    rts = {} # Key = scan number, value = retention time
    mzml = MzML(mzml_filename)
    for scan in mzml:
        rts[scan["index"]] = scan["scanList"]["scan"][0]["scan start time"]
    sys.stderr.write(f"Read {len(rts)} RTs from {mzml_filename}.\n")

    return(rts)

Requiring Percolator to report the RT is problematic because Percolator does not typically have access to the original mzML file. So we would have to require that every search engine report RT in its output file format, and that Percolator parse and report these RTs in its output. It seems much easier, since FlashLFQ receives the mzML file as input, for it to parse the RTs directly out of the given mzML files. Does this make sense?

wsnoble avatar Jan 03 '22 21:01 wsnoble

Yes. I will make an attempt. What are your plans for including filenames with file path in the percolator output? currently, there is only an integer. would you require users to make the substitution?

trishorts avatar Jan 04 '22 14:01 trishorts

I am hoping we can add this functionality to Percolator directly. I'm waiting to hear from Lukas about this:

https://github.com/percolator/percolator/issues/322

wsnoble avatar Jan 04 '22 17:01 wsnoble

I saw that issue cuz I am following the percolator issues. Glad you raised it. I'm already working on the coding updates to flashlfq that will allow retention time recovery from scan number.

can you confirm that "spectrum neutral mass" is the experimental neutral mass and "peptide mass" is the theoretical neutral mass?

I think I have that right but two minutes on google didn't eliminate my doubts.

trishorts avatar Jan 04 '22 17:01 trishorts

Yes, that's right.

"The computed mass of the spectrum precursor at the given charge state. This is equal to the precursor m/z minus the mass of a proton (1.00727646677 Da), all multiplied by the charge."

"The mass of the peptide sequence, computed as the sum of the amino acid masses plus the mass of water (18.010564684 Da or 18.0153 Da, depending on whether we are using monoisotopic or average mass)."

http://crux.ms/file-formats/txt-format.html

wsnoble avatar Jan 04 '22 18:01 wsnoble

can you provide me with one raw/mzml file and its companion output so that I can test my code? I have one remaining concern about whats in the sequence column. we need to be able to parse that and compute peptide mass. The trick will be how ptms are annotated.

trishorts avatar Jan 05 '22 15:01 trishorts

Here you go! Lemme know if you need more info. percolator.target.psms.txt

(The raw file is too big to be attached -- will try to get it to you some other way.)

wsnoble avatar Jan 06 '22 18:01 wsnoble

thanks. i can make a box folder or you can. please invite [email protected]

trishorts avatar Jan 06 '22 18:01 trishorts

Here is a link: https://drive.google.com/file/d/1OTLCgkqcsA2ij8H8edY08-DZG1KPOwxb/view?usp=sharing

wsnoble avatar Jan 06 '22 18:01 wsnoble

ptms annotated by mass shift sub-optimal.......[15.99] [79.97] we'll see what i can do

trishorts avatar Jan 06 '22 18:01 trishorts

species?

trishorts avatar Jan 06 '22 18:01 trishorts

Human. It's from here:

http://www.ncbi.nlm.nih.gov/pubmed?term=26951766

ftp://massive.ucsd.edu/MSV000079437/raw/

wsnoble avatar Jan 06 '22 18:01 wsnoble

By "suboptimal" do you mean that it should be four digits of precision?

wsnoble avatar Jan 06 '22 22:01 wsnoble

no. like a name of a specific mod where we could have a specific chemical formula.

the two listed I can easily guess. but many mods from search engines are much more obscure and everyone likes to do things a little differently. The "field" lacks a convention, so it can be hard on downstream processing.

In MetaMorpheus, we specify the mod by full name and residue after the amino acid involved and we have a library that is used to look up the masses. not important for flashlfq, but we also include diagnostic ions and so forth in the library that are used for annotation. And, we have access to all the typical mod databases (unimod, uniprot,...).

Point is that with a nominal mass it is impossible to produce a chemical formula and without a chemical formula you can really make a great isotope envelope to match to the various MS1 and MS2 spectra. Sorta left w/ an averagine. I gotta see how I can deal w/ this.

trishorts avatar Jan 06 '22 22:01 trishorts

OK, I finally got around to trying this out. I hacked a Percolator output file to swap out filenames for file indices based on the log file. But I'm having trouble getting it to work, presumably because of some issue around absolute versus relative pathnames. Also, I'm getting an error message saying that it can't read the first line, but I don't know what's wrong with it. Can you help figure this out?

Here is the command and output:

+ flashlfq --idt percolator.target.psms.fixed.txt --rep /net/noble/vol2/user/kiahales/projects/plasmo-rdeep/data/ms2
Opening PSM file percolator.target.psms.fixed.txt
Problem reading line 1 of the identification file; Could not interpret PSM header labels from file: percolator.target.psms.fixed.txt
No peptide IDs for the specified spectra files were found! Check to make sure the spectra file names match between the ID file and the spectra files

percolator.target.psms.fixed.txt.gz

wsnoble avatar Apr 29 '22 17:04 wsnoble

i'll look at it.

trishorts avatar Apr 29 '22 17:04 trishorts

can you get me one or two of the mzmls that i can use for trouble shooting?

trishorts avatar Apr 29 '22 17:04 trishorts

It looks like the mzMLs were actually gzipped. I thought that might be the source of the problem, so I unzipped them but alas, still no luck (i.e., same error messages).

Here is a sample mzML:

https://drive.google.com/file/d/1hcndRYhhcWXctnjxVV4_tskatrYUicLz/view?usp=sharing

wsnoble avatar Apr 29 '22 20:04 wsnoble

Thanks

trishorts avatar Apr 29 '22 20:04 trishorts

not sure if this is the big problem but the sample file you gave me has an empty scan that is throwing an error: image

it is scan 7192 image

trishorts avatar May 02 '22 13:05 trishorts

The second problem I see is that one column header is fileidx instead of file_idx. Maybe the the column header changed in the output? I can make a pull request to take either if that is something that needs to be done. Please let me know on that.

trishorts avatar May 02 '22 14:05 trishorts

The third problem i see is in the protein id column. We need to be able to assign a protein to each psm. The conventional format that you showed me previously is for example "sp|P23588|IF4B_HUMAN" where "|" is the separator and there is an accession followed by a gene name. We need both pieces of info. In you example, the column has some other value that is not parsable. Your output is pasted below. Could probably deal w/o the gene name but we can't go back and forth between formats. The protein is necessary for grouping the intensities. So that has to be clean. image

trishorts avatar May 02 '22 14:05 trishorts

The one open question I have is still if we can read the paths that you have. I generally don't put the full path in the file_idx column. But then again I don't run on the commandline. i'll see if I can look into this.

trishorts avatar May 02 '22 14:05 trishorts

Manually deleting empty scans is problematic b/c of missing scan numbers. We do see dead scans on fast instruments. We like to throw errors where there is a problem w/ the input so that users know there is a problem w/ the input. But maybe it makes sense to just report the dropped scans and try to ignore them. I'll talk to the other devs about this. Tricky problme

trishorts avatar May 02 '22 14:05 trishorts

Regarding the missing scan problem, I agree that it would be nice if the software could issue a warning in that case. If you don't like that, an alternative would be to introduce a command line option that controls this behavior (e.g., --missing-scans warn|fail). The default could be "fail," but the error message could advise the end user to switch to "--missing-scans warn" if they want to.

Regarding the column name, that's my bad. I inadvertently renamed the column when I was swapping in the filenames for the indices.

Regarding the protein IDs, unfortunately we have no control over what naming scheme our users employ. All we do is take the IDs from the FASTA file. If there are multiple protein IDs associated with a single peptide, these are separated with commas. I am not sure why you need both the accession and the gene name. Are you assuming that the proteins the user is interested will always be in a public database somewhere? I don't think we can assume that in general (e.g., a database generated from an RNA-seq sample).

wsnoble avatar May 02 '22 20:05 wsnoble

Regarding the protein IDs, unfortunately we have no control over what naming scheme our users employ. All we do is take the IDs from the FASTA file. If there are multiple protein IDs associated with a single peptide, these are separated with commas. I am not sure why you need both the accession and the gene name. Are you assuming that the proteins the user is interested will always be in a public database somewhere? I don't think we can assume that in general (e.g., a database generated from an RNA-seq sample).

We can take every unique string as "protein" name. That will mean that you would get proteins name "sp|P23588|IF4B_Human". I had made a parser based on your previous example data where accessions were separated by genes using the vertical bar.

If they don't mind post-processing everything, I guess it doesn't matter on our end. The gene, as you say, is unnecessary i believe. Can I assume all those PF3D7s should be treated is separate proteins?

trishorts avatar May 02 '22 20:05 trishorts

Regarding the missing scan problem, I agree that it would be nice if the software could issue a warning in that case. If you don't like that, an alternative would be to introduce a command line option that controls this behavior (e.g., --missing-scans warn|fail). The default could be "fail," but the error message could advise the end user to switch to "--missing-scans warn" if they want to.

I am updating our mzML reader to ignore empty scans and provide their scan numbers. see https://github.com/smith-chem-wisc/mzLib/pull/628

trishorts avatar May 02 '22 20:05 trishorts