Some proteins appearing in protein file but not peptide file

Open ndtivendale opened this issue 2 years ago • 60 comments

Describe the bug In my output files, the protein.tsv file contains some proteins that are not present in the peptide.tsv file or the [filename].tsv file. Can you explain to me what is going on?

Also, can you explain the difference between the peptide.tsv file and the [filename].tsv file? I understand they have different headers, but what is the difference apart from that?


If you're submitting a bug report, please attach log file

The log file can be saved from FragPipe:

  • by clicking the Export Log button on the Run tab.
  • or just copy text from the output console on Run tab to a text file.

ndtivendale avatar Apr 13 '22 00:04 ndtivendale

Hi @ndtivendale, to help you I first need to see what you're doing, so please share your files, including the logs and the outputs. Note that [filename].tsv is not part of the Philosopher output.

prvst avatar Apr 13 '22 13:04 prvst

OK. Here are the output files for one rep. milla00490592b.xlsx protein_t-24_2.xlsx psm_t-24_2.xlsx

milla00490592b was the original file name. What is the difference between this output file and the psm output file?

And why do some proteins appear in the protein file but not the psm or milla00490592b file?

Here is the log. log_2022-04-13_11-37-40.txt

ndtivendale avatar Apr 18 '22 01:04 ndtivendale

You are using an old version of the tools

Version info: FragPipe version 16.0 MSFragger version 3.3 Philosopher version 4.0.0 (build 1626989421)

Please upgrade to the latest FragPipe 17.1 and the latest Philosopher. There have been many fixes since the versions you used, so we cannot go back to look at your files. If you still see an issue with the latest versions, we will be able to investigate.

Best Alexey

anesvi avatar Apr 18 '22 02:04 anesvi

OK, but when I tried to do that recently, I did not get any Arabidopsis proteins. Only things like human keratin and porcine trypsin. There is a problem with the new version (not sure which tool is causing the problem), which I reported in a separate issue thread.

ndtivendale avatar Apr 18 '22 03:04 ndtivendale

@ndtivendale Are you using a database from TAIR ?

prvst avatar Apr 18 '22 14:04 prvst

Yep. I can't upload it here because it's in FASTA format, but here it is converted to a data frame in R. tair10_fasta_dataframe.csv

ndtivendale avatar Apr 19 '22 00:04 ndtivendale

@ndtivendale You had an issue because the version you have does not know how to read a TAIR database header. You can grab a pre-release version which contains the function you need

https://www.dropbox.com/work/Public/Philosopher/Release%20Candidate

prvst avatar Apr 19 '22 13:04 prvst

@prvst Thanks, I will do that.

That's solved one issue. The other issues remain though.

  1. Why are some proteins appearing in the protein file but not the peptide file?
  2. What is the difference between the peptide.tsv, psm.tsv and the [filename].tsv file? They are different lengths, I can see that. And for individual protein IDs there are fewer peptides in the peptide file than the psm.tsv and [filename].tsv file, so I assume it's some sort of filtering step, but I would like to know what filters are being applied.

ndtivendale avatar Apr 20 '22 00:04 ndtivendale

@prvst is there another way you can share that file? I can't seem to get it from dropbox. Something to do with needing a Dropbox Business Account.

ndtivendale avatar Apr 20 '22 00:04 ndtivendale

@ndtivendale [filename].tsv files are not created by Philosopher, so I can't tell you exactly why they are different. Regarding your first question, I'm afraid I'm going to need some examples. If you can generate new outputs with the version in the link, then I'll be able to look at that for you. The Dropbox link should be public. I reviewed the permissions. Please try again.

https://www.dropbox.com/sh/0mr4zbprhaxk453/AADdLawYWnQ_-tekDkdnLscWa?dl=0

prvst avatar Apr 20 '22 13:04 prvst

@ndtivendale the [filename].tsv files are generated by MSFragger before any FDR filtering is done, so they will contain more (typically many more) entries than the Philosopher outputs since they have no FDR applied. They are typically not needed for most analyses - you can set the MSFragger output to just "pepXML" rather than "pepXML_tsv" if you don't want them.

dpolasky avatar Apr 20 '22 13:04 dpolasky

@dpolasky OK. But what is the difference between psm and peptide outputs?

ndtivendale avatar Apr 20 '22 23:04 ndtivendale

The PSM table contains the list of all (FDR-approved) PSMs from the experiment. The peptide table is the list of (FDR-approved) peptides. To make this list, we collapse all PSMs to the peptide sequence.
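
As a rough illustration of "collapsing PSMs to the peptide sequence", here is a minimal Python sketch. It is not Philosopher's actual code: the column names (`Spectrum`, `Peptide`, `Protein`), the `Spectral Count` summary and the protein IDs are made-up assumptions for the example, and the real tables carry many more columns.

```python
from collections import defaultdict


def collapse_psms_to_peptides(psm_rows):
    """Group PSM rows by peptide sequence and return one summary row per peptide.

    Hypothetical helper: the row keys are assumptions, not the real psm.tsv schema.
    """
    grouped = defaultdict(list)
    for row in psm_rows:
        grouped[row["Peptide"]].append(row)

    peptides = []
    for sequence, rows in grouped.items():
        peptides.append({
            "Peptide": sequence,
            "Spectral Count": len(rows),      # how many PSMs support this peptide
            "Protein": rows[0]["Protein"],    # protein assigned to the first PSM
        })
    return peptides


# Toy data with invented identifiers.
psms = [
    {"Spectrum": "scan_1", "Peptide": "AAAK", "Protein": "AT1G00001.1"},
    {"Spectrum": "scan_2", "Peptide": "AAAK", "Protein": "AT1G00001.1"},
    {"Spectrum": "scan_3", "Peptide": "CCCR", "Protein": "AT1G00002.1"},
]
peptide_table = collapse_psms_to_peptides(psms)
```

Three PSMs become two peptide rows here, which is why the peptide table is shorter than the PSM table for the same protein.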

prvst avatar Apr 20 '22 23:04 prvst

@prvst OK, I'll try that. In the meantime, here are three examples of proteins that are in the protein file but not the psm file for the replicate that I posted at the beginning of the thread. There are 175 more examples of such proteins in this replicate. AT1G12010.1 AT1G47500.1 AT2G29470.1

ndtivendale avatar Apr 21 '22 00:04 ndtivendale

@prvst, thank you.

ndtivendale avatar Apr 21 '22 01:04 ndtivendale

@dpolasky, so there is no filtering applied to the [filename] file at all?

ndtivendale avatar Apr 21 '22 01:04 ndtivendale

@ndtivendale there's almost no filtering - MSFragger will only report spectra with at least minimum_peaks ions in the spectrum and min_matched_fragments ions matched to a peptide, but nothing other than that (nothing related to score or FDR control).

dpolasky avatar Apr 21 '22 15:04 dpolasky

@dpolasky Thanks. How does the FDR filtering work? This may be getting a little off topic, but I want to understand.

ndtivendale avatar Apr 22 '22 00:04 ndtivendale

@ndtivendale Without going too much into the details, the 'raw' output from MSFragger gets modeled in PeptideProphet (and has protein inference done in ProteinProphet) before the actual FDR filtering is done by Philosopher. Judging from the log files, you're using the "sequential" filtering approach, which means that a first-pass FDR is done to 1% at the PSM, ion, peptide, and protein levels, and then a second pass is done to remove any PSMs, etc. from proteins that did not pass the protein FDR filter. You can see that happening in the log: the output of the filter command shows the numbers passing the filter after the first and second passes, and the number of decoy PSMs after the second pass typically drops to well below the actual set FDR. This is done within each experiment group, so synchronizing across many groups can be tricky, as different proteins may pass FDR in different sub-groups.
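
The two-pass idea can be sketched in a few lines of Python. This is a heavily simplified illustration, not what Philosopher does internally: it uses a plain decoy/target ratio as the FDR estimate, filters only at the PSM and protein levels (the ion and peptide levels are omitted), and scores each protein by its best PSM rather than by a ProteinProphet probability. The function names, the row keys and the 25% threshold on the toy data are all assumptions for the example.

```python
def filter_at_fdr(items, fdr=0.01):
    """Keep the longest top-scoring prefix whose decoy/target ratio is <= fdr."""
    ranked = sorted(items, key=lambda x: x["score"], reverse=True)
    targets = decoys = 0
    cutoff = 0
    for i, item in enumerate(ranked, start=1):
        if item["is_decoy"]:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= fdr:
            cutoff = i
    return ranked[:cutoff]


def sequential_filter(psms, fdr=0.01):
    """Return (first_pass, second_pass) PSM lists for a two-pass filter."""
    # First pass: filter the PSMs themselves.
    first_pass = filter_at_fdr(psms, fdr)

    # Protein level: score each protein by its best PSM (a simplification).
    best = {}
    for psm in psms:
        prot = psm["protein"]
        if prot not in best or psm["score"] > best[prot]["score"]:
            best[prot] = {"protein": prot, "score": psm["score"],
                          "is_decoy": psm["is_decoy"]}
    passing = {p["protein"] for p in filter_at_fdr(best.values(), fdr)}

    # Second pass: drop PSMs whose protein failed the protein-level filter.
    second_pass = [p for p in first_pass if p["protein"] in passing]
    return first_pass, second_pass


# Toy data: protein "A" has four strong PSMs, "B" one weak PSM, "rev_D" is a decoy.
toy_psms = (
    [{"protein": "A", "score": s, "is_decoy": False} for s in (10, 9, 8, 7)]
    + [{"protein": "rev_D", "score": 6, "is_decoy": True}]
    + [{"protein": "B", "score": 5, "is_decoy": False}]
)
first_pass, second_pass = sequential_filter(toy_psms, fdr=0.25)
```

On this toy input all six PSMs survive the first pass, but only protein "A" survives the protein-level filter, so the second pass removes the decoy PSM and the PSM from "B". That mirrors the drop in decoy PSMs between the two passes described above.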

dpolasky avatar Apr 22 '22 14:04 dpolasky

OK. That answers one question and I thank you all for that. But what about the main issue I raised? Why are some proteins present in the protein file but not the psm file? I could understand the other way around. I could understand a peptide being mapped to a particular protein in the psm file but then filtered out in the protein file, but if it's present in the protein file, surely it should be present in the psm file as well, right?

ndtivendale avatar May 10 '22 01:05 ndtivendale

Hi @ndtivendale. As we mentioned above, and in the other issue you opened, you're reporting a problem with a quite old version of philosopher and fragpipe. Please update your tools to the latest version, run them again, and report back if you still see discrepancies.

prvst avatar May 10 '22 13:05 prvst

OK, so I have generated some data from the new version. I've attached the psm and protein files for one sample. There are proteins that appear in the protein file, but not in the psm file. For example AT1G01820.1. psm_000_1.xlsx protein_000_1.xlsx log_2022-05-18_20-05-04.txt

ndtivendale avatar May 20 '22 07:05 ndtivendale

The protein is actually in the file, but it's classified as an alternative protein, not as a main identification. I'll try to reproduce your case here.

prvst avatar May 20 '22 13:05 prvst

@prvst Where? Sorry, I am confused.

ndtivendale avatar May 25 '22 06:05 ndtivendale

AT1G01820.1 can be found in the Mapped Proteins column in the psm table, and it's in the Protein column in the protein table.
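
One way to run this check across a whole file is sketched below in Python. It assumes tab-separated tables with a `Protein` column in both files and a comma-separated `Mapped Proteins` column in the psm table; the separator, the helper names and the protein IDs in the toy data are assumptions for illustration, so adjust them to match your own files.

```python
import csv
import io


def proteins_in_psm_table(psm_file):
    """Collect every protein ID seen in the Protein or Mapped Proteins columns."""
    seen = set()
    for row in csv.DictReader(psm_file, delimiter="\t"):
        seen.add(row["Protein"].strip())
        mapped = row.get("Mapped Proteins") or ""
        # Assumption: alternative proteins are listed comma-separated.
        seen.update(p.strip() for p in mapped.split(",") if p.strip())
    return seen


def proteins_missing_from_psms(protein_file, psm_file):
    """Return protein-table IDs that appear nowhere in the psm table."""
    in_psms = proteins_in_psm_table(psm_file)
    return [
        row["Protein"].strip()
        for row in csv.DictReader(protein_file, delimiter="\t")
        if row["Protein"].strip() not in in_psms
    ]


# Toy tables with invented identifiers; real files would be opened with open(...).
psm_tsv = (
    "Peptide\tProtein\tMapped Proteins\n"
    "AAAK\tAT1G00001.1\tAT1G00002.1, AT1G00003.1\n"
    "CCCR\tAT1G00004.1\t\n"
)
protein_tsv = "Protein\nAT1G00002.1\nAT1G00005.1\n"

missing = proteins_missing_from_psms(io.StringIO(protein_tsv), io.StringIO(psm_tsv))
```

In the toy data, AT1G00002.1 is found only through the Mapped Proteins column, so it is not reported; AT1G00005.1 appears in neither column and is the only ID returned.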

prvst avatar May 25 '22 18:05 prvst

@prvst OK. So there is evidence that AT1G01820.1 is in the sample but the peptide supporting it is better mapped to another protein? I'm confused. Is there a tutorial I can look at?

ndtivendale avatar May 26 '22 03:05 ndtivendale

Exactly, your target protein is sharing peptides with another protein. The tools that perform the validation and the inference determined that they should consider the other protein the "main" identification based on their criteria of validation. If you want to read more about how these steps happen, and what the tools are doing, I suggest this really nice review from Alexey.

prvst avatar May 26 '22 13:05 prvst

@prvst OK, but I still have a problem. It's a smaller problem now, but there are still proteins that appear in the protein file but not in the psm table, even in the Mapped Proteins column. For example, in the files I shared earlier, AT4G08140.1 is listed in the protein file but not in the psm file.

ndtivendale avatar May 31 '22 02:05 ndtivendale

@ndtivendale Can you generate a new set of results using the releases from Friday? Let's check if you still see this happening with the latest release; if so, I can take a look for you.

prvst avatar May 31 '22 14:05 prvst

OK, I did that. There are nowhere near as many examples now, but there are still some. There are two examples now in the attached files. The examples are AT4G05590.2 and AT4G05590.1. protein_t-24_2.tsv.xlsx psm_t-24_2.tsv.xlsx

ndtivendale avatar Jun 02 '22 01:06 ndtivendale