
Fraction Column is lost and reevaluated by MSStats

Open tillenglert opened this issue 4 years ago • 19 comments

I'm currently adding MSFragger as a search engine for ProteomicsLFQ. When running the minimal test profile, I ran into an issue with MSstats. The tool could not figure out the fractionation of the samples and stopped execution with the following message:

"** It is hard to find the same fractionation across sample, due to lots of overlapped features between fractionations.
	                 Please add Fraction column in input."

While searching for the cause, I looked into the MSstats source code, specifically the function OpenMStoMSstatsFormat, which preprocesses the data before the dataProcess function is called. This function keeps only the required columns of the out.csv produced by ProteomicsLFQ, which are the following:

requiredinput.general <- c("ProteinName", "PeptideSequence", "PrecursorCharge",
                           "FragmentIon", "ProductCharge", "IsotopeLabelType",
                           "Condition", "BioReplicate", "Run", "Intensity")

source: https://rdrr.io/bioc/MSstats/src/R/OpenMStoMSstatsFormat.R (MSstats 3.22)

This leads to the loss of the Fraction column. It did not cause an error with the Comet or MSGF+ search engines, because MSstats analyses the features and can detect whether the runs are technical replicates or fractionated samples if the features are clear enough. I guess the problem with MSFragger was that it found too many overlapping features and, at the same time, too many duplicated features across fractions and samples.
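
The column loss can be illustrated with a minimal Python sketch. This is not MSstats code: `REQUIRED_COLUMNS` mirrors the requiredinput.general vector quoted from OpenMStoMSstatsFormat, while `keep_columns` and the sample row are hypothetical and only show how selecting the required columns silently discards an extra Fraction column.

```python
# Column names copied from the requiredinput.general vector quoted in this issue.
REQUIRED_COLUMNS = [
    "ProteinName", "PeptideSequence", "PrecursorCharge",
    "FragmentIon", "ProductCharge", "IsotopeLabelType",
    "Condition", "BioReplicate", "Run", "Intensity",
]


def keep_columns(rows, columns):
    """Return copies of the rows restricted to the given columns (hypothetical helper)."""
    return [{name: row[name] for name in columns} for row in rows]


# Hypothetical row in the style of the ProteomicsLFQ out.csv; all values are made up.
rows = [{
    "ProteinName": "P1", "PeptideSequence": "PEPTIDEK", "PrecursorCharge": "2",
    "FragmentIon": "NA", "ProductCharge": "0", "IsotopeLabelType": "L",
    "Condition": "A", "BioReplicate": "1", "Run": "1", "Intensity": "1000.0",
    "Fraction": "2",
}]

# Selecting only the required columns drops Fraction without any warning.
required_only = keep_columns(rows, REQUIRED_COLUMNS)

# Asking for Fraction explicitly keeps it.
with_fraction = keep_columns(rows, REQUIRED_COLUMNS + ["Fraction"])

print("Fraction" in required_only[0])
print("Fraction" in with_fraction[0])
```

Once the column is gone, a downstream step can only guess the fractionation from the features themselves, which is the step that failed here.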

When testing the newest version of MSstats (4.2), it could actually assign the fractions correctly. The latest version depends on MSstatsConvert, which contains the converters for the different MS tools. So maybe switching to it would make the ProteomicsLFQ pipeline more robust, especially since the fraction information is currently lost.

tillenglert avatar Nov 23 '21 11:11 tillenglert

I think it would be better if OpenMS just exported a Fraction column correctly, instead of hoping for a correct guess by MSstats.

jpfeuffer avatar Dec 01 '21 23:12 jpfeuffer

@timosachsenberg I have no idea why this is not the case. I thought we export everything.

jpfeuffer avatar Dec 01 '21 23:12 jpfeuffer

I also did a PR to MSstats once to address this issue. Maybe it did not make it into 3.22? Did you check 3.22.1 or whatever else came before 4? I never got 4 to work with newer OpenMS versions, because OpenMS does not build on bioconda anymore and is incompatible with some dependencies, I think.

jpfeuffer avatar Dec 02 '21 03:12 jpfeuffer

https://github.com/Vitek-Lab/MSstats/commit/d78e2aadb6732d363a04503b76dc2297384c30c9

jpfeuffer avatar Dec 02 '21 03:12 jpfeuffer

https://github.com/Vitek-Lab/MSstats/blob/3a3acbbd37f3cdebbb8db7bf165c96306f732e2d/R/converters.R#L234

Seems not to be in the code anymore, after they changed their code structure!

tillenglert avatar Dec 02 '21 05:12 tillenglert

I tested with 3.22.1, which should be the latest version before v4.

And yes, v4 cannot currently be used in the nf-core/proteomicslfq Docker image. For testing (v4.2.0) I had to build another container.

tillenglert avatar Dec 02 '21 06:12 tillenglert

> @timosachsenberg I have no idea why this is not the case. I thought we export everything.

Yeah, we checked. We export it, and it seems that the issue is on the MSstats side (see Till's comments).

timosachsenberg avatar Dec 02 '21 07:12 timosachsenberg

Can you find out why it is not compatible? In theory the openms::openms2.7.0pre package should be built with the latest conda packages. 2.6.0 from bioconda is of course outdated. It could be that some thirdparties clash in the openms-thirdparty package. I already removed some of them (maybe some of them can be fixed by conda rebuilds/updates). In the worst case we use openms and only add the ones we need separately.

I think this would be the way forward. Otherwise we need to monkey patch the function in our R code. I remember having done such a thing before in my own scripts.
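
The patch itself is not shown in this thread. As a language-neutral illustration of the idea (restoring the Fraction column after a conversion step has dropped it), here is a minimal Python sketch. It assumes that every Run belongs to exactly one Fraction; `fraction_by_run` and `restore_fraction` are hypothetical helpers, not MSstats or OpenMS functions, and the rows are made up.

```python
def fraction_by_run(original_rows):
    """Map each Run to its Fraction; fail if a run maps to two fractions."""
    mapping = {}
    for row in original_rows:
        run, fraction = row["Run"], row["Fraction"]
        # setdefault stores the first fraction seen for a run and returns it
        # on later calls, so any mismatch signals inconsistent input.
        if mapping.setdefault(run, fraction) != fraction:
            raise ValueError(f"Run {run!r} has conflicting Fraction values")
    return mapping


def restore_fraction(converted_rows, mapping):
    """Return copies of the converted rows with the Fraction column re-attached."""
    return [dict(row, Fraction=mapping[row["Run"]]) for row in converted_rows]


# Hypothetical rows as exported (with Fraction) ...
original = [
    {"Run": "1", "Fraction": "1", "Intensity": "10.0"},
    {"Run": "2", "Fraction": "2", "Intensity": "20.0"},
]
# ... and after a conversion step that kept only some of the columns.
converted = [
    {"Run": "1", "Intensity": "10.0"},
    {"Run": "2", "Intensity": "20.0"},
]

restored = restore_fraction(converted, fraction_by_run(original))
print(restored)
```

Failing loudly on a conflicting mapping is deliberate: silently picking one fraction would reintroduce the kind of guessing this issue is about.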

jpfeuffer avatar Dec 02 '21 15:12 jpfeuffer

proteomicslfq_docker_build.log

Attached is the log of the dockerfile build of nf-core/proteomicslfq with the following environment.yml:

name: nf-core-proteomicslfq-1.0.0
channels:
  - openms
  - conda-forge
  - bioconda
dependencies:
  - openms::openms
  - openms::openms-thirdparty
  - bioconda::bioconductor-msstats=4 # will include R
  - bioconda::sdrf-pipelines=0.0.9 # for SDRF conversion
  - conda-forge::r-ptxqc=1.0.5 # for QC reports
  - conda-forge::xorg-libxt=1.2.0 # until this R fix is merged: https://github.com/conda-forge/r-base-feedstock/pull/128
  - conda-forge::fonts-conda-ecosystem=1 # for the fonts in QC reports
  - conda-forge::python=3.8.5
  - conda-forge::markdown=3.2.2
  - conda-forge::pymdown-extensions=8.0.1
  - conda-forge::pygments=2.7.1

So there are conflicts but conda can't figure out where.

tillenglert avatar Dec 03 '21 15:12 tillenglert

I would try "mamba" to find the conflicts. Conda is basically useless for this. And in this case even seems to be bugged. I think you can just install mamba instead of conda and use the same commands.

jpfeuffer avatar Dec 03 '21 19:12 jpfeuffer

After some testing I finally managed to include MSstats v4.2, but for this I needed to change the version of python (to v3.9) and ptxqc (to v1.0.12). Unfortunately, this leads to an error in ptxqc when running the test profile. The current environment is:

name: nf-core-proteomicslfq-1.0.0
channels:
  - openms
  - conda-forge
  - bioconda
dependencies:
  - openms::openms=2.7.0pre
  - openms::openms-thirdparty=2.7.0pre
  - bioconda::bioconductor-msstats=4.2 # will include R
  - bioconda::sdrf-pipelines=0.0.9 # for SDRF conversion
  - conda-forge::r-ptxqc=1.0.12 # for QC reports
  - conda-forge::xorg-libxt=1.2.0 # until this R fix is merged: https://github.com/conda-forge/r-base-feedstock/pull/128
  - conda-forge::fonts-conda-ecosystem=1 # for the fonts in QC reports
  - conda-forge::python=3.9
  - conda-forge::markdown=3.2.2
  - conda-forge::pymdown-extensions=8.0.1
  - conda-forge::pygments=2.7.1

The error of ptxqc is the following:

Loading required package: PTXQC
Loading package PTXQC (version 1.0.12)
Error in file.exists(pattern = mqpar_filename) : invalid 'file' argument
Calls: createReport -> getMetaFilenames -> getMQPARValue -> file.exists
In addition: Warning messages:
1: In (function (parents, id = names(parents), name = id, obsolete = setNames(nm = id, :
  Some parent terms not found: MS:1001456
2: In (function (parents, id = names(parents), name = id, obsolete = setNames(nm = id, :
  Some parent terms not found: UO:0000000
Execution halted

tillenglert avatar Dec 05 '21 07:12 tillenglert

I will ask @cbielow if he knows what the issue is here

timosachsenberg avatar Dec 05 '21 07:12 timosachsenberg

I cannot find anything obviously wrong with the code in PTXQC. There should be a warning() (not an error) on the console which provides further details if mqpar.xml cannot be found, but your output has none... this is a bit strange. Can someone point me to the script and the data that you are actually running?!

cbielow avatar Dec 06 '21 08:12 cbielow

Why does it want an mqpar.xml at all? We input mztab.

jpfeuffer avatar Dec 06 '21 17:12 jpfeuffer

It's quite an unusual combination indeed, but the mqpar.xml is used to find some threshold parameters, if available.

cbielow avatar Dec 07 '21 08:12 cbielow

The script I'm using is this nextflow script:

https://github.com/tillenglert/proteomicslfq/blob/master/main.nf#L1304

with this config (testfiles): https://github.com/tillenglert/proteomicslfq/blob/master/conf/test.config#L20

As I'm still working on msfragger I tested the ptxqc process with comet. The logs and inputfiles are attached to this comment: ptxqc_logs.zip

tillenglert avatar Dec 07 '21 08:12 tillenglert

The error is fixed in the current development version of PTXQC. It will be some time before the new version is published.

Since this is a regression, the last working version should be PTXQC v1.00.10 - May 2021. If you can use that version for the time being, the bug should be resolved.

cbielow avatar Dec 07 '21 08:12 cbielow

Ah perfect! I hadn't tried this version, but it works and is compatible with the remaining packages.

This is the environment I'm currently using, which works for msstats and ptxqc:

name: nf-core-proteomicslfq-1.0.0
channels:
  - openms
  - conda-forge
  - bioconda
dependencies:
  - openms::openms=2.7.0pre
  - openms::openms-thirdparty=2.7.0pre
  - bioconda::bioconductor-msstats=4.2 # will include R
  - bioconda::sdrf-pipelines=0.0.9 # for SDRF conversion
  - conda-forge::r-ptxqc=1.0.10 # for QC reports
  - conda-forge::xorg-libxt=1.2.0 # until this R fix is merged: https://github.com/conda-forge/r-base-feedstock/pull/128
  - conda-forge::fonts-conda-ecosystem=1 # for the fonts in QC reports
  - conda-forge::python=3.9
  - conda-forge::markdown=3.2.2
  - conda-forge::pymdown-extensions=8.0.1
  - conda-forge::pygments=2.7.1
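
To make the changes easier to review, here is a small Python sketch that compares the dependency pins of the first environment posted in this thread with the working one above. The helpers `parse_pins` and `changed_pins` are hypothetical and only handle the simple `channel::package=version # comment` form used here, not the full conda spec syntax.

```python
def parse_pins(dependency_lines):
    """Parse 'channel::package=version # comment' entries into {package: version}."""
    pins = {}
    for line in dependency_lines:
        spec = line.split("#", 1)[0].strip()    # drop a trailing comment
        name, _, version = spec.partition("=")  # version is "" when unpinned
        package = name.split("::", 1)[-1]       # drop the channel prefix
        pins[package] = version or None
    return pins


def changed_pins(old, new):
    """Return {package: (old_version, new_version)} for pins that differ."""
    return {
        package: (old.get(package), new.get(package))
        for package in sorted(set(old) | set(new))
        if old.get(package) != new.get(package)
    }


# Dependency entries copied from the first environment in this thread.
first_env = parse_pins([
    "openms::openms",
    "openms::openms-thirdparty",
    "bioconda::bioconductor-msstats=4 # will include R",
    "bioconda::sdrf-pipelines=0.0.9 # for SDRF conversion",
    "conda-forge::r-ptxqc=1.0.5 # for QC reports",
    "conda-forge::xorg-libxt=1.2.0",
    "conda-forge::fonts-conda-ecosystem=1",
    "conda-forge::python=3.8.5",
    "conda-forge::markdown=3.2.2",
    "conda-forge::pymdown-extensions=8.0.1",
    "conda-forge::pygments=2.7.1",
])

# Dependency entries copied from the working environment.
working_env = parse_pins([
    "openms::openms=2.7.0pre",
    "openms::openms-thirdparty=2.7.0pre",
    "bioconda::bioconductor-msstats=4.2 # will include R",
    "bioconda::sdrf-pipelines=0.0.9 # for SDRF conversion",
    "conda-forge::r-ptxqc=1.0.10 # for QC reports",
    "conda-forge::xorg-libxt=1.2.0",
    "conda-forge::fonts-conda-ecosystem=1",
    "conda-forge::python=3.9",
    "conda-forge::markdown=3.2.2",
    "conda-forge::pymdown-extensions=8.0.1",
    "conda-forge::pygments=2.7.1",
])

for package, (old, new) in changed_pins(first_env, working_env).items():
    print(f"{package}: {old} -> {new}")
```

An unpinned package is reported as `None`, so the two openms entries show up as newly pinned rather than as version bumps.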

tillenglert avatar Dec 07 '21 09:12 tillenglert

Feel free to open a PR with the environment update

jpfeuffer avatar Dec 07 '21 13:12 jpfeuffer