pyopenms-docs icon indicating copy to clipboard operation
pyopenms-docs copied to clipboard

Get wrong MassTraceDetection results

Open xieyying opened this issue 1 year ago • 17 comments

Describe the problem you encountered I simulated a LC-MS data from a compound and analyze it using pyopenms through Untargeted Metabolomics Pre-Processing pipline. But I can not get the expected mass traces even by exhaustive tuning the parameters. I also view our data using TOPPView. and the expected mass trace all can be found. I also check iod_df generated by MzMLFile().load() and I also can find our expected mass.

To Reproduce Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

What should be happening I should find the expected mass trace at m/z 148.5713 (the most intense), 149.0726, 149.5749, and 150.0755. But now, I just can find mass traces at m/z 149.5741 and 150.0755 by MassTraceDetection().

Screenshots image

System information:

  • OS: Windows-10-10.0.19045-SP0/Linux-5.4.0-159-generic-x86_64-with-glibc2.31
  • OS Version: 10.0.19045-SP0
  • Python version: 3.10.13/3.11.4
  • pyopenms version: 3.1.0
  • How did you install pyopenms? Please cross a box with an X.
    • [ ] conda::bioconda: Specify your conda/mamba command and/or your environment with conda list.
    • [ ] conda::openms: Specify your conda/mamba command and/or your environment with conda list.
    • [X ] pip
    • [ ] nightly wheel
    • [ ] built from source

Additional context The .mzML and .py file is not supported to uploaded. and Here I just pasted our code: class FeatureDetection: """ Extract features from mzML files using pyopenms """ def init(self, file): self.file = file self.exp = oms.MSExperiment() self.mass_traces = [] self.mass_traces_deconvol = [] self.feature_map = oms.FeatureMap() self.chrom_out = []

def load_file(self):
    """Load the mzML file"""
    oms.MzMLFile().load(self.file, self.exp)

def get_dataframes(self):
    """Get the ion dataframes and filter them"""
    self.ion_df = self.exp.get_ion_df()
    self.ion_df = self.ion_df[self.ion_df['inty'] > 0.0]
    # self.ion_df.to_csv(r'C:\Users\xyy\Desktop\test\tem\ion_df.csv', index=False)
    

def mass_trace_detection(self):
    """Detect the mass traces"""
    mtd = oms.MassTraceDetection()
    mtd_par = mtd.getDefaults()
    # print(mtd_par)
    mtd_par.setValue("mass_error_ppm", 20.0)
    mtd_par.setValue('min_trace_length', 1.0)
    mtd_par.setValue("noise_threshold_int", 0.0)
    mtd_par.setValue("chrom_peak_snr", 0.0)
    mtd_par.setValue('trace_termination_criterion', 'sample_rate')
    mtd_par.setValue('min_sample_rate', 0.5)
    mtd.setParameters(mtd_par)
    mtd.run(self.exp, self.mass_traces, 0)
    print(len(self.mass_traces))
    for i in range(len(self.mass_traces)):
        trace = self.mass_traces[i]
        print(f"Trace {i}:")
        print(f"Centroid MZ: {trace.getCentroidMZ()}")
        print(f"Centroid RT: {trace.getCentroidRT()}")
        print(f"Size: {trace.getSize()}")
        print("--------------------")    

def elution_peak_detection(self):
    """Detect the elution peaks"""
    epd = oms.ElutionPeakDetection()
    epd_par = epd.getDefaults()
    epd_par.setValue("width_filtering", "fixed")
    # print(epd_par)
    epd.setParameters(epd_par)
    epd.detectPeaks(self.mass_traces, self.mass_traces_deconvol)
    print(len(self.mass_traces_deconvol))


def feature_detection(self):
    """Detect the features"""
    ffm = oms.FeatureFindingMetabo()
    ffm_par = ffm.getDefaults()
    ffm_par.setValue("local_rt_range", 10.0)
    ffm_par.setValue("local_mz_range", 6.5)
    ffm_par.setValue("chrom_fwhm", 5.0)
    ffm_par.setValue('enable_RT_filtering', 'true')
    ffm_par.setValue('mz_scoring_by_elements', 'true')
    ffm_par.setValue("elements", "CHNOPSClBrFeBSeIF")
    ffm_par.setValue('isotope_filtering_model','none')
    ffm_par.setValue("remove_single_traces", "true")
    ffm_par.setValue("report_convex_hulls", 'true')
    ffm.setParameters(ffm_par)
    ffm.run(self.mass_traces_deconvol, self.feature_map, self.chrom_out)
    self.feature_map.setUniqueIds()
    self.feature_map.setPrimaryMSRunPath([self.file.encode()])
    print(self.feature_map.get_df())
    print(len(self.feature_map.get_df()))

def run(self):
    """Run the feature detection process"""
    self.load_file()
    self.get_dataframes()
    self.mass_trace_detection()
    self.elution_peak_detection()
    self.feature_detection()
    return self.feature_map

xieyying avatar Jun 16 '24 01:06 xieyying

Here I post the link of mass data: https://drive.google.com/drive/folders/1CZBRSU_hdhq7UHXdqJYQSrMUVcSfarGy?usp=drive_link

xieyying avatar Jun 16 '24 02:06 xieyying

Hi! Just to clarify, what is the exact output? You are saying that already the first step (i.e. MassTraceDetection) failed, right?

Do you think you could reduce the example/code to the first part that fails?

jpfeuffer avatar Jun 16 '24 07:06 jpfeuffer

Sure, Here is the reduced code: import pyopenms as oms

class FeatureDetection: """ Extract features from mzML files using pyopenms """ def init(self, file): self.file = file self.exp = oms.MSExperiment() self.mass_traces = []

def load_file(self):
    """Load the mzML file"""
    oms.MzMLFile().load(self.file, self.exp)

def get_dataframes(self):
    """Get the ion dataframes and filter them"""
    self.ion_df = self.exp.get_ion_df()
    print(self.ion_df)
    
def mass_trace_detection(self):
    """Detect the mass traces"""
    mtd = oms.MassTraceDetection()
    mtd_par = mtd.getDefaults()
    # print(mtd_par)
    mtd_par.setValue("mass_error_ppm", 20.0)
    mtd_par.setValue('min_trace_length', 1.0)
    mtd_par.setValue("noise_threshold_int", 0.0)
    mtd_par.setValue("chrom_peak_snr", 0.0)
    mtd_par.setValue('trace_termination_criterion', 'sample_rate')
    mtd_par.setValue('min_sample_rate', 0.5)        
    mtd.setParameters(mtd_par)
    mtd.run(self.exp, self.mass_traces, 0)
    print(len(self.mass_traces))
    for i in range(len(self.mass_traces)):
        trace = self.mass_traces[i]
        print(f"Trace {i}:")
        print(f"Centroid MZ: {trace.getCentroidMZ()}")
        print(f"Centroid RT: {trace.getCentroidRT()}")
        print(f"Size: {trace.getSize()}")
        print("--------------------")    

def run(self):
    """Run the feature detection process"""
    self.load_file()
    self.get_dataframes()
    self.mass_trace_detection()

And here is the results. The most intense ion at m/z 148.57 can be get at the step self.ion_df = self.exp.get_ion_df(), but the mass trace on 148.57 can not get at the step of mass_trace_detection. image

xieyying avatar Jun 16 '24 10:06 xieyying

@xieyying I sent a request to access your example data yesterday, could you please accept that? I will take a look.

axelwalter avatar Jun 18 '24 06:06 axelwalter

The data can be directly downloaded without need to request. https://drive.google.com/drive/folders/11Nz3HQvyH0_yk3DuL4DoW81iQuAJ2JQV?usp=drive_link

For this data, I try the following parameter in the step of mass_trace_detection. mtd_par.setValue('trace_termination_criterion', 'sample_rate') mtd_par.setValue('min_sample_rate', 0.01) I can find the expected m/z, but not using other 'min_sample_rate‘(0.1-1) It seems the parameter of 'trace_termination_criterion' affects the result significantly. But, what real meaning of trace_termination_criterion? I can not get its real meaning from the instruction of these parameters. And in another analysis of simulated LC-MS spectrum generated from hundreds of formula, we also encountered this problems. Many features can not be found. And the parameter of 'trace_termination_criterion' seems to affect the result significantly. In this case, the most numbers of features were found using min_sample_rate', 0.1

xieyying avatar Jun 18 '24 09:06 xieyying

@xieyying Which of these files did you use in your example?

axelwalter avatar Jun 18 '24 11:06 axelwalter

trace_termination_criterion:

  • outlier: Uses 'trace_termination_outliers' and 'min_sample_rate'. Stops tracing from the apex when n='trace_termination_outliers' consecutive scans do not have a peak at an m/z that is in the range of the current weighted mean m/z. Then also filters traces where num_scans_with_peak_in_range / num_scans < min_sample_rate.

  • sample_rate: Uses 'min_sample_rate' only. Stops tracing and discards when num_scans_with_peak_in_range / num_scans < min_sample_rate.

jpfeuffer avatar Jun 18 '24 11:06 jpfeuffer

In your small example, the problem might be that there are not enough scans to the left of the apex. Please try to run the example on a larger area around that feature/trace. (untested conjecture)

jpfeuffer avatar Jun 18 '24 11:06 jpfeuffer

What happens if you keep "trace_termination_criterion" as "outlier" but reduce the number of outliers with "trace_termination_outliers" to zero? The algorithm is looking for peaks left and right which have outlier m/z values. As @jpfeuffer mentioned, the main intensity traces start at the very beginning of your peak map. No chance to find the default 5 outliers to the left.

axelwalter avatar Jun 18 '24 12:06 axelwalter

@xieyying Which of these files did you use in your example? I am so sorry, this is the correct link.

https://drive.google.com/drive/folders/1CZBRSU_hdhq7UHXdqJYQSrMUVcSfarGy?usp=drive_link

xieyying avatar Jun 18 '24 12:06 xieyying

What happens if you keep "trace_termination_criterion" as "outlier" but reduce the number of outliers with "trace_termination_outliers" to zero? The algorithm is looking for peaks left and right which have outlier m/z values. As @jpfeuffer mentioned, the main intensity traces start at the very beginning of your peak map. No chance to find the default 5 outliers to the left.

Combination of "outlier" and "zero" still can not provide correct result. Only min_sample_rate as 0.01 was added can generate the expected result.

xieyying avatar Jun 18 '24 12:06 xieyying

The issue is your peak map contains only data related to that simulated metabolite, but no noise at all (not realistic). Once you add some noise left and right it will work as expected.

axelwalter avatar Jun 18 '24 12:06 axelwalter

@axelwalter Ok. But can't this be disabled somehow with the correct snr and intensity settings?

jpfeuffer avatar Jun 18 '24 13:06 jpfeuffer

The issue is your peak map contains only data related to that simulated metabolite, but no noise at all (not realistic). Once you add some noise left and right it will work as expected.

Yes,I add some random noise and set a suitable noise threshold. The default values for the parameter of "trace_termination_outliers" worked well and I got my expected feature.

@axelwalter Ok. But can't this be disabled somehow with the correct snr and intensity settings?

Yes, when I set higher "noise_threshold_int", the situation goes back and the expected can not be found.

xieyying avatar Jun 18 '24 13:06 xieyying

I was actually referring to the chrom_peak_snr setting. Noise_threshold_int is just a simple minimum intensity cutoff.

There must be a way to find this trace without noise, otherwise I would consider it a bug @axelwalter

jpfeuffer avatar Jun 18 '24 17:06 jpfeuffer

So I generated my own test feature (Gaussian distribution) without noise and any peaks around it. Default settings work just fine. Not sure why that was not the case for @xieyying data. Perhaps because of the strong tailing.

axelwalter avatar Jun 19 '24 13:06 axelwalter

Hmm but the MassTraceDetection alone does not care about intensities other than for the weighted mz of the trace and the order in which it seeds. Maybe if it is so tailed that it does not pick up enough peaks to the left side? It must be really few though. I think it tries at least 5 scans in each direction. And if the min_sample_frequency is 0.5 this means 3 peaks need to be found to each side.

But glad that it works in general.

jpfeuffer avatar Jun 19 '24 15:06 jpfeuffer