exposing more flexibility in `_mzml._build_offset_list_non_indexed`

Open wdwvt1 opened this issue 1 year ago • 3 comments

Hi @griquelme - first, thanks for writing tidyms, it's a great package. I have been searching for a usable Python library to build on.

The issue I am currently having is in parsing my mzML files. My LCMS stack has an Agilent LC (1290 I) and a Thermo MS (Orbitrap QE). As a consequence of this setup, the raw files and mzML files written by my system contain additional spectrum elements holding UV/VIS, PDA, and pressure data (in addition to the expected m/z and intensity data).

When `_mzml._build_offset_list_non_indexed` searches for spectrum offsets, `spectrum_regex` also matches these non-MS spectra. As a result, the returned `spectrum_offset_list` contains too many elements. In my test data it returns 5891 offsets: the 2891 offsets for my recorded MS scans plus the 3000 offsets for my PDA detector data.

This causes an error when the data is read by `fileio.MSData.get_spectrum`. In my case, the error is produced because the parser doesn't find a string name for the wavelength data array from the PDA. Below is a printout of the data from the spectrum iterator. The first entry is a normal MS scan; the second is a PDA spectrum, where the parser fails on the `None` name for the wavelength data.

{'ms_level': 1, 'polarity': 1, 'time': 960.1218000000001, 'mz': array([  99.00540736,   99.00564096,   99.00587457, ..., 1515.13452207,
       1515.14850722, 1515.16249256]), 'spint': array([0., 0., 0., ..., 0., 0., 0.]), 'is_centroid': False}
{'time': 0.40000000002, None: array([190., 192., 194., 196., 198., 200., 202., 204., 206., 208., 210.,
       212., 214., 216., 218., ......]), 'spint': array([ 2.18286e+05, -4.90926e+05, -3.30683e+05, -3.09083e+05,
       -5.18884e+05, -5.05399e+05, -2.69533e+05, -2.26293e+05,
       -1.50526e+05, -1.45612e+05, -1.12733e+05, -1.12920e+05,
       -8.19990e+04,..........
       -1.27050e+04, -1.08800e+03, -1.57300e+04,  2.60700e+03,
       -1.47880e+04, -8.19600e+03,  3.62600e+03, -7.22200e+03,
        6.90000e+03, -1.40910e+04, -4.10000e+02, -9.73000e+03,
       -1.78600e+03,  2.84300e+03]), 'is_centroid': False}

My current workaround is just to alter the `spectrum_regex` definition:

# spectrum_regex = re.compile("<spectrum .[^(><.)]+>")
spectrum_regex = re.compile('<spectrum index="[0-9]+" id="controllerType=0')

Helpfully, Thermo appears to write anything that isn't Thermo as controllerType=4, so I can just exclude the Agilent PDA data.
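
To make the difference between the two patterns concrete, here is a minimal sketch that runs both against a made-up mzML fragment. The fragment and its attribute values are assumptions modeled on the id strings described above, not taken from a real file.

```python
import re

# Toy mzML fragment (assumed layout): two MS scans written with
# controllerType=0 and one PDA spectrum written with controllerType=4.
SAMPLE = (
    '<spectrumList count="3">\n'
    '<spectrum index="0" id="controllerType=0 controllerNumber=1 scan=1" defaultArrayLength="5">\n'
    '</spectrum>\n'
    '<spectrum index="1" id="controllerType=0 controllerNumber=1 scan=2" defaultArrayLength="5">\n'
    '</spectrum>\n'
    '<spectrum index="2" id="controllerType=4 controllerNumber=1 scan=1" defaultArrayLength="3">\n'
    '</spectrum>\n'
    '</spectrumList>\n'
)

# Pattern quoted above as the current definition: matches every opening
# <spectrum ...> tag, whatever detector produced it.
original_regex = re.compile("<spectrum .[^(><.)]+>")

# Workaround pattern: only matches spectra whose id starts with controllerType=0.
ms_only_regex = re.compile('<spectrum index="[0-9]+" id="controllerType=0')

all_offsets = [m.start() for m in original_regex.finditer(SAMPLE)]
ms_offsets = [m.start() for m in ms_only_regex.finditer(SAMPLE)]

print(len(all_offsets), "offsets with the original pattern")
print(len(ms_offsets), "offsets with the controllerType=0 pattern")
```

On this fragment the original pattern picks up the PDA spectrum as well, while the restricted pattern only returns the two MS scans.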

Ultimately, it would help to be able to specify this regex or pass additional parameters to avoid this situation. I am happy to submit a PR, but need a little help figuring out how this should work for both indexed and non-indexed mzML files. I am fairly unfamiliar with the mzML format.
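
As a rough sketch of what a configurable pattern could look like, here is a standalone helper that takes the regex as an argument. `find_spectrum_offsets` and `DEFAULT_SPECTRUM_REGEX` are hypothetical names, not part of tidyms, and this is not how `_build_offset_list_non_indexed` is actually implemented. It also reads the whole file into memory, which may not suit large files.

```python
from __future__ import annotations

import re

# Hypothetical default: a bytes version of the pattern quoted above.
DEFAULT_SPECTRUM_REGEX = re.compile(rb"<spectrum .[^(><.)]+>")


def find_spectrum_offsets(
    path: str,
    spectrum_regex: re.Pattern[bytes] = DEFAULT_SPECTRUM_REGEX,
) -> list[int]:
    """Return the byte offset of every spectrum tag matched by spectrum_regex.

    Illustrative only: the name and signature are assumptions, not tidyms API.
    """
    with open(path, "rb") as fin:
        data = fin.read()
    return [match.start() for match in spectrum_regex.finditer(data)]
```

A caller with mixed detector data could then pass something like `re.compile(rb'<spectrum index="[0-9]+" id="controllerType=0')` to keep only the MS scans, while everyone else keeps the default behavior.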

Mar 06 '23 21:03 wdwvt1
