tidyms
exposing more flexibility in `_mzml._build_offset_list_non_indexed`
Hi @griquelme, first, thanks for writing `tidyms`; it's a great package. I have been searching for a usable Python library to build with.
The issue I am currently having is in parsing my mzML files. My LCMS stack has an Agilent LC (1290 I) and a Thermo MS (Orbitrap QE). As a consequence of this setup, the raw files and mzML files written by my system contain additional `spectrum` elements holding UV/VIS, PDA, and pressure data (in addition to the expected m/z and intensity data).
When `_mzml._build_offset_list_non_indexed` searches for spectrum offsets, the `spectrum_regex` also matches these non-MS spectra. As a result, the returned `spectrum_offset_list` contains too many elements. In my test data it returns 5891 offsets: the sum of the offsets for my recorded MS scans (2891) and the offsets for my PDA detector data (3000).
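A quick way to check where the extra offsets come from is to count `<spectrum>` elements per `controllerType`. This is a minimal sketch, not tidyms code: the helper name `count_spectra_by_controller_type` and the trimmed sample are hypothetical, and it assumes Thermo-style native IDs of the form `id="controllerType=<n> controllerNumber=<m> scan=<k>"`.

```python
import re
from collections import Counter

# Assumption: native IDs look like
#   id="controllerType=<n> controllerNumber=<m> scan=<k>"
# The \s after "<spectrum" keeps "<spectrumList ...>" from matching.
SPECTRUM_ID_REGEX = re.compile(r'<spectrum\s[^>]*\bid="controllerType=(\d+)')


def count_spectra_by_controller_type(mzml_text: str) -> Counter:
    """Count <spectrum> opening tags per controllerType value."""
    return Counter(
        int(match.group(1)) for match in SPECTRUM_ID_REGEX.finditer(mzml_text)
    )


# Hypothetical, heavily trimmed fragment: two MS spectra and one PDA spectrum.
sample = (
    '<spectrumList count="3">'
    '<spectrum index="0" id="controllerType=0 controllerNumber=1 scan=1" defaultArrayLength="2"></spectrum>'
    '<spectrum index="1" id="controllerType=0 controllerNumber=1 scan=2" defaultArrayLength="2"></spectrum>'
    '<spectrum index="2" id="controllerType=4 controllerNumber=1 scan=1" defaultArrayLength="2"></spectrum>'
    '</spectrumList>'
)

counts = count_spectra_by_controller_type(sample)
print(dict(counts))  # {0: 2, 4: 1}
```

If the explanation above is right, running this on the full text of the file in question should give 2891 for `controllerType=0` and 3000 for `controllerType=4`.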
The extra offsets cause an error when the data is read by `fileio.MSData.get_spectrum`. In my case, the error occurs because the parser doesn't find a string name for the wavelength data array from the PDA. Below is a printout of the data from the spectrum iterator. It fails when it encounters the `None` name for the PDA wavelength data.
{'ms_level': 1, 'polarity': 1, 'time': 960.1218000000001, 'mz': array([ 99.00540736, 99.00564096, 99.00587457, ..., 1515.13452207,
1515.14850722, 1515.16249256]), 'spint': array([0., 0., 0., ..., 0., 0., 0.]), 'is_centroid': False}
{'time': 0.40000000002, None: array([190., 192., 194., 196., 198., 200., 202., 204., 206., 208., 210.,
212., 214., 216., 218., ......]), 'spint': array([ 2.18286e+05, -4.90926e+05, -3.30683e+05, -3.09083e+05,
-5.18884e+05, -5.05399e+05, -2.69533e+05, -2.26293e+05,
-1.50526e+05, -1.45612e+05, -1.12733e+05, -1.12920e+05,
-8.19990e+04,..........
-1.27050e+04, -1.08800e+03, -1.57300e+04, 2.60700e+03,
-1.47880e+04, -8.19600e+03, 3.62600e+03, -7.22200e+03,
6.90000e+03, -1.40910e+04, -4.10000e+02, -9.73000e+03,
-1.78600e+03, 2.84300e+03]), 'is_centroid': False}
My current workaround is just to alter the `spectrum_regex` definition:
# spectrum_regex = re.compile("<spectrum .[^(><.)]+>")
spectrum_regex = re.compile(r'<spectrum index="[0-9]+" id="controllerType=0')
Helpfully, Thermo appears to write anything that isn't Thermo as `controllerType=4`, so I can just exclude the Agilent PDA data.
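The effect of the two patterns can be compared on a small string. This is a sketch only: the sample fragment and the helper `spectrum_offsets` are made up for illustration, and it searches a `str`, which may differ from how tidyms actually reads the file.

```python
import re

# Original pattern and the workaround pattern from this issue.
default_regex = re.compile("<spectrum .[^(><.)]+>")
ms_only_regex = re.compile(r'<spectrum index="[0-9]+" id="controllerType=0')


def spectrum_offsets(text, regex):
    """Return the start position of every match of `regex` in `text`."""
    return [match.start() for match in regex.finditer(text)]


# Hypothetical fragment: two MS spectra (controllerType=0) and one
# PDA spectrum (controllerType=4).
sample = (
    '<spectrum index="0" id="controllerType=0 controllerNumber=1 scan=1" defaultArrayLength="2"></spectrum>'
    '<spectrum index="1" id="controllerType=0 controllerNumber=1 scan=2" defaultArrayLength="2"></spectrum>'
    '<spectrum index="2" id="controllerType=4 controllerNumber=1 scan=1" defaultArrayLength="2"></spectrum>'
)

all_offsets = spectrum_offsets(sample, default_regex)
ms_offsets = spectrum_offsets(sample, ms_only_regex)
print(len(all_offsets), len(ms_offsets))  # 3 2
```

Both patterns match at the start of the opening tag, so the MS-only offsets are a subset of the original ones and still point at the same positions.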
Ultimately, it would help to be able to specify this regex, or to pass additional parameters, to avoid this situation. I am happy to submit a PR, but I need a little help figuring out how this should work for indexed and non-indexed mzML files. I am fairly unfamiliar with the mzML format.
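One possible shape for such an option, as a rough sketch for discussion rather than a proposal of the actual tidyms signature. The function `find_spectrum_offsets` and its parameters `spectrum_regex` and `controller_types` are hypothetical, and it works on an in-memory string instead of a file.

```python
import re

DEFAULT_SPECTRUM_REGEX = re.compile("<spectrum .[^(><.)]+>")
# Assumption: Thermo-style native IDs that start with controllerType=<n>.
CONTROLLER_TYPE_REGEX = re.compile(r'id="controllerType=(\d+)')


def find_spectrum_offsets(text, spectrum_regex=DEFAULT_SPECTRUM_REGEX,
                          controller_types=None):
    """Return offsets of <spectrum> tags, optionally filtered.

    spectrum_regex: pattern used to locate opening tags (caller may override).
    controller_types: iterable of allowed controllerType values, or None
        to keep every match (the current behaviour).
    """
    allowed = None if controller_types is None else set(controller_types)
    offsets = []
    for match in spectrum_regex.finditer(text):
        if allowed is not None:
            found = CONTROLLER_TYPE_REGEX.search(match.group(0))
            # Tags without a controllerType ID are kept, so files from
            # other vendors are not silently emptied by the filter.
            if found is not None and int(found.group(1)) not in allowed:
                continue
        offsets.append(match.start())
    return offsets


# Hypothetical fragment: two MS spectra, one PDA spectrum, and one
# spectrum with a non-Thermo style ID.
sample = (
    '<spectrum index="0" id="controllerType=0 controllerNumber=1 scan=1" defaultArrayLength="2"></spectrum>'
    '<spectrum index="1" id="controllerType=0 controllerNumber=1 scan=2" defaultArrayLength="2"></spectrum>'
    '<spectrum index="2" id="controllerType=4 controllerNumber=1 scan=1" defaultArrayLength="2"></spectrum>'
    '<spectrum index="3" id="scan=4" defaultArrayLength="2"></spectrum>'
)

print(len(find_spectrum_offsets(sample)))                        # 4
print(len(find_spectrum_offsets(sample, controller_types=[0])))  # 3
```

Keeping the default pattern and making the filter opt-in would leave existing behaviour unchanged for files that contain only MS spectra.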