ThermoRawFileParser
Parquet format: add polarity, and maybe change to long format
I miss the ionization mode in the parquet output. Did you consider not storing the intensities and masses as arrays, but exploding them instead? I find the data much easier to analyse when it is in long format. And because parquet compresses the data, it should not blow up the file size too much.
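To illustrate what "exploding" means here, a minimal pandas sketch (the column names `scan`, `rt`, `mz`, `intensity` are hypothetical, not the actual parquet schema):

```python
import pandas as pd

# Hypothetical nested layout: one row per scan, peak arrays in the cells.
df = pd.DataFrame({
    "scan": [1, 2],
    "rt": [0.5, 1.0],
    "mz": [[100.0, 200.0], [150.0]],
    "intensity": [[10.0, 20.0], [5.0]],
})

# Explode the paired arrays into long format: one row per (scan, mz, intensity).
long_df = df.explode(["mz", "intensity"], ignore_index=True)
```

Multi-column `explode` requires pandas >= 1.3; the scan-level columns (`scan`, `rt`, and potentially polarity) are repeated per peak, which parquet's run-length/dictionary encoding compresses well.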
Hi @soerendip I should have read this issue before replying to your other one, thanks for your input. @ypriverol, are you using the parquet format output? I think you're more qualified to answer this one. We could add another parquet output option or add yet another flag.
Hi @nielshulstaert, is it worth considering moving the format options into sub-commands? I.e. similarly to xic and query, we could introduce mzml, mgf, and parquet sub-commands to switch the format and substitute the -f flag. That might help with the growing number of flags. It would, without any doubt, be a significant change to the interface, but if properly "announced" it should be possible. What do you think?
I'm not sure what is best for the community. There is also the mzMLb format that was recently announced. The strength of the parquet format is its columnar layout, which I think is undermined when you compress the data into arrays stored in the cells. I will run some tests next week and get back to you.
I made a test with 12 small metabolomics files converted to different formats. For mzXML I used the parser from pyteomics and for mzMLb the one from pymzml. For parquet and feather I used Pandas and pyarrow. The orange bars show just reading the data into memory, and the blue ones reading plus converting into long format. Parquet and feather have almost the same read speed, and for feather the data is already in long format. I was puzzled that the mzMLb format takes so much time to read; maybe that is just an inefficient parser.
I wonder what the advantages of storing the m/z values and intensities as arrays are from your point of view. Are there ways of accessing the data that make it better? I wonder, in case you want to slice by m/z rather than by retention time, like extracting a peak over a period of time, whether the denser format would add much overhead. Or am I wrong? Did you do any benchmarking?
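Slicing by m/z is a one-liner on the long format. A sketch of extracting an ion chromatogram, assuming hypothetical columns `rt`, `mz`, `intensity` (with arrays in the cells, you would first have to unpack every array before filtering):

```python
import pandas as pd

# Minimal long-format table (hypothetical schema).
df = pd.DataFrame({
    "rt":        [0.1, 0.1, 0.2, 0.2, 0.3],
    "mz":        [100.05, 250.0, 100.04, 250.1, 100.06],
    "intensity": [10.0, 1.0, 12.0, 2.0, 8.0],
})

# Slice by an m/z window across all retention times, then sum
# intensities per scan to get an extracted ion chromatogram.
target, tol = 100.05, 0.1
xic = (df[(df["mz"] - target).abs() <= tol]
       .groupby("rt", as_index=False)["intensity"].sum())
```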
Hi @soerendip:
Sorry to arrive late for this discussion. My idea here:
1- We in PRIDE are using Avro and parquet for data handling, peptide evidence storage, and spectra.
2- While we are not using the current implementation, I think it would be great to continue developing the parquet version here to enable, in the future, the development of new algorithms and storage systems for fast spectrum retrieval. As @soerendip points out, mzMLb is a binary mzML file format, but no major application is using it yet. I see some major advantages in parquet associated with its columnar design.
3- @soerendip @nielshulstaert @caetera it would be great to discuss some use cases and advantages of the design of the parquet format for the ultimate design.
Sure.
BTW, you can find the files that I used. The originals were downloaded from https://www.ebi.ac.uk/metabolights/MTBLS1569/descriptors and the converted files can be downloaded from https://soerendip.com/dl/MTBLS1569/. I used the 12 files starting with T for the test.
Apparently, mzMLb can be faster if generated with a better compression type; for some reason the compression is zip in these files. There is another dependency that was not installed (hdf5plugin, I believe) which supposedly can make the files faster to read; however, it has not worked so far and I am not sure what is wrong. I also looked at the file sizes. That is where mzMLb is better than the older mz... formats.
Apparently, storing the data in long format blows up the parquet file size (compare parquet-Mint with parquet-TRR; TRR = ThermoRawFileReader). This is just the reading time, without formatting the data into long format.
And all of this was done with Python. It would be interesting to see how much faster other parsers are, e.g. from OpenMS or XCMS.