ThermoRawFileParser

Is it possible to extract pressure profiles as csv file in batch?

Open AnStaes opened this issue 2 years ago • 8 comments

Hi,

With the --allDetectors option I can get the pressure profile of a raw file into the mzML file, but what I actually need is a csv file with the pressure values per RT. Is it possible to extract this in batch?

Thank you!

AnStaes avatar Mar 07 '22 14:03 AnStaes

Hi @AnStaes, thank you for using ThermoRawFileParser. At the moment you can only process raw files into mzML in batch and then extract the pressure profile from them. A pressure trace is represented as a chromatogram: two arrays of 64-bit floats (zlib-compressed by default, but you can use -z to disable compression) in base64 encoding. A feature to extract this data to some plain-text format can be added later.
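
For anyone who wants to decode such an array by hand, here is a minimal Python sketch of the encoding described above (base64 around optionally zlib-compressed little-endian floats). The function names are made up for illustration and are not part of ThermoRawFileParser; the encoder is only there to show the round trip.

```python
import base64
import struct
import zlib


def decode_binary_array(text, compressed=True, float_bits=64):
    """Decode the text of an mzML <binary> element into a list of floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    code, size = ("d", 8) if float_bits == 64 else ("f", 4)
    # mzML binary arrays are stored in little-endian byte order
    return list(struct.unpack("<%d%s" % (len(raw) // size, code), raw))


def encode_binary_array(values, compressed=True):
    """Inverse of decode_binary_array for 64-bit floats (illustration only)."""
    raw = struct.pack("<%dd" % len(values), *values)
    if compressed:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")
```

Pass compressed=False for files converted with -z.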

caetera avatar Mar 07 '22 20:03 caetera

Hi again, you are welcome to use the Python script (see the gist link below) that extracts the pressure chromatograms from mzML files; it can be invoked with a list of mzML files. Apart from standard libraries, it depends on pandas (CSV output), numpy (converting the binary buffer to a numeric array), and lxml (obviously). It is a bit sketchy; feel free to adapt it to your needs if necessary.

https://gist.github.com/caetera/0921b33f0c6201a538436906cc965fff
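
If installing those dependencies is a problem, the same idea can be sketched with the Python standard library only. This is not the gist's code, just a rough outline under a few assumptions: chromatograms of interest have "pressure" in their id or in one of their direct cvParam names, cvParams are written directly on each binaryDataArray, and the array that is not the time array holds the values. The function names and the CSV column names are arbitrary.

```python
import base64
import csv
import struct
import xml.etree.ElementTree as ET
import zlib

NS = "{http://psi.hupo.org/ms/mzml}"
TIME_ARRAY = "MS:1000595"  # PSI-MS: time array
ZLIB = "MS:1000574"        # PSI-MS: zlib compression
FLOAT32 = "MS:1000521"     # PSI-MS: 32-bit float (64-bit assumed otherwise)


def decode_array(array_el):
    """Return (is_time_axis, values) for one <binaryDataArray> element."""
    accessions = {p.get("accession") for p in array_el.findall(NS + "cvParam")}
    raw = base64.b64decode(array_el.findtext(NS + "binary") or "")
    if raw and ZLIB in accessions:
        raw = zlib.decompress(raw)
    code, size = ("f", 4) if FLOAT32 in accessions else ("d", 8)
    values = struct.unpack("<%d%s" % (len(raw) // size, code), raw)
    return TIME_ARRAY in accessions, list(values)


def pressure_traces(mzml_path, keyword="pressure"):
    """Yield (id, times, values) for chromatograms labelled with keyword."""
    root = ET.parse(mzml_path).getroot()
    for chrom in root.iter(NS + "chromatogram"):
        labels = [chrom.get("id", "")]
        labels += [p.get("name", "") for p in chrom.findall(NS + "cvParam")]
        if not any(keyword.lower() in label.lower() for label in labels):
            continue
        times, values = [], []
        for array_el in chrom.iter(NS + "binaryDataArray"):
            is_time, data = decode_array(array_el)
            if is_time:
                times = data
            else:
                values = data
        yield chrom.get("id", ""), times, values


def write_csv(mzml_path, csv_path):
    """Write all matching traces of one mzML file as long-format CSV rows."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["chromatogram", "time", "value"])
        for chrom_id, times, values in pressure_traces(mzml_path):
            for t, v in zip(times, values):
                writer.writerow([chrom_id, t, v])
```

For batch use, call write_csv once per mzML file, for example in a loop over sys.argv[1:].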

caetera avatar Mar 08 '22 19:03 caetera

Hi @caetera - I am trying to use the gist you shared to do some processing of my own, but I am having trouble getting it to work.

If I use FreeStyle I can extract the pressure curves from the samples, so I know this information must be in the raw files. My process has been to convert the files to .mzML using either the GUI or the command line version. I specify the 'All Detectors' option (or the -a flag). The resulting .mzML data has binary64 arrays for my PDA data and for the MS scans themselves, but not for the pressure curves.

I have tried modifying line 19 to include other accession numbers (based on these), but to no avail.

My setup is a QE with an Agilent 1290 LC stack. I have included a metadata file from the sample I am testing, in case that is helpful.

Any help would be greatly appreciated!

s001-metadata.txt

wdwvt1 avatar Nov 17 '22 03:11 wdwvt1

Hi @wdwvt1,

Thank you for using TRFP. Unfortunately, there is no special way to identify a pressure trace in RAW files; it might be written as a so-called A/D trace, a UV trace, or a PDA trace. It seems different LC systems do it differently. TRFP should extract all UV and PDA chromatograms, and also A/D traces that contain "pressure" in the name. The rationale was that as long as we don't know what sort of data is stored in A/D, we don't want to produce potentially misleading results. I do not have a lot of experience with Agilent systems in that context.

What you can try first is to drop the cvParam filter in the script (the first xpath expression). That will extract all chromatograms present in the file; they should be named similarly to what you can see in FreeStyle. If you find the necessary ones, you can later substitute the cvParam filter with a name filter. Please note that UV and PDA chromatograms will be written as absorbance chromatograms (with the corresponding CV term), even if they in fact contain pressure values.

Otherwise, you can inspect the mzML file itself: all chromatograms are located at the end, in chromatogram elements, and you should be able to identify the necessary ones by a name similar to the one visible in FreeStyle.

Finally, you can write here what is the Detector type and Name (both are important) that are visible in FreeStyle, with this information I can check if TRFP needs to have a more sophisticated filter to extract them all. Alternatively, you can share a RAW file, pretty much any, as long as it is representative of the system, and I can inspect it myself.
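
As an illustration of the "no filter first, name filter later" approach, here is a small Python sketch that lists every chromatogram in an mzML file and then narrows the list by name. It uses only the standard library; the function names are hypothetical, and it assumes the cvParam elements sit directly under each chromatogram element.

```python
import xml.etree.ElementTree as ET

MZML_NS = "{http://psi.hupo.org/ms/mzml}"


def list_chromatograms(mzml_path):
    """Return (id, [cvParam names]) for every chromatogram, unfiltered."""
    found = []
    root = ET.parse(mzml_path).getroot()
    for chrom in root.iter(MZML_NS + "chromatogram"):
        names = [p.get("name", "") for p in chrom.findall(MZML_NS + "cvParam")]
        found.append((chrom.get("id", ""), names))
    return found


def filter_by_name(chromatograms, keyword):
    """Keep only the entries whose id contains keyword (case-insensitive)."""
    return [c for c in chromatograms if keyword.lower() in c[0].lower()]
```

Printing the result of list_chromatograms for one file should show whether anything resembling the FreeStyle trace names made it into the mzML at all.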

caetera avatar Nov 17 '22 11:11 caetera

Hi @caetera - thanks very much for the help! I did a little more digging and here's what I've found.

I extracted the raw file using TRFP as follows (the logging level made no difference, just making sure)

mono ThermoRawFileParser.exe -i /Users/wdwvt/Desktop/s001.raw -z -a -l 1
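
To run the same conversion over a whole folder, the command above can be wrapped in a short Python script. This is only a sketch: it uses the flags already mentioned in this thread (-i, -a, -z), and the function names and the default executable path are placeholders to adapt.

```python
import pathlib
import subprocess


def build_commands(raw_dir, parser_exe="ThermoRawFileParser.exe", compress=True):
    """Build one TRFP command line per .raw file found in raw_dir."""
    commands = []
    # note: the pattern is case-sensitive on Linux and macOS
    for raw_file in sorted(pathlib.Path(raw_dir).glob("*.raw")):
        cmd = ["mono", parser_exe, "-i", str(raw_file), "-a"]
        if not compress:
            cmd.append("-z")  # disable zlib compression of binary arrays
        commands.append(cmd)
    return commands


def convert_all(raw_dir, **kwargs):
    """Run the conversions one after another, stopping at the first failure."""
    for cmd in build_commands(raw_dir, **kwargs):
        subprocess.run(cmd, check=True)
```

Building the command lists separately from running them makes it easy to print and check them before starting a long batch.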

Using the etree parser I've looked at the output a bit. Depending on what kind of data element I look at (//chromatogram/, //chromatogramList/, //spectrum/, //spectrumList/) I get different arrays of data. I can identify the m/z array, m/z intensity array, various scan timing arrays (for my UV detector, PDA, and the MS). I believe the PDA and UV absorbance data are also in there, though I don't fully understand that output.

I can't find any sign of the pressure data, though I can't be sure I am examining the right elements. If I just extract the tags of all elements I get (I've removed the prefix {http://psi.hupo.org/ms/mzml} from all of them):

analyzer
binary
binaryDataArray
binaryDataArrayList
chromatogram
chromatogramList
componentList
cv
cvList
cvParam
dataProcessing
dataProcessingList
detector
fileChecksum
fileContent
fileDescription
index
indexList
indexListOffset
indexedmzML
instrumentConfiguration
instrumentConfigurationList
mzML
offset
processingMethod
referenceableParamGroup
referenceableParamGroupList
referenceableParamGroupRef
run
scan
scanList
scanWindow
scanWindowList
software
softwareList
softwareRef
source
sourceFile
sourceFileList
spectrum
spectrumList
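
For reference, a tag listing like the one above can be produced along these lines. It is a sketch using the standard-library ElementTree rather than the exact code used here; the function name is arbitrary.

```python
import xml.etree.ElementTree as ET


def unique_tags(mzml_path, namespace="http://psi.hupo.org/ms/mzml"):
    """Return the sorted set of element tags with the namespace prefix removed."""
    prefix = "{%s}" % namespace
    tags = set()
    for _, element in ET.iterparse(mzml_path, events=("end",)):
        tag = element.tag
        tags.add(tag[len(prefix):] if tag.startswith(prefix) else tag)
        element.clear()  # drop text and children to keep memory use low
    return sorted(tags)
```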

I am guessing that you were right when you suggested the .raw file has the pressure information in it in a way that is not visible to either the parser or writer of TRFP. Could it be that the block of code on line 954 of ThermoRawFileParser/Writer/MzMlSpectrumWriter.cs isn't sufficiently flexible?

> Finally, you can write here what is the Detector type and Name (both are important) that are visible in FreeStyle, with this information I can check if TRFP needs to have a more sophisticated filter to extract them all.

The detector and trace types of interest in FreeStyle for the pump are as follows.

Detector Type	Agilent1290 G4220 Bin Pump 1
Trace Type	1 - Flow Rate (ul/min)
		2 - Composition A (%)
		...
		8 - Pressure (bar)
		...

There are 10 Trace Types (all in the format above int - str).

Thanks again for your help. I would be happy to contribute to the repo/create something to more flexibly parse these files if you can help me understand them. I can also send you the .raw file if that will help.

wdwvt1 avatar Nov 23 '22 00:11 wdwvt1

Hi @wdwvt1, thank you for the investigation. From what you reported, it looks like Agilent creates a new so-called device in the raw file. Since there are only a few device types to select from (for example, MS, PDA, UV, and nothing like pump or Agilent), I assume we need to look into the Other or, maybe, the MSAnalog category. TRFP ignores these two categories. It is not easy to give precise instructions on how to implement it, since it might require a try, fail, and learn approach. The code that outputs all non-MS chromatograms, including pressures, can be seen here: https://github.com/compomics/ThermoRawFileParser/blob/e338347f7250dc9e0727f7dda1d398ca37a15312/Writer/MzMlSpectrumWriter.cs#L863-L1004
Basically, it checks the selection of Device types and then iterates through the channels (not all devices have them though) collecting all channels that make some sense.

It all might look too convoluted, but you can share a representative file with me using the link below and I will see what can be done https://filesender.deic.dk/?s=upload&vid=61fe1192-a050-6aac-e76a-9de9432058a0

To the best of my knowledge, FreeStyle uses the same "engine" as TRFP under the hood, thus, what is visible in FreeStyle should be (in principle) accessible for TRFP.

caetera avatar Nov 24 '22 17:11 caetera

I just uploaded the .raw file.

I feel very stupid, but there is one thing I forgot to mention that is probably important. Using standard FreeStyle, we could not extract the pressure chromatograms (or anything) from the Agilent LC system. We could extract the PDA/UV information and MS information only. We had to purchase a "SII 3rd party license" from Thermo (the download was listed as SII: INSTRUMENT CLASS 3 CONTROL PACKAG). For files generated prior to the installation of this software package, I can still extract the pressure information using FreeStyle, suggesting that the information is written regardless and that the software package just contains the code necessary for extraction/display.

I am unfamiliar with C#, but have begun messing around with the files. Am I correct in understanding that in the block you referenced, I should be modifying the catch-all on line 954 so that it will allow Analog, MSAnalog and Other? It also seems that line 964 should be modified since it's not clear that the channel will be correctly labeled "pressure" (or since it comes out in FreeStyle that way, it actually will be labeled with pressure as long as the correct device can be found)?

https://github.com/compomics/ThermoRawFileParser/blob/e338347f7250dc9e0727f7dda1d398ca37a15312/Writer/MzMlSpectrumWriter.cs#L954-L1007

Thanks again for all your help!

wdwvt1 avatar Nov 25 '22 20:11 wdwvt1

File received. Some good news, I think all the data you want is "visible" without any special packages added to the system. It has to be obtained in a significantly different way, however.

The file you provided reports the following number of devices:

MS - 1
MSAnalog - 0
Analog - 0
UV - 1
Pda - 1
Other - 4
None - 4

More information on the devices in Other (some explanations below):

Other device 1 info
Name - Agilent1290 G4226 HiP-ALS
Model - A1
SerialNumber - A1
SoftwareVersion - 3.2.0 SP2
HardwareVersion - A1
ChannelLabels - System.String[]
Units - Volts
Flags -
AxisLabelX -
AxisLabelY -
IsValid - True
HasAccurateMassPrecursors - False

Other device 1 log items number: 1

Other device 1 log item names
dVolume (ml);VialNumber;VialNumber;VialNumber;VialNumber;Not Ready Code;Temperature Control On;Temperature (C)

Other device 1 log item plotable
dVolume (ml);VialNumber;VialNumber;VialNumber;VialNumber;Not Ready Code;Temperature Control On;Temperature (C)


Other device 2 info
Name - Agilent1290 G4212 DAD
Model - A1
SerialNumber - A1
SoftwareVersion - 3.2.0 SP2
HardwareVersion - A1
ChannelLabels - System.String[]
Units - Volts
Flags -
AxisLabelX -
AxisLabelY -
IsValid - True
HasAccurateMassPrecursors - False

Other device 2 log items number: 1200

Other device 2 log item names
General Lamp State;UV Lamp State;Vis Lamp State;Acquisition State;Free Buffer Size;Used Buffer Size;Not Ready Code

Other device 2 log item plotable
General Lamp State;UV Lamp State;Vis Lamp State;Acquisition State;Free Buffer Size;Used Buffer Size;Not Ready Code


Other device 3 info
Name - Agilent1290 G1316 TCC
Model - A1
SerialNumber - A1
SoftwareVersion - 3.2.0 SP2
HardwareVersion - A1
ChannelLabels - System.String[]
Units - Volts
Flags -
AxisLabelX -
AxisLabelY -
IsValid - True
HasAccurateMassPrecursors - False

Other device 3 log items number: 1199

Other device 3 log item names
Left Termperature (C);Right Temperature (C);Valve Position;Oven State;Not Ready Code

Other device 3 log item plotable
Left Termperature (C);Right Temperature (C);Valve Position;Oven State;Not Ready Code


Other device 4 info
Name - Agilent1290 G4220 Bin Pump 1
Model - A1
SerialNumber - A1
SoftwareVersion - 3.2.0 SP2
HardwareVersion - A1
ChannelLabels - System.String[]
Units - Volts
Flags -
AxisLabelX -
AxisLabelY -
IsValid - True
HasAccurateMassPrecursors - False

Other device 4 log items number: 1199

Other device 4 log item names
Flow Rate (ul/min);Composition A (%);Composition B (%);Composition C (%);Composition D (%);Hi Pressure (bar);Pump Mode;Pressure (bar);Pump Type;Not Ready Code

Other device 4 log item plotable
Flow Rate (ul/min);Composition A (%);Composition B (%);Composition C (%);Composition D (%);Hi Pressure (bar);Pump Mode;Pressure (bar);Pump Type;Not Ready Code

According to the docs, Other devices provide (at most) two data types: instrument info and status log. Instrument info (obtained through _rawFile.GetInstrumentData()) contains (as seen above) some basic data, such as name, version, etc.

The status log is, in a nutshell, an array of structured objects. The structure of an individual object can be obtained with _rawFile.GetStatusLogHeaderInformation(); it contains (among other things) the data type, byte size, and name of every log item. Some of the log items are plottable; the list of these can be obtained through the _rawFile.StatusLogPlottableData property, which returns a dictionary with log item names as keys and the integer indices of the log elements as values. The index can be passed to the _rawFile.GetStatusLogAtPosition method, which returns two arrays: the first one (double type) holds the times, and the second one (string type) holds the values of the corresponding log element throughout the log. For plottable log elements one should be able to convert all string values in the data array to double values.

Another option to access log elements is to iterate through the log entries (the total number of them is returned by _rawFile.GetStatusLogEntriesCount()); each entry has a Time property and a Values property. The latter is an array (object type) of raw log entry values. The properties of the objects are stored in the HeaderInformation (see above). Thus, knowing the index of the log element, one can get the raw log values.

Thus, to include the pressure traces, one needs to obtain them from the log entries by either of the two methods above (it is, however, to be determined which one is better), convert them if necessary to two double arrays, and use these to create a new chromatogram object. The latter needs to be added to the list of chromatograms in the mzML file object.
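
The conversion step can be illustrated independently of the raw file API. The Python sketch below does not call the Thermo library at all; it models the plottable-items dictionary and the (times, string values) pair as plain Python objects, and the function names and the example indices are hypothetical.

```python
def select_plottable(plottable, keyword="pressure"):
    """Pick the {name: index} entries whose name contains keyword."""
    return {name: index for name, index in plottable.items()
            if keyword.lower() in name.lower()}


def log_to_trace(times, raw_values):
    """Convert status-log strings to two float lists, skipping unparsable entries."""
    trace_times, trace_values = [], []
    for time, text in zip(times, raw_values):
        try:
            value = float(text)
        except (TypeError, ValueError):
            continue  # e.g. an empty or non-numeric log value
        trace_times.append(float(time))
        trace_values.append(value)
    return trace_times, trace_values


# Example with made-up indices and values
items = {"Flow Rate (ul/min)": 0, "Hi Pressure (bar)": 5, "Pressure (bar)": 7}
wanted = select_plottable(items)
times, pressures = log_to_trace([0.0, 0.1, 0.2], ["101.5", "", "102"])
```

Note that a plain "pressure" keyword also matches "Hi Pressure (bar)", so a stricter match may be needed. Whether skipping unparsable entries is the right policy (rather than failing) is an open design choice.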

I will try to formulate all this into code in the future (still need to think how to do it in the least redundant and the most general way), but contributions are always welcome.

Also need to check what sorts of chromatograms are allowed in the mzML format (i.e. is it possible to include other plottable items, such as flow rates and so on).

As a side note, I have never seen None devices before; when trying to access any of them, everything just crashes. I wonder if they might contain some additional vendor-specific data (accessible only with the plug-in you mentioned).

caetera avatar Nov 26 '22 00:11 caetera