Py-GC/MS data validation issues for PFAS polymers
Hi everyone,
My name is Lena and I’m from Umeå University. I’m trying to upload the first MassBank record for a PFAS polymer (PTFE), measured using Py-GC/MS in double-shot mode, but I’ve run into several validation issues:
- The validator does not seem to recognize Py-GC/MS as a valid instrument type or method. I’ve tried different variations like Pyrolysis GC/MS, Py-GC-MS, and GC-EI-MS, but all were rejected.
- Molecular formulas for polymers, like (C2F4)n or even -C2F4-, are not supported. If I provide just the monomer unit (C2F4), I can’t add a comment to explain that it represents the repeat unit. The same issue occurs with the exact mass: I calculated and entered the value for the monomer unit, but there was no way to clarify that in the record.
- The validator does not accept pyrolysis-specific metadata such as AC$PYROLYSIS, even though it’s important for this technique. Only MS and chromatography parameters seem to be allowed.
I’d really appreciate any guidance on how to proceed or whether support for polymer/Py-GC/MS data is planned.
Thanks!
PS: I am attaching the test file I created. MSBNK-UmU-UMU000001.txt
Hi Lena, Thanks for reaching out to us. I'm not aware that we have any reference data from Pyrolysis GC-MS in our data repo, so we need to figure out how we want to include that kind of data. We need to discuss this internally, and I will get back to you with suggestions. Of course, I will also need to include those fields in the validator then. Thank you also for the example you prepared. Best, René
Looking at the example record, I was blown away by the number of names given, and didn't know whether this is one compound or many ? The PubChem Link https://pubchem.ncbi.nlm.nih.gov/substance/9898279 points to a legacy record, and the compound there has a molecular formula C22H26N4O3S2. https://commonchemistry.cas.org/detail?cas_rn=9002-84-0 is more like it :-)
Also, with polymers you usually get a mass distribution, is that correct? So here we have the monomer mass. And then there are homo- and heteropolymers, where the chemical identifiers can get interesting. We do support extended SMILES to some extent. https://www.simolecule.com/cdkdepict/depict/bot/svg?smi=C(F)(F)C(F)(F)%20%7CSg%3An%3A0%2C7%3ASRU%3Aht%7C&w=-1&h=-1&abbr=on&hdisp=S&zoom=1.3&annotate=none&r=0
Coming back to PyGC, I see that there is a 12 second pyrolysis happening, and then there is a 14min GC gradient. Is it important at which GC retention time the resulting spectrum was measured ? And, if you had a heteropolymer, would that then result in two related spectra for one polymer for the different monomers ? Would they elute at different GC retention times ? Then we'd need to come up with two titles like "Teflon-C2F4 monomer" and "Teflon-N2F4 monomer" (and yes, that's totally impossible and made-up).
I'd like to avoid the AC$PYROLYSIS tag, since my inkling is that this could be considered part of the sample processing, and we might have many other techniques with descriptions, causing a proliferation of tags. What about going with an AC$SAMPLE: PYROLYSIS ... approach instead? Could you tell us whether anyone would actually search for those values? Is double-shot as important as, e.g., polarity in LC/MS? Or, if you measured a double-shot, would you be happy to have high spectral similarity to a triple-shot spectrum (if that exists ...)?
E.g.
AC$SAMPLE: PYROLYSIS_MODE Double-shot
AC$SAMPLE: PYROLYSIS_THERMAL_DEPOSITION 100 °C (1.00 min) → 20 °C/min → 300 °C (1.00 min), total 12.00 min
AC$SAMPLE: PYROLYSIS_TEMPERATURE 650 °C (0.2 min)
I am super-open to a better term for AC$SAMPLE, which feels too generic and overloaded.
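To make the proposal a bit more concrete, here is a minimal Python sketch of how such lines could be written and read back. It assumes the subtag names from the example above; none of this is part of the current MassBank record format or validator, and the function names `format_sample_lines` and `parse_sample_lines` are purely illustrative.

```python
PREFIX = "AC$SAMPLE: "  # proposed tag, not yet accepted by the validator


def format_sample_lines(params):
    """Render a dict of subtag -> value as 'AC$SAMPLE: SUBTAG value' lines."""
    return [f"{PREFIX}{subtag} {value}" for subtag, value in params.items()]


def parse_sample_lines(lines):
    """Collect PYROLYSIS_* subtags from record lines.

    A stray colon directly after the subtag name is tolerated.
    """
    found = {}
    for line in lines:
        if not line.startswith(PREFIX):
            continue
        subtag, _, value = line[len(PREFIX):].partition(" ")
        subtag = subtag.rstrip(":")
        if subtag.startswith("PYROLYSIS_"):
            found[subtag] = value.strip()
    return found


# Values taken from the example lines in this comment.
pyrolysis = {
    "PYROLYSIS_MODE": "Double-shot",
    "PYROLYSIS_THERMAL_DEPOSITION": (
        "100 °C (1.00 min) → 20 °C/min → 300 °C (1.00 min), total 12.00 min"
    ),
    "PYROLYSIS_TEMPERATURE": "650 °C (0.2 min)",
}
record_lines = format_sample_lines(pyrolysis)
print("\n".join(record_lines))
```

Keeping the values as free text means a search on, say, PYROLYSIS_MODE would be a plain string match, which is probably enough until we know whether anyone actually queries these fields.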
Yours, Steffen
Hi Steffen,
Thanks for catching that I inserted the wrong SID. The synonyms came from PubChem and are commercial names for PTFE (CAS 9002-84-0). PubChem has multiple substance records for the same polymer (SIDs 134991388, 471947371, and more), likely from different depositors. I’m not sure how PubChem validates or merges these, but there do seem to be many near-duplicates for PTFE. We could keep only the most relevant ones and move the rest to an annotation.
You’re right - the PubChem link points to a small-molecule example, which doesn’t fit well for a polymer. For PTFE we usually use the repeat unit notation (C(F)(F)C(F)(F)) rather than a fixed molecular formula.
Polymers do have a mass distribution, but with Py-GC/MS the molecular weight or number of monomer units doesn’t really matter because you get the same fingerprint and mass spectrum. I’ve done both single-shot and double-shot runs. I decided to upload the double-shot because it pyrolyzes the pure polymer without additives. During manufacturing, polymers can adsorb processing additives as well as unreacted oligomers, trimers, etc. In the first shot we remove these additives, and in the second shot we pyrolyze the pure polymer. I also tested two PTFE samples: one pure polymer (99%) and one Teflon from a non-stick oven pan, and obtained the same results. My plan is to upload separate MS records for the additives.
When it comes to pyrolysis metadata, I like your idea of putting pyrolysis details under AC$SAMPLE instead of making a new tag. Your format works well.
Yes, retention time matters. Retention time also varies with the GC column type and length. For PTFE we used the scan at 1.865 min, which is the apex of the monomer peak. For heteropolymers and more complex materials, the pyrolysate is a mixture: some monomer-derived fragments may be distinct, but many are shared or co-elute, and side-chain scission products and small oligomers can appear as well - so you don’t necessarily get one unique peak per monomer.
Happy to update the PTFE record so it shows the repeat unit, trims the synonym list, corrects the SID, and adds these AC$SAMPLE details if that works for you.
Best, Lena
FYI the "proper" PubChem page for this is https://pubchem.ncbi.nlm.nih.gov/compound/Polytetrafluoroethylene - yes PubChem SIDs are redundant, because we can get the same chemical from multiple different organizations. Normally these are combined by chemical structure into the compound page, but for polymers where it's not a discrete chemical, it's more complicated. If you need to use an SID, the best would be 481110317 which is the "raw" reference record but without other associated annotation that you see in that first URL.
Thanks @PaulThiessen - as a follow-up, is there any way currently that @meier-rene could validate that the SID provided is your preferred SID for records like this? Then we could try to include this in our record validation (and/or insert the preferred SID if not provided).
Thanks so much, @PaulThiessen! I’ll be using SID 481110317. Could you give me some advice on how to identify a “raw” reference in PubChem for polymers? For example, when I search for “polytetrafluoroethylene” I get 112 hits, and for “PTFE” I get 141 hits. This would be very helpful because I’m planning to upload MS data for other PFAS polymers as well.
Right now, for CH$FORMULA I used what’s listed for PTFE under SID 481110317, which is (C2F4)x. I think this is a good representation, but the formula didn’t pass validation. There’s always the option to use monomer units, but the issue is that for some polymers the backbone is connected to one or more end groups, so we can’t clearly show which part is the polymerized backbone.
For example: Poly(difluoromethylene), .alpha.-hydro-.omega.-(2,2-dichloro-2-fluoroethyl) (https://pubchem.ncbi.nlm.nih.gov/substance/472204870). Here, I would represent the molecular formula as H(CF2)xCH2CFCl2 or H-(CF2)x-CH2CFCl2, but that also doesn’t pass validation.
The validator is probably not designed to handle these special-case formulas yet, so don't worry about failing that check for now. This is something we can hopefully adjust with @meier-rene.
Do you know if there is a standardized way to check polymer formulas? @PaulThiessen, do you have one on your side that we could use to be consistent, or are you reliant on user contributions for these special cases?
I know currently the NORMAN-SLE provides some detailed formula annotation to PubChem from specific contributors for PFAS, so one approach could be to relax the formula check for the cases where polymers are provided, so that we can form a collection of polymer formulas, if there's not a standardized workflow to check them yet?
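As a starting point for discussion, a relaxed check could look roughly like the Python sketch below. It is only an assumption of how this might work, not the MassBank validator's actual logic: it accepts a plain formula or a single bracketed repeat unit followed by `n` or `x`, such as (C2F4)x, and computes the monoisotopic mass of that unit. Formulas with end groups, such as H(CF2)xCH2CFCl2, are deliberately rejected here and would need a richer grammar. All names (`check_formula`, `monoisotopic_mass`) are hypothetical.

```python
import re

# Monoisotopic masses (u), rounded; only the elements needed for this example.
MONOISOTOPIC = {"H": 1.007825, "C": 12.0, "F": 18.998403, "Cl": 34.968853}

ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")
PLAIN = re.compile(r"(?:[A-Z][a-z]?\d*)+")
# One bracketed repeat unit followed by n or x, e.g. (C2F4)n or (C2F4)x
POLYMER = re.compile(r"\(((?:[A-Z][a-z]?\d*)+)\)[nx]")


def element_counts(formula):
    """Count elements in a plain formula such as 'C2F4'."""
    if not PLAIN.fullmatch(formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = {}
    for symbol, number in ELEMENT.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def check_formula(formula):
    """Return (is_polymer, element counts of the formula or its repeat unit)."""
    match = POLYMER.fullmatch(formula)
    if match:
        return True, element_counts(match.group(1))
    return False, element_counts(formula)


def monoisotopic_mass(counts):
    """Sum monoisotopic masses for the given element counts."""
    return sum(MONOISOTOPIC[element] * n for element, n in counts.items())


is_polymer, unit = check_formula("(C2F4)x")
print(is_polymer, unit, round(monoisotopic_mass(unit), 4))
```

With a flag like `is_polymer`, a record could state explicitly that the formula and exact mass refer to the repeat unit rather than to a discrete molecule.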
@LadyBooo @schymane There are a couple of ways to do that. First, if you search for a term, e.g. https://pubchem.ncbi.nlm.nih.gov/#query=polytetrafluoroethylene, hopefully the "best match" will be the right one. If you click on that and it's not a normal compound page (meaning, it says "CID not available because it's not a discrete structure"), the PubChem Reference SID will be shown in the top box. Alternatively, you can see all reference SID matches by going to the substance tab, clicking "Filters" and selecting "PubChem Reference Collection": https://pubchem.ncbi.nlm.nih.gov/#query=polytetrafluoroethylene&tab=substance&sidsrcname=PubChem%20Reference%20Collection. In general, if you go to a substance page via SID, you can tell it's this "special" record if it says "Source - PubChem Reference Collection" in the top box.
Using PUG REST we could get only PubChem reference collection SIDs (not all options) with something like this:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/name/polytetrafluoroethylene/sids/JSON?sourcename=PubChem+Reference+Collection
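For use in a validation script, the same request could be wrapped roughly as follows. This is a sketch using only the Python standard library; the function names are made up for illustration, and the response layout (`IdentifierList` containing `SID`) is an assumption that should be checked against an actual response before relying on it.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def reference_sid_url(name, source="PubChem Reference Collection"):
    """Build the PUG REST URL shown above for a given substance name."""
    query = urlencode({"sourcename": source})  # spaces are encoded as '+'
    return f"{BASE}/substance/name/{quote(name)}/sids/JSON?{query}"


def extract_sids(payload):
    """Pull SIDs out of the parsed JSON.

    Assumes the layout {"IdentifierList": {"SID": [...]}}; returns an
    empty list if those keys are missing.
    """
    return payload.get("IdentifierList", {}).get("SID", [])


def fetch_reference_sids(name, timeout=30):
    """Query PubChem (network call) and return the list of reference SIDs."""
    with urlopen(reference_sid_url(name), timeout=timeout) as response:
        return extract_sids(json.load(response))


print(reference_sid_url("polytetrafluoroethylene"))
```

A record check could then call `fetch_reference_sids(...)` with the polymer name and compare the SID given in the record against the returned list.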
Hi, I find the PubChem SID discussion a bit off-track, since I don't expect that to be different from other MassBank records. We could open a separate issue for that if needed. Instead, I'd like to focus on the pyrolysis aspects of the record. The example provided needs the GC retention time, and then I'd need confirmation that we have all the experimental pyrolysis parameters needed.