pyteomics icon indicating copy to clipboard operation
pyteomics copied to clipboard

Should we use psims.obo to better control param value types in PSI formats?

Open levitsky opened this issue 9 months ago • 1 comments

I came across a corner case when parsing an MzIdentML file which has spectrum title filled in with bare scan numbers (integers):

<cvParam name="spectrum title" value="34" cvRef="PSI-MS" accession="MS:1000796" />

The expected value type for this parameter would of course still be str, but the current logic in XML._handle_param only looks at the type attribute, which is not there:

https://github.com/levitsky/pyteomics/blob/24a2037665e929adace6f1491bd1ff1721362bf5/pyteomics/xml.py#L363-L369

Because float conversion succeeds in this case, spectrum title is returned as a float.

I don't know how many of such corner cases exist with other cvParams, but one way to fix this more or less systematically would be to invoke the PSI-MS ontology. Would it be wise to add psims as an extra dependency for [XML] and use OBOCache and OBOParser to look up terms there in _handle_param (at least when type attribute is not provided)?

@mobiusklein your input would be much appreciated. I feel like we could have discussed it at some point but I couldn't find it except for a discussion related to unit awareness.

levitsky avatar Mar 07 '25 12:03 levitsky

There are an uncountable number of cases like this. All it takes is for a parameter whose value may be any string to be a string that happens to be a number.

When I wrote XSD type parsing into psims, I had this idea in mind, but I don't remember why I didn't go for it way back in the PR you linked.

This is trickier though because type coercion from a string has a certain hierarchy (e.g. float is more stringent than int, int more stringent than str). The OBO file does not express this hierarchy, so psims would need to handle that, and it currently does not. This is doable, I'd just need an hour to figure out the ways it can go horribly wrong, test it, and make sure a fallback strategy is in place.

This can also introduce wide-spread breaking changes. For instance, consider preset scan configuration, whose value type is a string. Routinely the value is just an integer, since it corresponds to a "Scan Event" ID as defined by Thermo, or a "Scan Function" number as defined by Waters, the way existing code expects it to be parsed is as a float. It's even in the test data: https://github.com/levitsky/pyteomics/blob/24a2037665e929adace6f1491bd1ff1721362bf5/tests/data.py#L1332-L1339

This isn't quite accurate since it's now a float, not an integer, but because float and int are compare-able, we can write config == 1 and be understood correctly. If we make this change, now config is a string, and config == 1 is always going to be False and we would only know this happened because of downstream code breaking later on. I do not know how many people would be impacted by this.

We could make it opt-in at a reader instance level, instead of forcing a breaking change, of course.

mobiusklein avatar Mar 08 '25 19:03 mobiusklein

@mobiusklein Thank you for the input. I'm not sure I understand the part about type conversion hierarchy, this seems like it would be needed if we want to increase the amount of guesswork done when handling those values, whereas I am hoping to limit it. What is the conversion logic you have in mind? I was thinking that we could keep the current behavior as default but first check if the parameter name is present in PSI-MS and if it has has_value_type defined (or its parents do, I'm not sure if this info is inherited automatically or should be inherited at all). If it does, we would just try to convert to the corresponding type and fall back to string, or something along those lines.

I feel like a breaking change would be warranted because otherwise the fix might just not reach that far. The erroneous behavior we have now is also manifested in third-party code where it potentially breaks things, e.g. the current case I have I noticed in a piece of software that was using psm_utils, which was returning PSM objects with floats for spectrum IDs. If we make the fix opt-in, getting the correct behavior in the end-user software would require fixing all three projects instead of one. My own case might skew my point of view a lot here, of course.

levitsky avatar Mar 10 '25 12:03 levitsky

What I mean by "type hierarchies" is that a small number of parameters can have multiple types. Admittedly, 100% of them are found in formats that aren't parsed by pyteomics. Right now, the "canonical cascade" of value type conversion would be int -> float -> str (or more generically, json.loads), in ascending order of strictness about content. If we had to handle those parameters which could be multiple types, we would need to infer the order in which to try to convert to which type in order to avoid arbitrarily choosing to convert to str, having it work "always", and then never trying the other types. Conversely, such parameters would essentially never be able to have numeral-only strings without some kind of escaping.

If we ignore those parameters, then what you're describing can work out-of-the-box already. psims handles that already.

mobiusklein avatar Mar 11 '25 00:03 mobiusklein