SSSOM-Py inconsistency with types
(This is somewhat related to #630, but is a much larger problem.)
SSSOM-Py is woefully inconsistent about the types of the values one may find in a MappingSetDataFrame.
Depending on whether the slot is in the “metadata” part or the “data frame” part of the MSDF, and on whether the MSDF was parsed from a TSV file, a JSON file, or an RDF file, the same value can have completely different types, as shown in the following table:
| Slot type | TSV (metadata) | TSV (data frame) | JSON (metadata) | JSON (data frame) | RDF (metadata) | RDF (data frame) |
|---|---|---|---|---|---|---|
| date | `datetime.date` | `str` | `str` | `str` | `str` | `str` |
| enum (e.g. entity type) | `str` | `PermissibleValueText` | `EntityTypeEnum` | `PermissibleValueText` | `EntityTypeEnum` | `PermissibleValueText` |
| entity reference | `str` | `EntityReference` | `EntityReference` | `EntityReference` | `EntityReference` | `EntityReference` |
Admittedly the entity reference case is a minor issue, because at least EntityReference is a subtype of str and the two types are therefore compatible. In particular, one can directly compare a str-typed entity reference value and an EntityReference-typed value, and get the expected result:
```python
>>> from sssom_schema import EntityReference
>>> er_as_str = "FBbt:1234"
>>> er_as_EntityReference = EntityReference("FBbt:1234")
>>> er_as_str == er_as_EntityReference
True
```
So that one is merely a small annoyance, fine.
But the other two cases are much more problematic.
The “date” case should be obvious. If I have a date-typed slot, I expect to be able to manipulate its values as date objects. I do not expect to have to call datetime.strptime or similar to turn the value into an actual date. And I expect to be able to compare two “date-typed” values regardless of whether the value comes from the metadata or from the data frame, and regardless of which format the mapping set data frame was parsed from in the first place.
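A quick illustration of why this matters (the slot value below is made up for illustration): a `datetime.date` coming from one part of the MSDF and an ISO-formatted string coming from another never compare equal, even when they denote the same day.

```python
import datetime

# The same "mapping_date" value, in the two types it can currently come back as:
date_from_metadata = datetime.date(2024, 1, 15)  # e.g. from the TSV metadata
date_from_frame = "2024-01-15"                   # e.g. from the data frame

# A date and a string never compare equal, whatever they denote
print(date_from_metadata == date_from_frame)  # False
```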
The “enum” case is almost comical. Not only do we have three different types of values (str, EntityTypeEnum – which is the LinkML-generated object normally supposed to represent values of the entity_type_enum –, and PermissibleValueText – which is the generic LinkML type to represent any enum value), but all three types are completely incompatible with each other. If you have three variables representing the same enum value in all three different types, all variables will be considered to be different.
Slightly related, there is also the fact that a multi-valued slot is represented as a list in the metadata part of a MSDF, but as a string containing |-separated values in the data frame part…
The use of | to separate values should be nothing more than a serialisation detail of the SSSOM/TSV format. Once the MSDF has been parsed out of a SSSOM/TSV file, there should no longer be any trace of that detail. A Pandas data frame can perfectly well contain lists, so there is no reason to keep multi-valued slots as flat strings within the data frame.
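As a sketch of what that could look like (the column name and values below are invented for illustration), splitting on `|` once at parse time leaves proper Python lists in the frame:

```python
import pandas as pd

# Hypothetical data frame with a multi-valued slot still in its serialised form
df = pd.DataFrame({"author_id": ["orcid:0001|orcid:0002", "orcid:0003"]})

# Split on the SSSOM/TSV separator once, at parse time; the frame then holds lists
df["author_id"] = df["author_id"].str.split("|")

print(df["author_id"].tolist())  # [['orcid:0001', 'orcid:0002'], ['orcid:0003']]
```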
So I would suggest:
Since the entire concept of a “Mapping Set Data Frame” is already quite remote from the LinkML-derived MappingSet and Mapping classes, I’d propose we go all the way in cutting any link between a MSDF and the LinkML model, and instead always use standard Python types everywhere.
That is, regardless of which part of the MSDF we are in (metadata or data frame) and of which format the MSDF has been parsed from:
- a date-typed slot is always represented as a `datetime.date` object (not as a string);
- an enum-typed slot is always represented as a plain `str` (not as a LinkML `*Enum` or `PermissibleValueText`);
- an entity-reference slot is always represented as a plain `str` (not as a LinkML `EntityReference`).
And regarding the comment just above:
- a multi-valued slot is always represented as a normal `list` (not as a `str` with `|`-separated values).
@matentzn @cthoyt I’d like to get your opinion on this.
> Since the entire concept of a “Mapping Set Data Frame” is already quite remote from the LinkML-derived `MappingSet` and `Mapping` classes, I’d propose we go all the way in cutting any link between a MSDF and the LinkML model, and instead always use standard Python types everywhere.
agreed! the LinkML model is weirdly in the middle of MSDF loading, which serves the purposes of validation, but it's not super efficient
> That is, regardless of which part of the MSDF we are in (metadata or data frame) and of which format the MSDF has been parsed from:
>
> * a date-typed slot is always represented as a `datetime.date` object (_not_ as a string);
agreed
> * an enum-typed slot is always represented as a plain `str` (_not_ as a LinkML `*Enum` or `PermissibleValueText`);
agreed
> * an entity-reference slot is always represented as a plain `str` (_not_ as a LinkML `EntityReference`).
yes this makes sense. Alternatively, if it's a slot that should be a curie, a curies.Reference could also work
> And regarding the comment just above:
>
> * a multi-valued slot is always represented as a normal `list` (_not_ as a `str` with `|`-separated values).
enthusiastic agree. I've commented on this elsewhere
I agree with the proposed design - I am just a bit unsure why there is no way to configure the LinkML schema in a way that allows us to do this - it seems like your proposal should be the default for all LinkML schemas. Maybe there are different versions of data classes in use or something.
> I am just a bit unsure why there is no way to configure the LinkML schema in a way that allows us to do this - it seems like your proposal should be the default for all LinkML schemas.
There are two different issues here: LinkML’s inherent behaviours, and the way SSSOM-Py is using LinkML to build a “MappingSetDataFrame”.
LinkML’s inherent behaviours
Date values
A slot defined in LinkML as being date-typed is represented as a string (more precisely, as an XSDDate object, which is string-based). This is by design, not a bug. Look at the code for the XSDDate class:
- the constructor always returns a `str` value (even if it has been given a `datetime.date` object);
- the validation method works by comparing a `str` value against a regular expression (again, even if it is provided with a `datetime.date` object – in which case it will turn it into a string so that it can “validate” it).
If the code was not enough, the specification for LinkML’s meta-model is clear that date-typed slots are supposed to be represented as string values (`repr: str`, where `repr` is “the name of the python object that implements this type definition”).
From LinkML’s point of view, there is no bug here, and nothing to improve. This is all working exactly as designed.
So for the “a date-typed slot is always represented as a `datetime.date` object” proposal, LinkML developers have already thought about that and deliberately said “no, representing date values as strings is the right thing to do.”
Enum values
(Using the case of the predicate_modifier_enum enum here. This of course applies to any other enum accordingly.)
The normal, LinkML-intended way to represent a predicate_modifier_enum-typed value is as a PredicateModifierEnum object, which can be obtained by calling the PredicateModifierEnum constructor with the string value of the enum:
```python
from sssom_schema import PredicateModifierEnum

modifier = PredicateModifierEnum("Not")
```
That object cannot be compared to a string:
```python
>>> modifier == "Not"
False
```
Alright, fair enough; after all, normal Python enum values cannot be directly compared to strings either.
But it also cannot be compared to the PermissibleValue objects listed in the PredicateModifierEnum class:
```python
>>> modifier == PredicateModifierEnum.Not
False
```
Which is not how a normal Python enum would behave, nor how most (if not all) Python developers would expect an enum to behave.
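For contrast, here is how a standard-library enum behaves (a toy enum, not part of the SSSOM model): constructing a member from its value gives back the member itself, so the comparison that fails above holds as expected.

```python
from enum import Enum

class PredicateModifier(Enum):  # toy stand-in, not the LinkML-generated class
    Not = "Not"

# Looking a member up by value returns the member object itself...
print(PredicateModifier("Not") is PredicateModifier.Not)  # True
# ...so the equality that fails for the LinkML-generated enum holds here
print(PredicateModifier("Not") == PredicateModifier.Not)  # True
```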
Once you have the modifier value, the normal, LinkML-intended way to compare it to one of the permissible values is to do this:
```python
>>> modifier == PredicateModifierEnum(PredicateModifierEnum.Not)
True
```
This is all by design. LinkML developers think the above is perfectly fine as long as it is clearly documented (which it is not, incidentally, unless you consider that a GitHub ticket counts as documentation).
So for the “an enum-typed slot is always represented as a plain `str`” proposal, again LinkML developers have already said “no”.
SSSOM-Py’s own behaviours
Independently of LinkML’s inherent (and, in my not-so-humble opinion, highly dubious) behaviours, there is also the fact that SSSOM-Py does its own things on top of the LinkML runtime. Notably, it regularly goes back and forth between the LinkML-intended representation of a mapping set (as an instance of the MappingSet class, containing instances of the Mapping class) and its own, completely custom representation as a MappingSetDataFrame.
It is probably during those repeated conversions between a MappingSet and a MappingSetDataFrame that value types get completely mangled, so that we end up with a situation in which the same slot can be represented using completely different types depending on which parser it came from and on whether the slot is in the metadata part or the data frame part of a MappingSetDataFrame.
There is obviously nothing that LinkML can do for us here.
To rule out any misunderstanding, I am not suggesting any of the following:
- that we stop using LinkML altogether for everything, including defining the model itself;
- that we stop using the LinkML runtime for everything;
- that we remove the LinkML-generated classes and enums (everything in `sssom_schema.datamodel`).
What I am suggesting is that, when a MappingSetDataFrame object is used, that object should only use “standard” Python types that behave the way a typical Python developer would expect. I think this is the logical conclusion of the design decision (which was made a long time ago) of having this MappingSetDataFrame thing at all, which is already a very different way of representing a mapping set than what the LinkML-generated classes are imposing.
Now for how to do that, I can see two different approaches:
(A) Keep the existing back and forth between MappingSetDataFrame (SSSOM-Py’s own custom representation) and MappingSet (the LinkML-defined representation), but amend the methods that perform the conversion from MappingSet to MappingSetDataFrame (e.g., `MappingSetDataFrame.from_mapping_set()` and the likes) so that they always take care of converting the weird LinkML types into the standard Python types we want.
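A minimal sketch of what such a normalisation step could look like for date-typed slots (the helper name and its scope are invented here; a real version would have to cover enums and multi-valued slots too):

```python
import datetime

def normalise_date(value):
    """Hypothetical helper: coerce a date slot to datetime.date, whatever it arrived as."""
    if isinstance(value, datetime.date):
        return value
    # LinkML's XSDDate is string-based, so an ISO-8601 string branch covers it too
    return datetime.date.fromisoformat(str(value))

# Both representations now normalise to the same comparable value
print(normalise_date("2024-01-15") == normalise_date(datetime.date(2024, 1, 15)))  # True
```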
(B) Remove all the back and forth between MappingSetDataFrame and MappingSet for all routine SSSOM-Py operations. This way, we no longer have to worry about what the repeated conversions between the two forms can do to the types of the values.
In (B), we can still keep the MappingSet representation and the conversion methods between that representation and the MappingSetDataFrame (so that people who, for whatever reason, need the “pure LinkML” representation of a mapping set can still obtain it), but that representation would never be used within SSSOM-Py.
This means, for example, that the sssom.rdf_util.rewire_graph function would have to be rewritten so that it does not do this kind of thing:
```python
mdoc = to_mapping_set_document(mset)
for m in mdoc.mapping_set.mappings:
    if m.predicate_id in {"owl:equivalentClass", "owl:equivalentProperty"}:
        ...
```
(That is, converting a MSDF into a MappingSet just so that we can iterate over the mappings and access the slots as normal Python attributes.)
Instead, it would have to do something like this:
```python
for _, row in mset.df.iterrows():
    if row[PREDICATE_ID] in ["owl:equivalentClass", "owl:equivalentProperty"]:
        ...
```
(That is, staying in the MappingSetDataFrame representation. That is what I fundamentally mean when I say we should “go all the way” with the MappingSetDataFrame concept, instead of constantly juggling between two completely different representations.)
Importantly, the (B) approach implies that we should rewrite all our parsers and writers so they are no longer using the loaders and dumpers of the LinkML runtime (since those loaders and dumpers use the MappingSet representation, imposing some back and forth between that representation and the MappingSetDataFrame one). Instead, we should write our own parsers/writers that can directly produce/write MappingSetDataFrame objects without ever going through the MappingSet representation (that is in fact already the case for the SSSOM/TSV writer; we should generalise that to all parsers and writers; my #632 PR is doing that for the SSSOM/RDF format).
So, obviously, the (B) approach would represent a sizeable amount of work. Still, I personally believe it is the best one. Not only would it get rid of the type-mangling issue, it would also:
- make the overall code more efficient (by suppressing the repeated conversions between the two representations);
- allow us to support extension slots (which currently cannot be supported, since even if a `MappingSetDataFrame` can be made to contain slots that are not defined in the LinkML model, any conversion to a `MappingSet` object would automatically get rid of those slots);
- give us full control over the serialisations.
Of course the two approaches can coexist. In particular, we can use (A) (conversion methods between MappingSet and MappingSetDataFrame that ensure that the correct types are used) as a reasonably quick stop-gap measure while we progressively update the entire SSSOM-Py code base so that it stops using the MappingSet representation (and therefore stops constantly converting from one representation to another).
If you want an example of what I mean by “the back and forth between the two representations”, look at the SSSOM/TSV parsing code:
```python
mapping_set = _get_mapping_set_from_df(df=df, meta=meta)
doc = MappingSetDocument(mapping_set=mapping_set, converter=converter)
return to_mapping_set_dataframe(doc)
```
We initially have a plain Python dictionary (as obtained from the YAML parser) and a Pandas data frame (as obtained, well, from the Pandas parser), and we turn them into a MappingSet object (containing Mapping objects in its mappings slot) only to immediately turn that object back into a MappingSetDataFrame (i.e., basically into a plain Python dictionary and a Pandas data frame).
All of that so that we can use the LinkML-generated code in the constructors of the MappingSet and Mapping classes to validate the contents of the original dictionary and data frame… Frankly, I think we would be much better off doing the validation ourselves at this point.
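Doing the validation ourselves need not be heavyweight. For instance, a first-pass structural check on the data frame could be as simple as the sketch below (the required-slot subset and function name are illustrative only; the real list of required slots comes from the SSSOM model):

```python
import pandas as pd

# Illustrative subset of required mapping slots (not the authoritative list)
REQUIRED_SLOTS = {"subject_id", "predicate_id", "object_id", "mapping_justification"}

def check_required_slots(df: pd.DataFrame) -> None:
    """Hypothetical first-pass check: all required columns are present."""
    missing = REQUIRED_SLOTS - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")

df = pd.DataFrame(columns=sorted(REQUIRED_SLOTS))
check_required_slots(df)  # passes silently when all required columns exist
```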
Have you investigated the technical overhead you would face by switching to Pydantic serialization for the model instead of Python data classes? The LinkML community is moving (slowly) towards this, but we have a lot of folks still using dataclasses.
> Have you investigated the technical overhead you would face by switching to Pydantic serialization for the model instead of Python data classes?
No (at least I haven’t – maybe someone else did), but even if we switched to Pydantic classes that would not change the fundamental problem of SSSOM-Py, which is that we are constantly juggling between two distinct representations:
- the LinkML-generated dataclass-based representation (`sssom_schema.MappingSet` and the likes);
- our own Pandas-based representation (dictionary + data frame, i.e. the `sssom.util.MappingSetDataFrame` class).
If we simply replace the dataclass-based representation by the Pydantic-based one, then instead of constantly juggling between a dataclass-based MappingSet and MappingSetDataFrame, we would constantly juggle between a Pydantic-based MappingSet and MappingSetDataFrame.
Ultimately, SSSOM-Py developers need to decide once and for all how they want mapping set objects to be represented and manipulated in Python.
We have, not two, but three different possibilities:
- a mapping set is represented as an instance of a LinkML-generated, dataclass-based `MappingSet` class;
- a mapping set is represented as an instance of a LinkML-generated, Pydantic-based `MappingSet` class;
- a mapping set is represented as an instance of a custom `MappingSetDataFrame` class.
The first and third possibilities currently co-exist (with difficulty…) in SSSOM-Py.
I personally think that ideally, SSSOM-Py should provide one, and only one, way of representing a mapping set. Having several completely different representations for the same objects is just needlessly confusing, and a great source of inefficiency and bugs (because of the aforementioned constant back and forth between the representations).
And since most of the existing code base (and more importantly most of the public API) revolves around the MappingSetDataFrame representation, I believe this is the one that we should keep (for all its flaws). If we do keep in parallel a LinkML-generated representation (be it the dataclass-based one or the Pydantic one), its use should be kept minimal – in particular there should be no constant back and forth between it and MappingSetDataFrame.
Now what do SSSOM-Py developers think and what do they want to do?
> If we simply replace the dataclass-based representation by the Pydantic-based one, then instead of constantly juggling between a dataclass-based `MappingSet` and `MappingSetDataFrame`, we would constantly juggle between a Pydantic-based `MappingSet` and `MappingSetDataFrame`.
Though there would be at least one clear benefit of such a switch: the Pydantic-based representation does optionally allow the use of “extra” fields that are not present in the model (`gen-pydantic --extra-fields=allow`), which should allow us to represent SSSOM’s “extension slots”.
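For reference, this is what allowing extra fields looks like on a Pydantic v2 model (a toy model, not the generated sssom_schema one; the `extra="allow"` config is what I understand the `--extra-fields=allow` flag produces):

```python
from pydantic import BaseModel, ConfigDict

class Mapping(BaseModel):  # toy stand-in for the generated class
    model_config = ConfigDict(extra="allow")  # keep fields not declared on the model
    subject_id: str
    object_id: str

# An undeclared field (an "extension slot") is kept rather than rejected
m = Mapping(subject_id="FBbt:1234", object_id="UBERON:5678", ext_custom="extension value")
print(m.model_extra)  # {'ext_custom': 'extension value'}
```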