
Include the generated files into Python library

Open ielis opened this issue 3 months ago • 6 comments

The PR improves the code of phenopackets - the library for working with phenopackets in Python.

Current state of things

phenopackets consists of the Python bindings for the Phenopacket Schema building blocks. The library allows creating phenopackets programmatically, as well as JSON and protobuf I/O.

All building blocks are exported from the top level package:

from phenopackets import OntologyClass, Individual, Phenopacket, Gene

or

import phenopackets as pp

oc = pp.OntologyClass()
i = pp.Individual()
p = pp.Phenopacket()

The issue

We do not "namespace" the building blocks. For instance, to use VRS Gene, we do:

from phenopackets import Gene

This is not good, because Gene is here to stay and we cannot add another Gene to the Schema without breaking changes. On top of this, we cannot work with more than one Schema version at a time (unlike Java).

The proposal

The PR updates the deployment script to maintain the hierarchy set by the proto files. For the time being (open to discussion), we keep the v2.0.2 elements at the top-level, in order to not break existing code. So, the code below will still work:

from phenopackets import OntologyClass, Individual, Phenopacket, Gene

However, we should discourage users from relying on the top-level imports from now on; this is the new way to import the building blocks:

from phenopackets.schema.v2.base_pb2 import OntologyClass
from phenopackets.schema.v2.individual_pb2 import Individual
from phenopackets.schema.v2.phenopackets_pb2 import Phenopacket
from ga4gh.vrs.v1.vrs_pb2 import Gene

These imports are longer, but they provide several benefits:

  • the package structure follows the Phenopacket Schema proto files more closely. I think this is good, since the proto files are the source of truth.
  • the package structure now allows working with v1.0.0 elements, or future iterations. The following should work after the PR is merged:
    from phenopackets.schema.v1.phenopackets_pb2 import Phenopacket
    
  • the deployment is simpler. The library follows the proto files with close to no human intervention required.

@julesjacobsen @cmungall @pnrobinson please have a look and let me know your thoughts. I hope the PR is not too big; we tried to document it well. I am not asking @iimpulse for review since we worked on this together.

ielis avatar Apr 04 '24 17:04 ielis

This is pretty non-idiomatic Python.

  • in general you don't distribute top level namespaces alongside your own. Occasionally when vendoring is really necessary it's made a sub-namespace
  • versioning isn't typically done at the module level

I would say if we are truly reusing a separate package's concept of gene, then let's coordinate with them, have them release a vrs package to pypi, and just reuse that. If we truly need to vary this then use standard vendoring patterns.

I don't understand the need for versioning at the module/sub-package level. Why not just use standard python versioning? Poetry has really good support for this, we use this elsewhere in Monarch. Just have releases 1.x.x. and 2.x.x on PyPI, and at some point 3.0.0rc1 and then 3.0.0, etc. Follow standard semantic versioning.

One exception to this is when breaking changes are introduced, e.g. with pydantic v2 there is a compatibility layer, pydantic.v1, which allows for a more graceful transition.
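That compatibility layer looks like this in practice (a small sketch assuming pydantic v2 is installed; the model and field are made up for the example):

```python
# With pydantic v2 installed, v1-style models stay importable under the
# pydantic.v1 namespace, so old code can migrate one model at a time.
from pydantic.v1 import BaseModel as V1BaseModel

class Subject(V1BaseModel):
    id: str

subject = Subject(id="patient-1")
```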

A more radical approach would be to have ga4gh at the top level, with subpackages for ga4gh.phenopackets, ga4gh.vrs, ... I don't think ga4gh is coherent enough yet though, so I think you're better off having phenopackets at the top level.

I can see the point about aligning with the protobuf, but is the protobuf ideally organized? Perhaps this could be reorganized after v2? (and perhaps use something less restrictive than protobuf...)

cmungall avatar Apr 10 '24 18:04 cmungall

Hi @cmungall

yes, these are good points.

in general you don't distribute top level namespaces alongside your own. Occasionally when vendoring is really necessary it's made a sub-namespace

I assume you are referring to the ga4gh.vrs.v1.vrs_pb2 part. Unfortunately, I am not sure we can get around this without breaking changes, due to limitations of the code generated by the protobuf compiler. The code generated for the phenopacket blocks only works if the VRS parts are importable from ga4gh.vrs.v1.vrs_pb2. This is due to the project setup:

[screenshot of the repository's proto file layout, showing the v1 and v2 directories]

So yes, we would need the VRS people to release Java and Python artifacts. I am not sure, however, that I would be heard there. Perhaps @cmungall, @julesjacobsen or @pnrobinson can convince them that they must deploy versions to PyPI if they want their code to be used? As of now, the generated code does not clash with the existing VRS packages on PyPI; we tested that with @iimpulse.

I don't understand the need for versioning at the module/sub-package level. Why not just use standard python versioning? Poetry has really good support for this, we use this elsewhere in Monarch. Just have releases 1.x.x. and 2.x.x on PyPI, and at some point 3.0.0rc1 and then 3.0.0, etc. Follow standard semantic versioning.

Two things. First, we already have this structure in the repo for some reason (see the screenshot above for v1 and v2), so now the Python code would follow the protobuf files more closely. Second, at some point we will want to remap version x to version x+1, and having more than one version of the data model available can simplify the mapping code, as we do in phenopacket-tools. I think it may be good to keep the code for the latest major version and the previous one, e.g. v2 and v1. An alternative would be to dump v1 to e.g. JSON and write a parser/converter that maps the JSON to v2. The aim here is to have a strategy for schema evolution, and it is open for discussion.

A more radical approach would be to have ga4gh at the top level, have subpackages for ga4gh.phenopackets,vrs,... I don't think ga4gh is coherent enough yet though, so I think your better having phenopackets at the top level.

I can see the point about aligning with the protobuf, but is the protobuf ideally organized? Perhaps this could be reorganized after v2? (and perhaps use something less restrictive than protobuf...)

Protobuf indeed has some quirks, and I would be happy to participate in discussions regarding what is best for v3. However, until we get there, I think there are ways to add the enhancements we need without breaking changes.


Overall, I think your points regarding the namespaces, versions, and protobuf are good. Do you think they should be addressed within this PR, or would addressing them later be sufficient?

ielis avatar Apr 11 '24 15:04 ielis

To add to the above, I would support exploring a LinkML approach for a version 2.1 or 3 of the schema. There are many moving parts that constrain the current version to have some of the oddities described above, but these are implementation details that extremely few users will care about, and software should hide them from everybody but developers. It would be great, @cmungall, to find some time to discuss whether this is a possibility.

pnrobinson avatar Apr 11 '24 17:04 pnrobinson

@ahwagner 👀 👆

monicacecilia avatar Apr 22 '24 04:04 monicacecilia

I'd like to add that all planning and provisions for a potential v3 of Phenopackets must originate in, have their requirements assessed by, be discussed within, and receive approval from GA4GH ClinPheno. @ielis, if you aren't already, let's ensure you onboard to GA4GH and join the ClinPheno WS.

@pnrobinson @julesjacobsen @cmungall @ahwagner @mcourtot -- 👀 👆

monicacecilia avatar Apr 22 '24 04:04 monicacecilia

Hi @monicacecilia, it is understood that the updates must be discussed in the manner you outlined. This PR makes no backward-incompatible changes - everything that worked in v2.0.2 or v2.0.0 will continue to work, in line with semantic versioning guidelines.

ielis avatar Apr 22 '24 17:04 ielis