ProteinBert Support?
Hi, What would be needed in order to add support for ProteinBert embeddings?
(The framework/package is written with Keras/TF, and also supports GO annotations): https://github.com/nadavbra/protein_bert
Hi @ddofer , thanks for the ping.
We atomicize the operations. So what you propose has two components:
- The embedder (gets a protein sequence --> returns a per-residue/per-protein embedding). This goes in the `embed` stage of the pipeline
- The supervised prediction models (gets a protein embedding from your model --> returns a prediction). This goes in the `extract` stage of the pipeline
I'm going to give a high-level overview of 1 for now, and will write up something about 2 later. Mind that the embedder part is the "harder" one; feature extraction tends to be much easier :)
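Just to make the two stages concrete, here's a minimal sketch of how they compose. Everything here is illustrative: the function names, shapes, and the dummy "model" are assumptions for the sketch, not the actual bio_embeddings API.

```python
import numpy as np

EMBED_DIM = 8  # hypothetical embedding dimension for this sketch

def embed(sequence: str) -> np.ndarray:
    """`embed` stage: sequence -> per-residue embedding of shape (L, d).
    A real embedder would run the model; this stub fakes it deterministically."""
    rng = np.random.default_rng(sum(map(ord, sequence)))
    return rng.standard_normal((len(sequence), EMBED_DIM))

def reduce_per_protein(per_residue: np.ndarray) -> np.ndarray:
    """Collapse the (L, d) per-residue embedding into a fixed-size per-protein vector."""
    return per_residue.mean(axis=0)

def extract(per_protein: np.ndarray) -> str:
    """`extract` stage: embedding -> prediction (here, a dummy binary label)."""
    return "membrane" if per_protein.sum() > 0 else "soluble"

per_residue = embed("MKTAYIAKQR")
per_protein = reduce_per_protein(per_residue)
print(per_residue.shape)   # (10, 8)
print(per_protein.shape)   # (8,)
print(extract(per_protein))
```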
Creating an embedder
To create a new embedder
- you should create a new class in the embedders dir, which should extend the base embedder interface. Here, you have two options:
- directly use your package and write an adapter
- Re-implement model loading and embedding generation.
- you should somehow pass us the required files or the directory containing the weights of your model, which I will make available online and add here: https://github.com/sacdallago/bio_embeddings/blob/develop/bio_embeddings/utilities/defaults.yml
- To make the embedder part of the pipeline, you need to extend the pipeline.py file by adding a new protocol
- Define any new dependencies as an additional extra
- [Optional, but better] add some minimal tests
I would suggest starting with the new "embedder" class (my first point). To do so, I suggest you fork the repo and open a PR as soon as you change something; both @konstin and I will help you out through the process ;)
To add to what @sacdallago said:
- I'd recommend using the pytorch version of your model if possible, since most of our models are currently pytorch and none are tensorflow
- Implementing `EmbedderInterface` in a new embedder class is the core task; the remainder is just boilerplate to connect stuff
- The simplest embedder is `OneHotEncodingEmbedder`; a more complex example is the other, huggingface-based Bert, `ProtTransBertBFDEmbedder`, which includes optional features such as batching (which in some cases greatly improves inference time) and CPU fallback (for when a protein is too long to be embedded in GPU memory)
- For adding a test, see here and add your embedder class there. Note that running the entire test suite takes a long time, so you might want to only run the one you added.
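The CPU-fallback pattern mentioned above can be sketched in plain Python. The two `embed_on_*` functions and the error class are placeholders I made up for the sketch; in a real torch-based embedder you would catch the CUDA out-of-memory `RuntimeError` around the GPU forward pass instead.

```python
import numpy as np

EMBED_DIM = 16          # hypothetical embedding dimension
GPU_MAX_RESIDUES = 512  # hypothetical per-device length limit

class FakeOutOfMemoryError(RuntimeError):
    """Stands in for a CUDA out-of-memory error in this sketch."""

def embed_on_gpu(sequence: str) -> np.ndarray:
    # Placeholder for the real GPU forward pass.
    if len(sequence) > GPU_MAX_RESIDUES:
        raise FakeOutOfMemoryError("CUDA out of memory")
    return np.zeros((len(sequence), EMBED_DIM))

def embed_on_cpu(sequence: str) -> np.ndarray:
    # Placeholder for the (slower, but memory-bound only by RAM) CPU forward pass.
    return np.zeros((len(sequence), EMBED_DIM))

def embed_with_fallback(sequence: str) -> np.ndarray:
    """Try the GPU first; fall back to CPU when the protein does not fit."""
    try:
        return embed_on_gpu(sequence)
    except FakeOutOfMemoryError:
        return embed_on_cpu(sequence)

print(embed_with_fallback("M" * 100).shape)   # (100, 16)
print(embed_with_fallback("M" * 1000).shape)  # (1000, 16)
```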
Personally I think it's great to have another smaller bert model!
@ddofer Any news? :)