
ProteinBert Support?

Open ddofer opened this issue 4 years ago • 3 comments

Hi, what would be needed in order to add support for ProteinBert embeddings?

(The framework/package is written with Keras/TF, and also supports GO annotations): https://github.com/nadavbra/protein_bert

ddofer avatar Jun 14 '21 08:06 ddofer

Hi @ddofer , thanks for the ping.

We split operations into atomic stages, so what you propose has two components:

  1. The embedder (gets a protein sequence --> returns a per-residue/protein embedding). This goes in the embed stage of the pipeline
  2. The supervised prediction models (gets a protein embedding from your model --> returns a prediction). This goes in the extract stage of the pipeline
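The two stages above can be sketched conceptually as follows. This is only an illustration of the embed/extract split; the function names (`embed_sequence`, `extract_features`) are hypothetical and not the actual bio_embeddings API.

```python
# Conceptual sketch of the two pipeline stages: embed and extract.
# Names and the zero "model" are placeholders, not the real API.

from typing import List

def embed_sequence(sequence: str, dim: int = 4) -> List[List[float]]:
    """Embed stage: sequence -> per-residue embedding (len(sequence) x dim).
    A real embedder would run a trained model; here we just return zeros."""
    return [[0.0] * dim for _ in sequence]

def extract_features(per_residue: List[List[float]]) -> List[float]:
    """Extract stage: reduce per-residue embeddings to one fixed-size
    per-protein vector (here: mean over residues) that a supervised
    prediction model could consume."""
    n = len(per_residue)
    return [sum(res[i] for res in per_residue) / n
            for i in range(len(per_residue[0]))]

per_residue = embed_sequence("MKTAYIAKQR")
per_protein = extract_features(per_residue)
print(len(per_residue), len(per_protein))  # 10 4
```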

I'm going to give a high-level overview of 1 for now, and will write up something about 2 later. Mind that the embedder part is the "harder" one; the feature extraction tends to be much easier :)


Creating an embedder

To create a new embedder

  • you should create a new class in the embedders dir, which should extend the base embedder interface. Here, you have two options:
    • directly use your package and write an adapter
    • re-implement model loading and embedding generation
  • you should somehow pass us the required files or directory containing the weights of your model, which I will make available online, and add here: https://github.com/sacdallago/bio_embeddings/blob/develop/bio_embeddings/utilities/defaults.yml
  • To make the embedder part of the pipeline, you need to extend the pipeline.py file by adding a new protocol
  • Define any new dependencies as an additional extra
  • [Optional but better] some minimal tests
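A minimal sketch of what the adapter option might look like. Note the `EmbedderInterface` below is a simplified stand-in for the real base class in bio_embeddings, and `ProteinBertEmbedder` with its internals is entirely hypothetical; a real adapter would load and call the actual protein_bert model.

```python
# Sketch of an embedder adapter. EmbedderInterface here is a simplified
# stand-in for the real bio_embeddings base class; ProteinBertEmbedder
# is a hypothetical placeholder wrapping the protein_bert model.

from abc import ABC, abstractmethod
from typing import List

class EmbedderInterface(ABC):  # simplified stand-in, not the real interface
    embedding_dimension: int

    @abstractmethod
    def embed(self, sequence: str) -> List[List[float]]:
        """Return a per-residue embedding of shape (len(sequence), dim)."""

    def reduce_per_protein(self, embedding: List[List[float]]) -> List[float]:
        """Default per-protein reduction: mean over residues."""
        n = len(embedding)
        return [sum(col) / n for col in zip(*embedding)]

class ProteinBertEmbedder(EmbedderInterface):
    embedding_dimension = 8  # placeholder; the real model defines this

    def __init__(self, weights_file: str = "proteinbert_weights"):
        # A real adapter would load the Keras/TF model from weights_file here.
        self.weights_file = weights_file

    def embed(self, sequence: str) -> List[List[float]]:
        # Placeholder: a real implementation would tokenize the sequence
        # and run the ProteinBert forward pass.
        return [[0.0] * self.embedding_dimension for _ in sequence]

embedder = ProteinBertEmbedder()
per_residue = embedder.embed("MKV")
print(len(per_residue), len(embedder.reduce_per_protein(per_residue)))  # 3 8
```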

I would suggest starting with creating a new "embedder" class (my first point). To do so, I suggest you fork the repo and open a PR as soon as you change something; both @konstin and I will help you through the process ;)

sacdallago avatar Jun 15 '21 09:06 sacdallago

To add to what @sacdallago said:

  • I'd recommend using the PyTorch version of your model if possible, since most of our models are currently PyTorch and none are TensorFlow
  • Implementing EmbedderInterface in a new embedder class is the core task, the remainder is just boilerplate to connect stuff
  • The simplest embedder is OneHotEncodingEmbedder; a more complex example is the other, huggingface-based BERT, ProtTransBertBFDEmbedder, which includes optional features such as batching (which in some cases greatly improves inference time) and CPU fallback (for when a protein is too long to be embedded in GPU memory)
  • For adding a test, see here and add your embedder class there. Note that running the entire test suite takes a long time, so you might want to only run the one you added.
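The CPU-fallback behavior mentioned above can be sketched like this. The functions `embed_on_gpu`/`embed_on_cpu` and the residue limit are hypothetical stand-ins; in the real embedders this is implemented with torch, catching the runtime error raised on CUDA out-of-memory.

```python
# Sketch of the CPU-fallback pattern: try the GPU first, and if the
# sequence is too long for GPU memory, embed it on CPU instead.
# All names and the limit below are illustrative placeholders.

from typing import List

GPU_MAX_RESIDUES = 1000  # pretend GPU memory limit for this sketch

def embed_on_gpu(sequence: str) -> List[List[float]]:
    if len(sequence) > GPU_MAX_RESIDUES:
        raise RuntimeError("out of memory")  # mimics a CUDA OOM error
    return [[0.0] * 4 for _ in sequence]

def embed_on_cpu(sequence: str) -> List[List[float]]:
    # Slower, but not limited by GPU memory.
    return [[0.0] * 4 for _ in sequence]

def embed_with_fallback(sequence: str) -> List[List[float]]:
    try:
        return embed_on_gpu(sequence)
    except RuntimeError:
        # Proteins that do not fit in GPU memory fall back to the CPU.
        return embed_on_cpu(sequence)

print(len(embed_with_fallback("M" * 2000)))  # 2000
```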

Personally, I think it's great to have another, smaller BERT model!

konstin avatar Jun 15 '21 21:06 konstin

@ddofer Any news? :)

prihoda avatar Jan 14 '22 11:01 prihoda