
ProteinBert Support?

Open ddofer opened this issue 4 years ago • 3 comments

Hi, what would be needed in order to add support for ProteinBert embeddings?

(The framework/package is written with Keras/TF, and also supports GO annotations): https://github.com/nadavbra/protein_bert

ddofer avatar Jun 14 '21 08:06 ddofer

Hi @ddofer , thanks for the ping.

We split operations into atomic stages, so what you propose has two components:

  1. The embedder (gets a protein sequence --> returns a per-residue/protein embedding). This goes in the embed stage of the pipeline
  2. The supervised prediction models (gets a protein embedding from your model --> returns a prediction). This goes in the extract stage of the pipeline
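The two stages above can be sketched conceptually as follows. This is only an illustration of the embed/extract split; the function names (`embed_sequence`, `extract_features`) are hypothetical and not the actual bio_embeddings API.

```python
# Conceptual sketch of the two pipeline stages: embed and extract.
# Names and the zero "model" are placeholders, not the real API.

from typing import List

def embed_sequence(sequence: str, dim: int = 4) -> List[List[float]]:
    """Embed stage: sequence -> per-residue embedding (len(sequence) x dim).
    A real embedder would run a trained model; here we just return zeros."""
    return [[0.0] * dim for _ in sequence]

def extract_features(per_residue: List[List[float]]) -> List[float]:
    """Extract stage: reduce per-residue embeddings to one fixed-size
    per-protein vector (here: mean over residues) that a supervised
    prediction model could consume."""
    n = len(per_residue)
    return [sum(res[i] for res in per_residue) / n
            for i in range(len(per_residue[0]))]

per_residue = embed_sequence("MKTAYIAKQR")
per_protein = extract_features(per_residue)
print(len(per_residue), len(per_protein))  # 10 4
```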

I'm going to give a high-level overview of 1 for now, and will write up something about 2 later. Mind that the embedder part is the "harder" one; the feature extraction tends to be much easier :)


Creating an embedder

To create a new embedder

  • you should create a new class in the embedders dir, which should extend the base embedder interface. Here, you have two options:
    • directly use your package and write an adapter
    • re-implement model loading and embedding generation
  • you should somehow pass us the required files or directory containing the weights of your model, which I will make available online, and add here: https://github.com/sacdallago/bio_embeddings/blob/develop/bio_embeddings/utilities/defaults.yml
  • To make the embedder part of the pipeline, you need to extend the pipeline.py file by adding a new protocol
  • Define any new dependencies as an additional extra
  • [Optional but better] some minimal tests
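A minimal sketch of what the adapter option might look like. Note the `EmbedderInterface` below is a simplified stand-in for the real base class in bio_embeddings, and `ProteinBertEmbedder` with its internals is entirely hypothetical; a real adapter would load and call the actual protein_bert model.

```python
# Sketch of an embedder adapter. EmbedderInterface here is a simplified
# stand-in for the real bio_embeddings base class; ProteinBertEmbedder
# is a hypothetical placeholder wrapping the protein_bert model.

from abc import ABC, abstractmethod
from typing import List

class EmbedderInterface(ABC):  # simplified stand-in, not the real interface
    embedding_dimension: int

    @abstractmethod
    def embed(self, sequence: str) -> List[List[float]]:
        """Return a per-residue embedding of shape (len(sequence), dim)."""

    def reduce_per_protein(self, embedding: List[List[float]]) -> List[float]:
        """Default per-protein reduction: mean over residues."""
        n = len(embedding)
        return [sum(col) / n for col in zip(*embedding)]

class ProteinBertEmbedder(EmbedderInterface):
    embedding_dimension = 8  # placeholder; the real model defines this

    def __init__(self, weights_file: str = "proteinbert_weights"):
        # A real adapter would load the Keras/TF model from weights_file here.
        self.weights_file = weights_file

    def embed(self, sequence: str) -> List[List[float]]:
        # Placeholder: a real implementation would tokenize the sequence
        # and run the ProteinBert forward pass.
        return [[0.0] * self.embedding_dimension for _ in sequence]

embedder = ProteinBertEmbedder()
per_residue = embedder.embed("MKV")
print(len(per_residue), len(embedder.reduce_per_protein(per_residue)))  # 3 8
```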

I would suggest starting with creating a new "embedder" class (my first point). To do so, I suggest you fork the repo and open a PR as soon as you change something; both @konstin and I will help you through the process ;)

sacdallago avatar Jun 15 '21 09:06 sacdallago

To add to what @sacdallago said:

  • I'd recommend using the PyTorch version of your model if possible, since most of our models are currently PyTorch and none are TensorFlow
  • Implementing EmbedderInterface in a new embedder class is the core task, the remainder is just boilerplate to connect stuff
  • The simplest embedder is OneHotEncodingEmbedder; a more complex example is the other, huggingface-based BERT, ProtTransBertBFDEmbedder, which includes optional features such as batching (which in some cases greatly improves inference time) and CPU fallback (for when a protein is too long to be embedded in GPU memory)
  • For adding a test, see here and add your embedder class there. Note that running the entire test suite takes a long time, so you might want to only run the one you added.
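The CPU-fallback behavior mentioned above can be sketched like this. The functions `embed_on_gpu`/`embed_on_cpu` and the residue limit are hypothetical stand-ins; in the real embedders this is implemented with torch, catching the runtime error raised on CUDA out-of-memory.

```python
# Sketch of the CPU-fallback pattern: try the GPU first, and if the
# sequence is too long for GPU memory, embed it on CPU instead.
# All names and the limit below are illustrative placeholders.

from typing import List

GPU_MAX_RESIDUES = 1000  # pretend GPU memory limit for this sketch

def embed_on_gpu(sequence: str) -> List[List[float]]:
    if len(sequence) > GPU_MAX_RESIDUES:
        raise RuntimeError("out of memory")  # mimics a CUDA OOM error
    return [[0.0] * 4 for _ in sequence]

def embed_on_cpu(sequence: str) -> List[List[float]]:
    # Slower, but not limited by GPU memory.
    return [[0.0] * 4 for _ in sequence]

def embed_with_fallback(sequence: str) -> List[List[float]]:
    try:
        return embed_on_gpu(sequence)
    except RuntimeError:
        # Proteins that do not fit in GPU memory fall back to the CPU.
        return embed_on_cpu(sequence)

print(len(embed_with_fallback("M" * 2000)))  # 2000
```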

Personally, I think it's great to have another, smaller BERT model!

konstin avatar Jun 15 '21 21:06 konstin

@ddofer Any news? :)

prihoda avatar Jan 14 '22 11:01 prihoda