bounty-program Privacy preserving prediction of molecular properties

Category: Application

Overview

We at VaultChem, a startup combining encryption and chemistry, aim to use Zama's Concrete ML library for FHE inference of molecular properties. Consider a scenario where pharma company A wants to predict properties of candidate molecules before phase I clinical trials.

In particular, understanding the processes of absorption, distribution, metabolism, and excretion (ADME) is crucial for determining a drug candidate's concentration profile at its action site, significantly impacting the drug's effectiveness.

However, A does not have sufficient data available for reliable ML predictions. Instead, A will securely obtain predictions on its molecular data from an untrusted party B that owns a secret database and an ML model trained on sufficient data. FHE makes this possible because B computes only on encrypted inputs, so A's secret query is never revealed to B.
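To make the data flow between A and B concrete, here is a toy sketch of the exchange for a linear model. It is an illustration only: it uses the additively homomorphic Paillier scheme (standard library only), not the TFHE-based FHE that Concrete ML provides; the key is far too small to be secure; and the function names, feature values, and weights are all hypothetical.

```python
import math
import secrets

# Toy Paillier key material built from two known Mersenne primes.
# Illustrative only: a modulus of roughly 92 bits offers no real security.
P, Q = 2**31 - 1, 2**61 - 1


def keygen(p=P, q=Q):
    """Party A: create a public modulus n and a private key (lam, mu)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse, needs Python 3.8+
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m under public key n (generator g = n + 1)."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1  # random blinding factor in [1, n-1]
    while math.gcd(r, n) != 1:        # practically never triggers
        r = secrets.randbelow(n - 1) + 1
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def decrypt(n, priv, c):
    """Party A: decrypt and map the result back to a signed integer."""
    lam, mu = priv
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m


def encrypted_linear_predict(n, enc_x, weights, bias):
    """Party B: compute Enc(w . x + b) without ever seeing x."""
    n2 = n * n
    acc = encrypt(n, bias)
    for c, w in zip(enc_x, weights):
        # Multiplying ciphertexts adds plaintexts; exponentiation scales them.
        acc = acc * pow(c, w % n, n2) % n2
    return acc


# Made-up integer (quantized) features and model, for illustration only.
n, priv = keygen()
x = [3, 0, 5, 1]                     # A's secret query
w, b = [2, -7, 4, 1], 10             # B's secret linear model
enc_x = [encrypt(n, v) for v in x]   # sent from A to B
enc_y = encrypted_linear_predict(n, enc_x, w, b)  # sent from B back to A
y = decrypt(n, priv, enc_y)          # only A can read the prediction
```

B only ever handles `enc_x` and `enc_y`, which is the property the scenario relies on. A linear model fits this additive toy scheme; non-linear models are where a full FHE scheme is needed.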

We will simulate this scenario using open-source chemistry datasets. We will provide tools (based on the cheminformatics library RDKit and on Concrete ML) that give an end-to-end solution to the problem of privacy-preserving prediction. We will deploy the app to Hugging Face (similar to the FHE image filter) and provide detailed tutorials/notebooks that explain each step. Finally, by comparing against scikit-learn implementations, we will also investigate the trade-off between accuracy and computational cost, since computational screening in cheminformatics may require fast predictions on thousands of molecules. We will provide an outlook on how to account for the increased computational cost of FHE inference in the case of molecular data.

Total Reward: 3500 € (split by milestones)

Description

- The main goal is to implement a toolchain that lets users input molecular data and use Zama's Concrete ML library to predict the properties of these molecules. The aim is to show the feasibility of Concrete ML in the field of chemistry and pharmaceutical data, and to make a demo available on Hugging Face.

Milestones
1. Data processing and identification of best-performing model in Concrete ML
  1. Time estimation: 3 days
  2. Reward: 2000 €
  3. Tasks:
    1. Prepare processing of chemical data with RDKit for use with ML.
    2. Compare accuracy and prediction time for various built-in Concrete ML models, identify models suitable for deployment, and train them for FHE execution.
  4. Deliverables: Scripts for data preparation and training. Save the identified best model.
2. Model deployment on Hugging Face
  1. Time estimation: 2 days
  2. Reward: 1000 €
  3. Tasks:
    1. Deploy the model to our Hugging Face space with simple user input, following the same logic as the FHE image filter. The user inputs the molecules in SMILES format ([Simplified molecular-input line-entry system](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system)).
    2. Add visual elements that explain FHE in the case of molecular property prediction.
  4. Deliverables: All files needed for the Hugging Face space, adapted to run the model identified in the previous step. Interface written in Gradio. Input molecules from the user will be displayed graphically. In addition, include a visual representation of the FHE execution analogous to the sketch in Zama's FHE image filter on Hugging Face.
3. Documentation and Tutorials
  1. Time estimation: 1 day
  2. Reward: 500 €
  3. Tasks:
    1. A Jupyter notebook, containing code examples and illustrations, showcasing the use of FHE for chemical data.
    2. Discussion on model accuracy versus computational costs for predictions.
  4. Deliverables: Notebooks with mentioned content
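
The model-selection step in milestone 1 and the accuracy versus cost discussion in milestone 3 could be organized along the following lines. This is a minimal sketch using only the standard library; the `Candidate` record, the `pick_model` helper, the tolerance, and all numbers are hypothetical placeholders, not results from this bounty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    error: float               # validation error, lower is better
    seconds_per_sample: float  # FHE prediction time per molecule


def pick_model(candidates, tolerance=0.05):
    """Return the fastest candidate whose error is within `tolerance`
    (relative) of the best error among all candidates."""
    if not candidates:
        raise ValueError("no candidates given")
    best_error = min(c.error for c in candidates)
    good_enough = [c for c in candidates if c.error <= best_error * (1 + tolerance)]
    return min(good_enough, key=lambda c: c.seconds_per_sample)


# Placeholder numbers for illustration only, not measured values.
candidates = [
    Candidate("linear", error=0.52, seconds_per_sample=0.2),
    Candidate("xgboost", error=0.50, seconds_per_sample=40.0),
    Candidate("decision_tree", error=0.70, seconds_per_sample=5.0),
]
chosen = pick_model(candidates)
```

With a tolerance of zero the helper falls back to the most accurate model regardless of speed, so the tolerance is the knob that encodes how much accuracy one is willing to trade for faster encrypted inference.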
References
- Our startup VaultChem, www.vaultchem.com
- ADME dataset, published in https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.3c00160
- Cheminformatics library, https://www.rdkit.org/
- FHE image filter, https://huggingface.co/spaces/zama-fhe/encrypted_image_filtering
- Paper by one of us (using MPC instead of FHE for a similar problem): https://iopscience.iop.org/article/10.1088/2632-2153/acc928/meta

Dec 04 '23 10:12 janweinreich

Hello janweinreich,

Thank you for your bounty proposition! Our team will review and add comments in your issue! In the meantime:

Join the FHE.org discord server for any questions (you’ll find a dedicated #zama-bounty-program channel).
Ask questions privately: [email protected].

Talk soon,

Dec 04 '23 10:12 zama-bot

Good news @janweinreich, this bounty is accepted! You can start the work, and as soon as you have completed a milestone, ping us on Discord to review it and reward the corresponding amount.

Dec 07 '23 13:12 aquint-zama

Thank you, getting started right away!

Dec 07 '23 15:12 janweinreich

Work is in full swing! However, we wanted to notify you that we are considering pivoting to a different target: instead of predicting the toxicity of a compound, we would like to predict properties as published in this paper (https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.3c00160).

For instance, how much of a substance is absorbed in the liver? The reasons we want to change the dataset are:

- Concerns about whether it is wise to build a model that can easily make a wrong prediction (nontoxic) for a toxic compound
- Poor quality of the dataset we were originally planning to use
- We can use linear models for the other dataset with reasonable accuracy (and they are much faster in FHE)

All other points of the proposal would remain unchanged!

Dec 20 '23 17:12 janweinreich

Your updates are fine with us. Could you update the main issue to reflect these changes? Thank you 🙏

Dec 26 '23 09:12 aquint-zama

Thank you, I made the minor changes to the main post of this issue. Most of the code is there; we just need to clean it up and document it well. I will contact you as soon as this is done!

Dec 28 '23 15:12 janweinreich

Hey Jan

Regarding the HF space: in general, it’s already very nice, very clear, very straight to the point, it’s an excellent bounty!

- Could you point to https://github.com/zama-ai/concrete-ml instead of docs.zama.ai/concrete-ml when you mention Concrete ML, please? (except when you explicitly want to link to the documentation)
- If you want an explanatory link for FHE, https://fhe.org/resources/ might be a good one
- I like your graphics! Maybe they are a bit too small; the font size looks much smaller than the HF font size
- I miss the “expected value” to check that things work correctly: would it be possible to add it somewhere, or to say whether we got the expected result? E.g., I tested HLM with the input molecule CC(=O)OC1=CC=CC=C1C(=O)O [I am not even sure if it was Aspirin or Ibuprofen, maybe you should log it] and I get “The Molecule CC(=O)OC1=CC=CC=C1C(=O)O has a HLM value of 0.31 (mL/min/kg)”, but I don’t know if that is good or not

Regarding https://github.com/vaultchem/molvault:

- There are some “ZAMAs” or “concrete-ml” typos: it’s “Zama” and “Concrete ML” (or Concrete-ML if you prefer), please. They appear in the README and in the notebooks; maybe a grep is needed :D
- Could you point to https://github.com/zama-ai/concrete-ml instead of docs.zama.ai/concrete-ml when you mention Concrete ML, please? (except when you explicitly want to link to the documentation)
- I like your https://github.com/vaultchem/molvault/blob/main/examples/tutorial/tutorial.ipynb! I would link to it in the README (you currently mention it without a link)
- You seem to be tied to concrete-ml==1.3.0 today; if I were you, I would give the fresh Concrete ML 1.4 a try, since tree-based models are about twice as fast with it

Once again: very nice work. I can’t wait until our marketing team can publish about it! Cheers

Feb 01 '24 13:02 bcm-at-zama

Thanks for your feedback!

I will get to the points as soon as possible. All the aspects about references and visual adjustments will be fixed.
- About the expected value: it is difficult to say, because the database is small. If a user tests an arbitrary molecule, we simply do not have a reference value to compare with. But I can add a few comparisons with molecules for which we have the data
- About the graphics: it was suggested to me to simplify them further and place them as the first element, before the text, because this is more likely to catch attention
- I will add a reference to the tutorial and also mention in the respective section that the reported timings were obtained with version 1.3.0 and that updating is recommended

Feb 01 '24 14:02 janweinreich

Thanks!

- But for the HF space, in the inputs we can only enter Aspirin or Ibuprofen, so you can already find expected values, no?
- For the graphic: sure, if you want our input, just say so
- For the timings and requirements.txt: don't you want to redo them with CML 1.4.0? Does it take too long? Things really should be much faster

Feb 01 '24 15:02 bcm-at-zama

Sure, I can check whether these molecules are in the test set and add the reference values.

No problem, I am rerunning the models with the new version of Concrete ML as I write. Timings will be updated accordingly.

Feb 02 '24 14:02 janweinreich

Thank you for your patience @bcm-at-zama !

Fixes

Fixed spelling
The "popular" molecules aspirin and ibuprofen are not in the dataset, so we cannot compare against reference values for them. Instead, I added a comparison of the predicted value with all the values in the dataset (see screenshot) to give insight into whether the predicted value is large or small

[Image: Screenshot from 2024-02-12 10-04-36]
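
The idea of placing a prediction within the distribution of dataset values can be sketched as a percentile rank. This is only an assumption about how such a comparison might be computed, not the code behind the screenshot; the helper names, the 25/75 thresholds, and the reference values are hypothetical placeholders rather than values from the ADME dataset.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(reference_values, prediction):
    """Percentage of reference values below the prediction (ties count half)."""
    if not reference_values:
        raise ValueError("reference_values must not be empty")
    ordered = sorted(reference_values)
    below = bisect_left(ordered, prediction)
    equal = bisect_right(ordered, prediction) - below
    return 100.0 * (below + 0.5 * equal) / len(ordered)


def describe(reference_values, prediction):
    """Turn the percentile rank into a short message for the user."""
    rank = percentile_rank(reference_values, prediction)
    if rank < 25:
        label = "low"
    elif rank > 75:
        label = "high"
    else:
        label = "typical"
    return f"{prediction:.2f} is {label} relative to the dataset (percentile rank {rank:.1f})"


# Placeholder reference values for illustration only.
dataset_values = [0.1, 0.4, 0.7, 1.0, 1.3, 1.6, 1.9, 2.2]
message = describe(dataset_values, 0.31)
```

A rank like this says how the prediction compares with the training data, not whether it is correct for the queried molecule.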

Added a timing comparison for CML 1.3 and 1.4 (see figure). Indeed, 1.4 provides a significant speedup for XGB; however, since a linear model is just as accurate in our case, we stay with the linear model. The figure shows the per-element prediction time (averaged over 10 samples) as a function of depth for different numbers of estimators. requirements.txt has been updated

[Figure: timing]

(script for timing test, https://github.com/vaultchem/molvault/blob/main/examples/huggingface/fit_fhe/timing_test/timing_FHE.py)
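
For readers who want to reproduce a per-element timing of this kind, a minimal standard-library harness might look as follows. It is not the linked script: `time_per_sample` and `dummy_predict` are hypothetical names, and the stand-in predictor would have to be replaced by a compiled model's prediction call (with Concrete ML presumably something like `model.predict(x, fhe="execute")`, which is an assumption here).

```python
import statistics
import time


def time_per_sample(predict, samples, repeats=1):
    """Return (mean seconds per single-sample prediction, list of timings)."""
    timings = []
    for sample in samples:
        start = time.perf_counter()
        for _ in range(repeats):
            predict(sample)
        timings.append((time.perf_counter() - start) / repeats)
    return statistics.mean(timings), timings


# Stand-in for a real model: a plain linear function with made-up weights.
def dummy_predict(features):
    return sum(w * f for w, f in zip((0.3, -1.2, 0.8), features))


# Ten samples, matching the averaging described above.
samples = [(float(i), 1.0, 2.0) for i in range(10)]
mean_seconds, per_sample = time_per_sample(dummy_predict, samples)
```

Multiplying `mean_seconds` by the number of molecules gives a first estimate of the cost of screening a whole library one sample at a time.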

Updated the HF space to CML 1.4. To allow users to test the XGB models, they were uploaded to the repo https://github.com/vaultchem/molvault/tree/main/examples/huggingface/app/models and the README was updated accordingly; it now also contains a link to the tutorial

Questions

The repo uses Zama's code, although unmodified. Still, according to https://github.com/zama-ai/concrete-ml/blob/main/LICENSE, do we have to include a copyright notice from Zama in the repository?

The idea was to publish the bounty repo under CC-BY. We want to make sure we comply with Zama's policy, including for future developments that may lead to commercial use.

The demo is a chance for the startup (VaultChem) to get in touch with potential customers. Assuming the bounty is approved, how and when does Zama plan to share the repository and the link to the demo, if at all? If possible, could you share a draft of the post ahead of time?

Thank you!

Feb 12 '24 09:02 janweinreich

For the expected value for Aspirin and Ibuprofen, sorry, but it's still not clear to me. Can't you find the real values somewhere on the internet and print them for comparison with the prediction?
For the rest: thanks

Feb 12 '24 16:02 bcm-at-zama