Privacy preserving prediction of molecular properties
Category: Application
Overview
We at VaultChem, a startup combining encryption and chemistry, aim to use Zama's Concrete ML library for FHE inference of molecular properties. Consider a scenario where pharma company A is interested in predicting properties of candidate molecules before phase I clinical trials.
In particular, understanding the processes of absorption, distribution, metabolism, and excretion (ADME) is crucial for determining a drug candidate's concentration profile at its action site, significantly impacting the drug's effectiveness.
However, A does not have sufficient data available for reliable ML predictions. Instead, A will securely obtain predictions on its molecular data from an untrusted party B that owns a secret database and an ML model trained on sufficient data. This is only possible using FHE, which guarantees that party A does not reveal its secret query to party B.
We will simulate this scenario using open-source chemistry datasets. We will provide tools (based on the cheminformatics library RDKit and on Concrete ML) that give an end-to-end solution to the problem of privacy-preserving prediction. We will deploy the app to Hugging Face (similar to the FHE image filter) and provide detailed tutorials/notebooks that explain each step. Finally, by comparing against scikit-learn implementations, we will also investigate the trade-off between accuracy and computational cost, since computational screening in cheminformatics may require fast predictions on thousands of molecules. We will provide an outlook on how to account for the increased computational cost of FHE inference in the case of molecular data.
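To make the screening-cost concern concrete, here is a minimal back-of-the-envelope sketch in Python. The per-molecule latencies and the worker count are purely hypothetical placeholders, not measurements from this project, and the function name is an assumption made for illustration.

```python
import math


def screening_time_seconds(n_molecules, seconds_per_molecule, n_workers=1):
    """Estimate wall-clock time to screen a library of molecules.

    Assumes each worker handles one molecule at a time and that the
    work splits evenly across workers (a simplification).
    """
    if n_molecules < 0 or seconds_per_molecule < 0 or n_workers < 1:
        raise ValueError("invalid screening parameters")
    molecules_per_worker = math.ceil(n_molecules / n_workers)
    return molecules_per_worker * seconds_per_molecule


# Hypothetical placeholder latencies (NOT measured values).
CLEAR_LATENCY_S = 0.001  # assumed cleartext prediction time per molecule
FHE_LATENCY_S = 1.0      # assumed FHE prediction time per molecule

clear_total = screening_time_seconds(10_000, CLEAR_LATENCY_S)
fhe_total = screening_time_seconds(10_000, FHE_LATENCY_S, n_workers=8)

print(f"cleartext, 1 worker: {clear_total:.1f} s")
print(f"FHE, 8 workers:      {fhe_total:.1f} s")
```

Under these assumptions the FHE overhead can be partly offset by running predictions for different molecules in parallel, since each query is independent.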
- Total Reward: 3500 € (split by milestones)
Description
- The main goal is to implement a toolchain that allows users to input molecular data and use Zama's Concrete ML library to predict the properties of these molecules. The aim is to show the feasibility of Concrete ML in the field of chemistry and pharmaceutical data, and to make a demo available on Hugging Face.
Milestones
- Data processing and identification of best-performing model in Concrete ML
- Time estimation: 3 days
- Reward: 2000 €
- Tasks:
- Prepare processing of chemical data using RDKit for use with ML.
- Compare accuracy and prediction time for the various built-in Concrete ML models, identify models suitable for deployment, and train them for FHE execution.
- Deliverables: Scripts for data preparation and training. Save the identified best model.
- Model deployment on Hugging Face
- Time estimation: 2 days
- Reward: 1000 €
- Tasks:
- Deploy the model to our Hugging Face space with simple user input, following the same logic as in the FHE image filter. The user inputs the molecules in SMILES format ([Simplified molecular-input line-entry system](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system)).
- Add visual elements that explain FHE in the case of molecular property prediction
- Deliverables: All files needed for the Hugging Face space, adapted to run the model identified in the previous step. Interface written in Gradio. Input molecules from the user will be displayed graphically. In addition, include a visual representation of the FHE execution, analogous to the sketch in Zama's FHE image filter on Hugging Face.
- Documentation and Tutorials
- Time estimation: 1 day
- Reward: 500 €
- Task:
- A Jupyter notebook, containing code examples and illustrations, showcasing the use of FHE for chemical data.
- Discussion on model accuracy versus computational costs for predictions.
- Deliverables: Notebooks with mentioned content
References
- Our Startup VaultChem, www.vaultchem.com
- ADME dataset, published in https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.3c00160
- Cheminformatics library, https://www.rdkit.org/
- FHE image filter, https://huggingface.co/spaces/zama-fhe/encrypted_image_filtering
- Paper by one of us (using MPC instead of FHE for a similar problem): https://iopscience.iop.org/article/10.1088/2632-2153/acc928/meta
Hello janweinreich,
Thank you for your bounty proposition! Our team will review and add comments in your issue! In the meantime:
- Join the FHE.org discord server for any questions (you'll find a dedicated #zama-bounty-program channel).
- Ask questions privately: [email protected].
Talk soon,
Good news @janweinreich, this bounty is accepted! You could start the work and as soon as you have a milestone complete, ping us on discord to review it and reward the corresponding amount.
Thank you, getting started right away!
Work is in full swing! However, we wanted to notify you that we are considering pivoting to a different target: instead of predicting the toxicity of a compound, we would like to predict properties as published in this paper (https://pubs.acs.org/doi/pdf/10.1021/acs.jcim.3c00160).
For instance, how much of a substance is absorbed in the liver? The reasons we want to change the dataset are:
- Concerns about whether it is wise to build a model that can easily make a wrong prediction (nontoxic) for a toxic compound
- Poor quality of the dataset we were originally planning to use
- We can use linear models for the other dataset with reasonable accuracy (and they are much faster under FHE)
All other points of the proposal would remain unchanged!
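One reason linear models tend to be cheap under FHE is that, once inputs and weights are quantized to integers, a prediction reduces to an integer dot product, that is, additions and multiplications by constants. The sketch below illustrates this idea in plain Python on clear (unencrypted) data. It is not Concrete ML's actual quantization scheme; the scale factors, the toy weights, and the helper names are assumptions chosen for illustration.

```python
def quantize(values, scale):
    """Map floats to integers by rounding values / scale."""
    return [round(v / scale) for v in values]


def quantized_linear_predict(x, weights, bias, x_scale, w_scale):
    """Linear prediction where the dot product uses integers only.

    Only the final rescaling and the bias are applied in floating point.
    """
    q_x = quantize(x, x_scale)
    q_w = quantize(weights, w_scale)
    # Integer-only accumulation: the part that would run on encrypted data.
    acc = sum(a * b for a, b in zip(q_x, q_w))
    return acc * x_scale * w_scale + bias


# Toy, made-up example: a 4-bit fingerprint and arbitrary weights.
x = [1.0, 0.0, 1.0, 1.0]
weights = [0.25, -0.5, 0.125, 0.75]
bias = 0.1

float_pred = sum(a * b for a, b in zip(x, weights)) + bias
quant_pred = quantized_linear_predict(x, weights, bias,
                                      x_scale=1.0, w_scale=1 / 64)

print(f"float: {float_pred:.4f}  quantized: {quant_pred:.4f}")
```

Binary fingerprints are already integers, so in this toy setting only the weights lose precision when quantized.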
Your updates are fine with us. Could you update the main issue to reflect these changes? Thank you!
Thank you, I made the minor changes to the main post of this issue. Most of the code is there; we just need to clean it up and document it well. I will contact you as soon as this is done!
Hey Jan
Regarding the HF space: in general, it's already very nice, very clear, very straight to the point, it's an excellent bounty!
- could you point to https://github.com/zama-ai/concrete-ml instead of docs.zama.ai/concrete-ml when you speak about Concrete ML, please? (except when you explicitly want to link to the documentation)
- if you want an explanation link for FHE, maybe https://fhe.org/resources/ would be a good link
- I like your graphics! maybe they are a bit too small, the font size looks much smaller than the HF font size
- I miss the "expected value", to check that things happen correctly: would it be possible to add it somewhere, or to say whether we have the expected result? e.g., I tested HLM with input molecule CC(=O)OC1=CC=CC=C1C(=O)O [I am not even sure if it was Aspirin or Ibuprofen, maybe you should log it] and I get "The Molecule CC(=O)OC1=CC=CC=C1C(=O)O has a HLM value of 0.31 (mL/min/kg)", but I don't know if it's good or not
Regarding https://github.com/vaultchem/molvault:
- there are some "ZAMAs" or "concrete-ml" typos: it's "Zama" and "Concrete ML" (or Concrete-ML if you prefer) please. In the README, in the notebooks, maybe a grep would be needed :D
- could you point to https://github.com/zama-ai/concrete-ml instead of docs.zama.ai/concrete-ml when you speak about Concrete ML, please? (except when you explicitly want to link to the documentation)
- I like your https://github.com/vaultchem/molvault/blob/main/examples/tutorial/tutorial.ipynb ! I would link to it in the README (you currently mention it without a link)
- you seem to be tied to concrete-ml==1.3.0 today; if I were you, I would give the fresh Concrete ML 1.4 a try, since tree-based models are about twice as fast with it
Once again: very nice work that you've done, I can't wait until our marketing team can publish about it! Cheers
Thanks for your feedback!
- I will get to these points as soon as possible. All the issues regarding references and visual adjustments will be fixed.
- About the expected value: it is difficult to say, because the database is small. If a user tests an arbitrary molecule, we simply have no reference value to compare with. But I can add a few comparisons for molecules for which we have the data.
- About the graphics: it was suggested to me to simplify them further and place them as the first element before the text, because this is more likely to catch attention.
- I will add a reference to the tutorial and also mention in the respective section that the reported timings were obtained with version 1.3.0 and that updating is recommended.
Thanks!
- but for the HF space, in the inputs, we can only put Aspirin or Ibuprofen, so you can already find expected values, no?
- for the graphic: sure, let us know if you want our input
- for the timings and requirements.txt: don't you want to redo them with CML 1.4.0? Would it take too long? Things really should be much faster
Sure, I can check whether these molecules are in the test set and add the reference values.
No problem, I am rerunning the models as I write, with the new version of Concrete ML. Timings will be updated accordingly.
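A per-sample timing comparison of this kind can be sketched roughly as follows. This is not the timing script from the repository; `mean_prediction_time` and `dummy_predict` are hypothetical names, and `dummy_predict` merely stands in for a real model's prediction call.

```python
import statistics
import time


def mean_prediction_time(predict, samples, n_samples=10):
    """Average wall-clock time of predict(sample) over the first n_samples."""
    timings = []
    for sample in samples[:n_samples]:
        start = time.perf_counter()
        predict(sample)
        timings.append(time.perf_counter() - start)
    if not timings:
        raise ValueError("no samples to time")
    return statistics.mean(timings)


def dummy_predict(sample):
    """Hypothetical stand-in for a model's prediction on one sample."""
    return sum(sample)


samples = [[float(i), float(i + 1)] for i in range(20)]
avg = mean_prediction_time(dummy_predict, samples)
print(f"mean prediction time: {avg:.2e} s per sample")
```

Running the same helper once per library version, with the real prediction function passed as `predict`, gives directly comparable per-sample averages.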
Thank you for your patience @bcm-at-zama !
Fixes
- Fixed spelling.
- The "popular" molecules aspirin and ibuprofen are not in the dataset, so we cannot compare against reference values for them. Instead, I added a comparison of the predicted value with all the values in the dataset (see screenshot), which gives an idea of whether the predicted value is large or small.
- Added a timing comparison for CML 1.3 and 1.4 (see figure). Indeed, 1.4 provides a significant speedup for XGB; however, since a linear model is just as accurate in our case, we stay with it. The figure shows the per-element prediction time (averaged over 10 samples) as a function of depth for different numbers of estimators. requirements.txt updated (script for the timing test: https://github.com/vaultchem/molvault/blob/main/examples/huggingface/fit_fhe/timing_test/timing_FHE.py)
- Updated the HF space to CML 1.4. To allow users to test the XGB models, they were uploaded to the repo https://github.com/vaultchem/molvault/tree/main/examples/huggingface/app/models and the README was updated accordingly; it now also contains a link to the tutorial.
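The idea of placing a prediction within the distribution of dataset values can be sketched as a percentile rank. The snippet below is only an illustration of that idea, not the code used in the HF space; the reference values are made up and are NOT taken from the ADME dataset.

```python
from bisect import bisect_right


def percentile_rank(prediction, reference_values):
    """Percentage of reference values that are <= prediction."""
    if not reference_values:
        raise ValueError("reference_values must not be empty")
    ordered = sorted(reference_values)
    return 100.0 * bisect_right(ordered, prediction) / len(ordered)


# Hypothetical reference values (NOT from the ADME dataset).
reference = [0.1, 0.2, 0.3, 0.5, 0.8, 1.2, 1.9, 2.4]

rank = percentile_rank(0.31, reference)
print(f"prediction is at or above {rank:.1f}% of the reference values")
```

A low rank would indicate a comparatively small predicted value, a high rank a comparatively large one.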
Questions
- The repo uses Zama's code, although unmodified. Still, according to https://github.com/zama-ai/concrete-ml/blob/main/LICENSE, do we have to include a copyright notice from Zama in the repository?
The idea was to publish the repo of the bounty under CC-BY. We want to make sure we comply with Zama's policy, including for future developments that may lead to commercial use.
- The demo is a chance for the startup (VaultChem) to get in touch with potential customers. Assuming the bounty is approved, how and when does Zama plan, if at all, to share the repository and the link to the demo? If possible, could you share a draft of the post ahead of time?
Thank you!
- For the expected value for Aspirin and Ibuprofen: sorry, but it's still not clear to me. Can't you find the real values somewhere on the internet and print them for comparison with the prediction?
- For the rest: thanks