bounty-program
bounty-program copied to clipboard
Create an implementation of an SQL encrypted query on a clear database using TFHE-rs
Overview
TFHE-rs is a pure Rust implementation of TFHE for Boolean and integer arithmetics over encrypted data.
With this bounty, we are asking users to implement an SQL encrypted query on a clear database.
Here's a simplified overview of how FHE can be used to implement an SQL encrypted query on a clear database:
1. Query Encryption The first step involves the user encrypting their SQL query using their FHE secret key before sending it to the server hosting the database. The encryption ensures that the server cannot see the actual query, only its encrypted form.
2. Processing Encrypted Query on Clear Database The server then processes this encrypted query on the clear database. FHE computations mixing clear and encrypted data will return encrypted data, it means the server can perform the necessary SQL operations (like SELECT, JOIN, etc.) using the clear data as well as the encrypted query. The result of this operation is an encrypted version of the query result.
3. Decrypting the Result Finally, the encrypted result is sent back to the user, who can decrypt it using their private key to obtain the clear text result of the SQL query.
We ask that you, the hunters, implement a database that can manipulate signed and unsigned integers (8, 16, 32, 64 bits), booleans and short strings (32 ASCII characters) as possible data types for a column.
As we are not looking at a classical database use case, we don’t expect the database to be resilient to crashes, be highly available etc. it is merely a datastore addressed in a similar way to a database via an encrypted SQL query.
How to participate?
1️⃣ Register here. 2️⃣ When ready, submit your code here. 🗓️ Submission deadline: May 12th, 2024.
Description
Expected Work
The implementation should live in its own crate (and not in tfhe-rs/tfhe/example) and depend on either tfhe-rs 0.5.x or the main branch of tfhe-rs.
We ask that your crate passes the clippy checks when treating warnings as errors.
cargo clippy -- --no-deps -D warnings
We ask you to provide both an executable and an API.
Executable
For the clear database storage you can use existing databases or tools but we require that you manage to load a database from a directory with the following structure:
db_dir
- table_1.csv
- table_2.csv
example of a table_content.csv:
column_1_name:uint32,colum_2_name:bool,column_3_name:string
0,true,"first line"
123,false,"some other line"
The executable takes as input a database to load with the format discussed earlier and a file containing the SQL query in plaintext to avoid dealing with escaping quotes on the command line, we require the following format for the output:
$ cargo run --release -- --input-db /path/to/db_dir --query-file query.txt
Runtime: 42.24 s
Clear DB query result: (some result)
Encrypted DB query result: (some result)
Results match: YES
erroneous results will automatically disqualify a submission.
API
You are free to use the High Level API with CPU or GPU or the integer API with CPU or GPU.
We only expect you to implement ONE API, implementing more APIs won’t improve you submission’s classification, correctness is mandatory, incorrect implementations will be disqualified, performance will be the main differentiator for submissions.
For the encryption we ask that you use a Select query from the sqlparser crate as input. You may choose another type if you explain how it's a better choice. You must return your own EncryptedQuery representation that can be used on the device of your choice.
There are 3 possible APIs to implement
- CPU Integer API (uses tfhe::integer::ServerKey, an EncryptedQuery, a Table struct representing the clear database),
- GPU Integer API (uses tfhe::integer::CudaServerKey and a CudaEncryptedQuery, a Table struct representing the clear database)
- High-Level API CPU/GPU (uses tfhe::ServerKey, EncryptedQuery built on HL API types, a Table struct representing the clear database)
The APIs are specified in detail below, choose the one(s) you wish to implement (only implement methods of the API you chose)
CPU Integer API
fn default_cpu_parameters() -> PBSParameters;
fn encrypt_query(query: sqlparser::ast::Select) -> EncryptedQuery;
/// Loads a directory with a structure described above in a format
/// that works for your implementation of the encryted query
fn load_tables(path) -> Tables
/// # Inputs:
/// - sks: The server key to use
/// - input: your EncryptedQuery
/// - tables: the plain data you run the query on
///
/// # Output
/// - EncryptedResult
fn run_fhe_query(
sks: &tfhe::integer::ServerKey,
input: &EncryptedQuery,
data: &Tables,
) -> EncrypedResult;
/// The output of this function should be a string using the CSV format
/// You should provide a way to compare this string with the output of
/// the clear DB system you use for comparison
fn decrypt_result(clientk_key: &ClientKey, result: &EncryptedResult) -> String
GPU Integer API
fn default_gpu_parameters() -> PBSParameters;
fn encrypt_query_gpu(query: sqlparser::ast::Select) -> CudaEncryptedQuery;
/// Loads a directory with a structure described above in a format
/// that works for your implementation of the encryted query
fn load_tables(path) -> Tables
/// # Inputs:
/// - sks: The server key to use
/// - input: your CudaEncryptedQuery
/// - tables: the plain data you run the query on
///
/// # Output
/// - EncryptedResult
fn run_fhe_query_gpu(
sks: &tfhe::integer::ServerKey,
input: &CudaEncryptedQuery,
data: &Tables,
) -> EncrypedResult;
/// The output of this function should be a string using the CSV format
/// You should provide a way to compare this string with the output of
/// the clear DB system you use for comparison
fn decrypt_result_gpu(clientk_key: &ClientKey, result: &CudaEncryptedResult) -> String
High-Level API
/// Returns parameters and the device (CudaGpu/Cpu) on which the function /// should be ran
fn default_parameters() -> (PBSParameters, tfhe::Device);
fn load_tables(path) -> Tables
fn encrypt_query(query: String) -> EncryptedQuery;
// The server key is provided to allow setting up threads if needed
fn fhe_query_gpu(
sks: &tfhe::integer::ServerKey,
input: &EncryptedQuery,
data: &Tables,
) -> EncrypedResult;
fn decrypt_result(clientk_key: &ClientKey, result: &EncrypedResult) -> String
SQL support
We ask to be able to run the following encrypted queries:
Requested: SQL Select: https://www.w3schools.com/sql/sql_select.asp SQL Select Distinct: https://www.w3schools.com/sql/sql_distinct.asp SQL Where: https://www.w3schools.com/sql/sql_where.asp SQL And: https://www.w3schools.com/sql/sql_and.asp SQL Or: https://www.w3schools.com/sql/sql_or.asp SQL Not: https://www.w3schools.com/sql/sql_not.asp SQL In: https://www.w3schools.com/sql/sql_in.asp SQL Between: https://www.w3schools.com/sql/sql_between.asp
For integer values we expect you to manage the <, <=, >, >=, = operators For strings we expect you to manage the = operator only
Note that the != (not equal) operator is built via the NOT operator of SQL and an = operator.
Note that some operators may re-use other implementations, for example the IN operator is indicated to be equivalent to multiple OR operators.
Bonus API:
SQL Joins: https://www.w3schools.com/sql/sql_join.asp
The format of the encrypted query and result are free, the evaluation of this bounty will be on speed and correctness on the Requested APIs, bonus APIs are NOT required and would only be used as a last resort to differentiate two submissions.
Other Details
Benchmarking
Implementations will be benchmarked on the following AWS hardware: CPU: hpc7a.96xlarge (192 vCPU, 768 GB Memory) GPU: p3.2xlarge (note that TFHE-rs only supports single GPU execution for now) (1 Tesla V100, 16 GB GPU Memory)
Parameters
The allowed parameters are:
PARAM_MESSAGE_2_CARRY_2_KS_PBS, PARAM_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_2_KS_PBS, PARAM_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS
Your solution must be able to run with the multi bit and non multi bit parameter sets, here it should be straightforward as the message and carry moduli are the same.
Note that the multi bit parameters should be faster on GPU, namely the group 3 should be the fastest. It could be the case on CPU as well but it can saturate even the largest CPU machines with threads and performance suffers as a consequence.
Turtle shell queries
As Mario Kart has its turtle shells, you will be able, if you want to, to provide one example of a small DB (with the directory and csv structure discussed earlier) and a “turtle shell” query you feel is very hard/has corner cases alongside your submission, we will run the DB and the query on the submissions we gather, corner cases beware! The performance and correctness on these hunter provided use cases will be used for the evaluation of the bounty, in addition to cases we’ll choose ourselves, we ask that the query runs in under 5 minutes on a laptop/commodity hardware, though our benchmarking machine are larger we want to keep runtimes reasonable when possible.
Good luck!
Reward
🥇Best submission: up to €5,000
To be considered best submission, a solution must be efficient, effective and demonstrate a deep understanding of the core problem. Alongside the technical correctness, it should also be submitted with a clean code, clear explanations and a complete documentation.
🥈Second-best submission: up to €3,000
For a solution to be considered the second best submission, it should be both efficient and effective. The code should be neat and readable, while its documentation might not be as exhaustive as the best submission, it should cover the key aspects of the solution.
🥉Third-best submission: up to €2,000
The third best submission is one that presents a solution that effectively tackles the challenge at hand, even if it may have certain areas of improvement in terms of efficiency or depth of understanding. Documentation should be present, covering the essential components of the solution.
Reward amounts are decided based on code quality and speed performance on a m6i.metal AWS server.
Related links and references
Questions?
Do you have a specific question about this bounty? Simply comment this issue and our team will get back to you!
Can we assume that the encrypted query is a tuple of multiple encrypted components parsed from the original SQL query?
I saw a similar approach in the following paper for an equality comparison. Doing so will likely mean loosing some privacy with regard to access patterns. However, we can likely also deduce an access pattern based on using a clear database.
https://eprint.iacr.org/2015/1176.pdf
@BraunPhilipp the nature of the query should remain concealed, the format after encryption is up to you
fn decrypt_result(result: &EncryptedResult) -> String probably needs a decryption key as input?
As the result is a String, I assume the output is expected to be a string representation of whatever data has been selected?
For example, if we run SELECT column_1_name, colum_2_name FROM table_content WHERE column_3_name="first line" on the example table above (table_content.csv), the output would be "0, true". Correct?
fn decrypt_result(result: &EncryptedResult) -> Stringprobably needs a decryption key as input?As the result is a
String, I assume the output is expected to be a string representation of whatever data has been selected?For example, if we run
SELECT column_1_name, colum_2_name FROM table_content WHERE column_3_name="first line"on the example table above (table_content.csv), the output would be"0, true". Correct?
you are right, that's an oversight, decryption needs the secret key, the string is just whatever equivalent form of the data to easily compare with your cleartext query (my guess is it's easier to stringify the result of the clear query rather than putting the encrypted result formatted as a DB entry)
If there is multiple rows that match, should we support returning multiple rows? Or just return one of the matching rows?
If there is multiple rows that match, should we support returning multiple rows? Or just return one of the matching rows?
has to match the expected SQL behavior, so you should run the clear query on the SQL system you use as reference and check they are the same, so if the clear query returns several rows, then return the several rows
I don't think this works. You would need to hold enough space in the return value to hold the complete database. Reason: You may as well select the whole database, but the server can't know because it doesn't see the query. So it always needs to reserve enough space to potentially fit the whole DB into the result. Or am I getting something wrong?
I don't think this works. You would need to hold enough space in the return value to hold the complete database. Reason: You may as well select the whole database, but the server can't know because it doesn't see the query. So it always needs to reserve enough space to potentially fit the whole DB into the result. Or am I getting something wrong?
No I think that if you do it to be completely oblivious that's correct, that's what we are looking at for now, we don't expect to run queries on super large DBs, I mean we know some implementations are impractical though they work (like the ecdsa signature we did a while ago)
OK, just to be clear: So you actually expect the encrypted result to be large enough to hold the whole DB?
Indeed, that seems to get impractical very quickly considering FHE ciphertext sizes. And thinking about it further: If you always return something that is at least the size of the DB, couldn't you just return the full DB always and let the client do the searching? (If it can handle downloading the full DB, it can surely handle searching on it.) 🤔
fully agree, let's say the ideal setting is that the data is too large to download, but you need to be able to query it without the holder seeing what you query
obviously in an FHE setting because of the ciphertext size it's insane, but there should be improvements on that end, the remaining issue is just that such a query performed in a fully oblivous manner looks to be infeasible while returning reasonably sized data
I guess future bounties or research avenues is to make leaky queries with a protocole allowing to restrict the size of the data returned, but that's not the object of this bounty :)
here it's a kind of "let's make a toy sized bounty of something we would ideally want to do but that might remain impractical"
Also perhaps for a very smart SQL query the integer layer is not the one to use, but asking some deep crypto knowledge/implems (like dropping to the LWE/GLWE level) is not what bounties are about but more about using the lib, seeing what's doable and what's not doable
A middle ground could be accepting a cleartext LIMIT on every query:
SELECT * FROM table LIMIT 3
https://www.w3schools.com/sql/sql_top.asp
(Note that LIMIT seems to be MySQL syntax, whereas SQL TOP seems to be the general standard.)
has to match the expected SQL behavior, so you should run the clear query on the SQL system you use as reference and check they are the same
Note that SQL results are not guaranteed to have any specific order (if you don't use order by). Hence, you could get different outputs on two different runs of two correct implementations.
hash maps to the rescue for the unspecified order, I don't know about the LIMIT one but we will not feed huge DBs for the bounties, as computation will likely be impractical in terms of runtime anyways
@BraunPhilipp the nature of the query should remain concealed, the format after encryption is up to you
To clarify the expected work:
- Are you also expecting the operators like AND, OR, >, <, =, etc. to be encrypted as well ?
- Must "SELECT DISTINCT" be server-side executed only (no post-processing on the client side) ?
execution on the server side for the select distinct, otherwise a way to answer to all queries is to return the whole database under the secret key of the user and let the client process the data, you can see how it's a bit of a loop hole if we accept post processing on the client side
we expect the query to be encrypted, I don't think we have anything about hiding the length of the query for this bounty
- OK : No post-processing what so ever on the client-side
- If operators are also encrypted (I guess, this is what you are expecting), then the return value from the server must be some sort of NxP sparse-matrix (where N=number of rows of the DB and P=number of columns of the DB), right ?. It would be up to the client to extract the final result, or am I missing something ?
yes for the sparse part you are correct, the FHE server won't be able to diminish the size of the output due to the oblivious nature of the processing, having the client parse the result and ignore rows/results marked as empty is fair game
- Should types of operands be encrypted as well ? (I guess, it should be encrypted)
- If so, the server side code should only manipulate vectors of FhUint8 (ie: the 'EncryptedQuery' struct mentioned above as an example should only contain one single FHE type)? (I'm afraid the final program may practically be running on only tiny tiny DBs)
Format of the EncryptedQuery is free, as long as the query is encrypted and therefore protected from prying eyes
-
Are we allowed to encode SQL Query numerical values used multiple times with the same FHE-encrypted value (memory-wise)? For example:
SELECT CustomerName FROM Customers WHERE NbOrders > 0 AND TotalPayments > 0The integer0appears 2 times. Using the same encrypted value sent from the client to the server would allow additional optimisations on the server. -
Are we allowed to encrypt SQL Query values using the smallest possible
FheUinttype individually for each value ? Doing so may leak value range information for each value parameter sent from the client to the server. However, this would allow additional optimisations on the server. For example:SELECT CustomerName FROM Customers WHERE NbOrders > 0 AND TotalPayments > 12000can we encode0as aFheUint2and12000as aFheUint16? or do we have to encode every numerical values using the same exact uniform FHE type.
Hmm the encrypted values should match the type of the column I think to hide the encrypted range ?
Hmm the encrypted values should match the type of the column I think to hide the encrypted range ?
Sure, that would obviously hide any range.
But, that would require the client to know the exact type of each individual columns in the table (an any other tables in case of JOIN), right ?
What about question 1) ?
ok that's a good point, I would guess the header of the DB is known ? So that the user knows how to encrypt the data to send
As for 1, I'm not sure yet, will discuss to see what other think internally as to what we want
for 1: redundant encryption to avoid leaking the fact some data is the same
After having thought about your answer, it looks to me that a small leaking problem still persists.
Suppose, for example, some table contains only one single column with 'u64' data in it. If the server encounters a FheUint64 somewhere in the encrypted query, the server knows by default which column the client is scanning. That's great from a server-side-optimisation standpoint, but may not be exactly what you are looking for ?
One possible solution would be to encrypt every single SQL Query element using FheUint256 (to include 32-characters ASCII short strings) therefore avoiding any leakage.
I guess you are trying to test the scenario of a SQL server having zero-idea of what the client is looking up in the tables, right? (Confidential SQL query, a very interesting scenario BTW)
ah that's a good point, which is super annoying because then it means that anything will always be in the very worst case...
also it will be super slow :|
ah that's a good point, which is super annoying because then it means that anything will always be in the very worst case...
also it will be super slow :|
No ! probably not.
Downcasting looks super fast (last time I checked), it should be fine (because the server knows the type of the column). However, some pre-processing is required on the client-side before running the query, like throwing errors and check data overflow. It would be much too costly on the server.
Besides, those nice little optimisations I was counting on (which were the subjects of my first 2 questions) are gone ...
if the server does not know what columns to look at, it will have to use FheUint256 anyways no ?
Any idea of what could be the benchmark results on 'FheBool', 'FheUint2', 'FheUint4' using Bitwise ops on AMD EPYC 9R14 CPU @ 2.60GHz and 740GB of RAM ? There seem to be missing in your online doc ? https://docs.zama.ai/tfhe-rs/getting-started/benchmarks#boolean on I am missing something ?
Boolean is the boolean API
For an FheBool and an FheUint2 you should have about 12 ms on our bench machine, FheUint4 should be only slighlty slower