bounty-program icon indicating copy to clipboard operation
bounty-program copied to clipboard

Create an implementation of an SQL encrypted query on a clear database using TFHE-rs

Open zaccherinij opened this issue 1 year ago • 124 comments

Overview

TFHE-rs is a pure Rust implementation of TFHE for Boolean and integer arithmetics over encrypted data.

With this bounty, we are asking users to implement an SQL encrypted query on a clear database.

Here's a simplified overview of how FHE can be used to implement an SQL encrypted query on a clear database:

1. Query Encryption The first step involves the user encrypting their SQL query using their FHE secret key before sending it to the server hosting the database. The encryption ensures that the server cannot see the actual query, only its encrypted form.

2. Processing Encrypted Query on Clear Database The server then processes this encrypted query on the clear database. FHE computations mixing clear and encrypted data will return encrypted data, it means the server can perform the necessary SQL operations (like SELECT, JOIN, etc.) using the clear data as well as the encrypted query. The result of this operation is an encrypted version of the query result.

3. Decrypting the Result Finally, the encrypted result is sent back to the user, who can decrypt it using their private key to obtain the clear text result of the SQL query.

We ask that you, the hunters, implement a database that can manipulate signed and unsigned integers (8, 16, 32, 64 bits), booleans and short strings (32 ASCII characters) as possible data types for a column.

As we are not looking at a classical database use case, we don’t expect the database to be resilient to crashes, be highly available etc. it is merely a datastore addressed in a similar way to a database via an encrypted SQL query.

How to participate?

1️⃣ Register here. 2️⃣ When ready, submit your code here. 🗓️ Submission deadline: May 12th, 2024.

Description

Expected Work

The implementation should live in its own crate (and not in tfhe-rs/tfhe/example) and depend on either tfhe-rs 0.5.x or the main branch of tfhe-rs.

We ask that your crate passes the clippy checks when treating warnings as errors.

cargo clippy -- --no-deps -D warnings

We ask you to provide both an executable and an API.

Executable

For the clear database storage you can use existing databases or tools but we require that you manage to load a database from a directory with the following structure:

db_dir

  • table_1.csv
  • table_2.csv

example of a table_content.csv:

column_1_name:uint32,colum_2_name:bool,column_3_name:string
0,true,"first line"
123,false,"some other line"

The executable takes as input a database to load with the format discussed earlier and a file containing the SQL query in plaintext to avoid dealing with escaping quotes on the command line, we require the following format for the output:

$ cargo run --release -- --input-db /path/to/db_dir --query-file query.txt
Runtime: 42.24 s
Clear DB query result: (some result)
Encrypted DB query result: (some result)
Results match: YES

erroneous results will automatically disqualify a submission.

API

You are free to use the High Level API with CPU or GPU or the integer API with CPU or GPU.

We only expect you to implement ONE API, implementing more APIs won’t improve you submission’s classification, correctness is mandatory, incorrect implementations will be disqualified, performance will be the main differentiator for submissions.

For the encryption we ask that you use a Select query from the sqlparser crate as input. You may choose another type if you explain how it's a better choice. You must return your own EncryptedQuery representation that can be used on the device of your choice.

There are 3 possible APIs to implement

  • CPU Integer API (uses tfhe::integer::ServerKey, an EncryptedQuery, a Table struct representing the clear database),
  • GPU Integer API (uses tfhe::integer::CudaServerKey and a CudaEncryptedQuery, a Table struct representing the clear database)
  • High-Level API CPU/GPU (uses tfhe::ServerKey, EncryptedQuery built on HL API types, a Table struct representing the clear database)

The APIs are specified in detail below, choose the one(s) you wish to implement (only implement methods of the API you chose)

CPU Integer API

fn default_cpu_parameters() -> PBSParameters;

fn encrypt_query(query: sqlparser::ast::Select) -> EncryptedQuery;

/// Loads a directory with a structure described above in a format
/// that works for your implementation of the encryted query
fn load_tables(path) -> Tables

/// # Inputs:
/// - sks: The server key to use
/// - input: your EncryptedQuery
/// - tables: the plain data you run the query on
///
/// # Output
/// - EncryptedResult
fn run_fhe_query(
    sks: &tfhe::integer::ServerKey, 
    input: &EncryptedQuery,
    data: &Tables,
) -> EncrypedResult;

/// The output of this function should be a string using the CSV format
/// You should provide a way to compare this string with the output of
/// the clear DB system you use for comparison
fn decrypt_result(clientk_key: &ClientKey, result: &EncryptedResult) -> String

GPU Integer API

fn default_gpu_parameters() -> PBSParameters;

fn encrypt_query_gpu(query: sqlparser::ast::Select) -> CudaEncryptedQuery;

/// Loads a directory with a structure described above in a format
/// that works for your implementation of the encryted query
fn load_tables(path) -> Tables

/// # Inputs:
/// - sks: The server key to use
/// - input: your CudaEncryptedQuery
/// - tables: the plain data you run the query on
///
/// # Output
/// - EncryptedResult
fn run_fhe_query_gpu(
    sks: &tfhe::integer::ServerKey, 
    input: &CudaEncryptedQuery,
    data: &Tables,
) -> EncrypedResult;

/// The output of this function should be a string using the CSV format
/// You should provide a way to compare this string with the output of
/// the clear DB system you use for comparison
fn decrypt_result_gpu(clientk_key: &ClientKey, result: &CudaEncryptedResult) -> String

High-Level API

/// Returns parameters and the device (CudaGpu/Cpu) on which the function /// should be ran
fn default_parameters() -> (PBSParameters, tfhe::Device);

fn load_tables(path) -> Tables

fn encrypt_query(query: String) -> EncryptedQuery;

// The server key is provided to allow setting up threads if needed
fn fhe_query_gpu(
    sks: &tfhe::integer::ServerKey, 
    input: &EncryptedQuery,
    data: &Tables,
) -> EncrypedResult;

fn decrypt_result(clientk_key: &ClientKey, result: &EncrypedResult) -> String

SQL support

We ask to be able to run the following encrypted queries:

Requested: SQL Select: https://www.w3schools.com/sql/sql_select.asp SQL Select Distinct: https://www.w3schools.com/sql/sql_distinct.asp SQL Where: https://www.w3schools.com/sql/sql_where.asp SQL And: https://www.w3schools.com/sql/sql_and.asp SQL Or: https://www.w3schools.com/sql/sql_or.asp SQL Not: https://www.w3schools.com/sql/sql_not.asp SQL In: https://www.w3schools.com/sql/sql_in.asp SQL Between: https://www.w3schools.com/sql/sql_between.asp

For integer values we expect you to manage the <, <=, >, >=, = operators For strings we expect you to manage the = operator only

Note that the != (not equal) operator is built via the NOT operator of SQL and an = operator.

Note that some operators may re-use other implementations, for example the IN operator is indicated to be equivalent to multiple OR operators.

Bonus API:

SQL Joins: https://www.w3schools.com/sql/sql_join.asp

The format of the encrypted query and result are free, the evaluation of this bounty will be on speed and correctness on the Requested APIs, bonus APIs are NOT required and would only be used as a last resort to differentiate two submissions.

Other Details

Benchmarking

Implementations will be benchmarked on the following AWS hardware: CPU: hpc7a.96xlarge (192 vCPU, 768 GB Memory) GPU: p3.2xlarge (note that TFHE-rs only supports single GPU execution for now) (1 Tesla V100, 16 GB GPU Memory)

Parameters

The allowed parameters are:

PARAM_MESSAGE_2_CARRY_2_KS_PBS, PARAM_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_2_KS_PBS, PARAM_MULTI_BIT_MESSAGE_2_CARRY_2_GROUP_3_KS_PBS

Your solution must be able to run with the multi bit and non multi bit parameter sets, here it should be straightforward as the message and carry moduli are the same.

Note that the multi bit parameters should be faster on GPU, namely the group 3 should be the fastest. It could be the case on CPU as well but it can saturate even the largest CPU machines with threads and performance suffers as a consequence.

Turtle shell queries

As Mario Kart has its turtle shells, you will be able, if you want to, to provide one example of a small DB (with the directory and csv structure discussed earlier) and a “turtle shell” query you feel is very hard/has corner cases alongside your submission, we will run the DB and the query on the submissions we gather, corner cases beware! The performance and correctness on these hunter provided use cases will be used for the evaluation of the bounty, in addition to cases we’ll choose ourselves, we ask that the query runs in under 5 minutes on a laptop/commodity hardware, though our benchmarking machine are larger we want to keep runtimes reasonable when possible.

Good luck!

Reward

🥇Best submission: up to €5,000

To be considered best submission, a solution must be efficient, effective and demonstrate a deep understanding of the core problem. Alongside the technical correctness, it should also be submitted with a clean code, clear explanations and a complete documentation.

🥈Second-best submission: up to €3,000

For a solution to be considered the second best submission, it should be both efficient and effective. The code should be neat and readable, while its documentation might not be as exhaustive as the best submission, it should cover the key aspects of the solution.

🥉Third-best submission: up to €2,000

The third best submission is one that presents a solution that effectively tackles the challenge at hand, even if it may have certain areas of improvement in terms of efficiency or depth of understanding. Documentation should be present, covering the essential components of the solution.

Reward amounts are decided based on code quality and speed performance on a m6i.metal AWS server.

Related links and references

Questions?

Do you have a specific question about this bounty? Simply comment this issue and our team will get back to you!

zaccherinij avatar Feb 09 '24 14:02 zaccherinij

Can we assume that the encrypted query is a tuple of multiple encrypted components parsed from the original SQL query?

I saw a similar approach in the following paper for an equality comparison. Doing so will likely mean loosing some privacy with regard to access patterns. However, we can likely also deduce an access pattern based on using a clear database.

https://eprint.iacr.org/2015/1176.pdf

BraunPhilipp avatar Feb 12 '24 19:02 BraunPhilipp

@BraunPhilipp the nature of the query should remain concealed, the format after encryption is up to you

IceTDrinker avatar Feb 13 '24 09:02 IceTDrinker

fn decrypt_result(result: &EncryptedResult) -> String probably needs a decryption key as input?

As the result is a String, I assume the output is expected to be a string representation of whatever data has been selected?

For example, if we run SELECT column_1_name, colum_2_name FROM table_content WHERE column_3_name="first line" on the example table above (table_content.csv), the output would be "0, true". Correct?

matthiasgeihs avatar Feb 13 '24 15:02 matthiasgeihs

fn decrypt_result(result: &EncryptedResult) -> String probably needs a decryption key as input?

As the result is a String, I assume the output is expected to be a string representation of whatever data has been selected?

For example, if we run SELECT column_1_name, colum_2_name FROM table_content WHERE column_3_name="first line" on the example table above (table_content.csv), the output would be "0, true". Correct?

you are right, that's an oversight, decryption needs the secret key, the string is just whatever equivalent form of the data to easily compare with your cleartext query (my guess is it's easier to stringify the result of the clear query rather than putting the encrypted result formatted as a DB entry)

IceTDrinker avatar Feb 13 '24 15:02 IceTDrinker

If there is multiple rows that match, should we support returning multiple rows? Or just return one of the matching rows?

matthiasgeihs avatar Feb 13 '24 18:02 matthiasgeihs

If there is multiple rows that match, should we support returning multiple rows? Or just return one of the matching rows?

has to match the expected SQL behavior, so you should run the clear query on the SQL system you use as reference and check they are the same, so if the clear query returns several rows, then return the several rows

IceTDrinker avatar Feb 13 '24 18:02 IceTDrinker

I don't think this works. You would need to hold enough space in the return value to hold the complete database. Reason: You may as well select the whole database, but the server can't know because it doesn't see the query. So it always needs to reserve enough space to potentially fit the whole DB into the result. Or am I getting something wrong?

matthiasgeihs avatar Feb 13 '24 18:02 matthiasgeihs

I don't think this works. You would need to hold enough space in the return value to hold the complete database. Reason: You may as well select the whole database, but the server can't know because it doesn't see the query. So it always needs to reserve enough space to potentially fit the whole DB into the result. Or am I getting something wrong?

No I think that if you do it to be completely oblivious that's correct, that's what we are looking at for now, we don't expect to run queries on super large DBs, I mean we know some implementations are impractical though they work (like the ecdsa signature we did a while ago)

IceTDrinker avatar Feb 13 '24 18:02 IceTDrinker

OK, just to be clear: So you actually expect the encrypted result to be large enough to hold the whole DB?

Indeed, that seems to get impractical very quickly considering FHE ciphertext sizes. And thinking about it further: If you always return something that is at least the size of the DB, couldn't you just return the full DB always and let the client do the searching? (If it can handle downloading the full DB, it can surely handle searching on it.) 🤔

matthiasgeihs avatar Feb 13 '24 18:02 matthiasgeihs

fully agree, let's say the ideal setting is that the data is too large to download, but you need to be able to query it without the holder seeing what you query

obviously in an FHE setting because of the ciphertext size it's insane, but there should be improvements on that end, the remaining issue is just that such a query performed in a fully oblivous manner looks to be infeasible while returning reasonably sized data

I guess future bounties or research avenues is to make leaky queries with a protocole allowing to restrict the size of the data returned, but that's not the object of this bounty :)

IceTDrinker avatar Feb 13 '24 18:02 IceTDrinker

here it's a kind of "let's make a toy sized bounty of something we would ideally want to do but that might remain impractical"

Also perhaps for a very smart SQL query the integer layer is not the one to use, but asking some deep crypto knowledge/implems (like dropping to the LWE/GLWE level) is not what bounties are about but more about using the lib, seeing what's doable and what's not doable

IceTDrinker avatar Feb 13 '24 18:02 IceTDrinker

A middle ground could be accepting a cleartext LIMIT on every query:

SELECT * FROM table LIMIT 3

https://www.w3schools.com/sql/sql_top.asp (Note that LIMIT seems to be MySQL syntax, whereas SQL TOP seems to be the general standard.)

has to match the expected SQL behavior, so you should run the clear query on the SQL system you use as reference and check they are the same

Note that SQL results are not guaranteed to have any specific order (if you don't use order by). Hence, you could get different outputs on two different runs of two correct implementations.

matthiasgeihs avatar Feb 14 '24 08:02 matthiasgeihs

hash maps to the rescue for the unspecified order, I don't know about the LIMIT one but we will not feed huge DBs for the bounties, as computation will likely be impractical in terms of runtime anyways

IceTDrinker avatar Feb 14 '24 09:02 IceTDrinker

@BraunPhilipp the nature of the query should remain concealed, the format after encryption is up to you

To clarify the expected work:

  • Are you also expecting the operators like AND, OR, >, <, =, etc. to be encrypted as well ?
  • Must "SELECT DISTINCT" be server-side executed only (no post-processing on the client side) ?

0xalexbel avatar Feb 22 '24 21:02 0xalexbel

execution on the server side for the select distinct, otherwise a way to answer to all queries is to return the whole database under the secret key of the user and let the client process the data, you can see how it's a bit of a loop hole if we accept post processing on the client side

we expect the query to be encrypted, I don't think we have anything about hiding the length of the query for this bounty

IceTDrinker avatar Feb 23 '24 09:02 IceTDrinker

  1. OK : No post-processing what so ever on the client-side
  2. If operators are also encrypted (I guess, this is what you are expecting), then the return value from the server must be some sort of NxP sparse-matrix (where N=number of rows of the DB and P=number of columns of the DB), right ?. It would be up to the client to extract the final result, or am I missing something ?

0xalexbel avatar Feb 23 '24 10:02 0xalexbel

yes for the sparse part you are correct, the FHE server won't be able to diminish the size of the output due to the oblivious nature of the processing, having the client parse the result and ignore rows/results marked as empty is fair game

IceTDrinker avatar Feb 23 '24 11:02 IceTDrinker

  • Should types of operands be encrypted as well ? (I guess, it should be encrypted)
  • If so, the server side code should only manipulate vectors of FhUint8 (ie: the 'EncryptedQuery' struct mentioned above as an example should only contain one single FHE type)? (I'm afraid the final program may practically be running on only tiny tiny DBs)

0xalexbel avatar Feb 23 '24 14:02 0xalexbel

Format of the EncryptedQuery is free, as long as the query is encrypted and therefore protected from prying eyes

IceTDrinker avatar Feb 23 '24 14:02 IceTDrinker

  1. Are we allowed to encode SQL Query numerical values used multiple times with the same FHE-encrypted value (memory-wise)? For example: SELECT CustomerName FROM Customers WHERE NbOrders > 0 AND TotalPayments > 0 The integer 0 appears 2 times. Using the same encrypted value sent from the client to the server would allow additional optimisations on the server.

  2. Are we allowed to encrypt SQL Query values using the smallest possible FheUint type individually for each value ? Doing so may leak value range information for each value parameter sent from the client to the server. However, this would allow additional optimisations on the server. For example: SELECT CustomerName FROM Customers WHERE NbOrders > 0 AND TotalPayments > 12000 can we encode 0 as a FheUint2 and 12000 as a FheUint16 ? or do we have to encode every numerical values using the same exact uniform FHE type.

0xalexbel avatar Feb 26 '24 13:02 0xalexbel

Hmm the encrypted values should match the type of the column I think to hide the encrypted range ?

IceTDrinker avatar Feb 26 '24 14:02 IceTDrinker

Hmm the encrypted values should match the type of the column I think to hide the encrypted range ?

Sure, that would obviously hide any range.

But, that would require the client to know the exact type of each individual columns in the table (an any other tables in case of JOIN), right ?

What about question 1) ?

0xalexbel avatar Feb 26 '24 14:02 0xalexbel

ok that's a good point, I would guess the header of the DB is known ? So that the user knows how to encrypt the data to send

As for 1, I'm not sure yet, will discuss to see what other think internally as to what we want

IceTDrinker avatar Feb 26 '24 14:02 IceTDrinker

for 1: redundant encryption to avoid leaking the fact some data is the same

IceTDrinker avatar Feb 26 '24 15:02 IceTDrinker

After having thought about your answer, it looks to me that a small leaking problem still persists.

Suppose, for example, some table contains only one single column with 'u64' data in it. If the server encounters a FheUint64 somewhere in the encrypted query, the server knows by default which column the client is scanning. That's great from a server-side-optimisation standpoint, but may not be exactly what you are looking for ?

One possible solution would be to encrypt every single SQL Query element using FheUint256 (to include 32-characters ASCII short strings) therefore avoiding any leakage.

I guess you are trying to test the scenario of a SQL server having zero-idea of what the client is looking up in the tables, right? (Confidential SQL query, a very interesting scenario BTW)

0xalexbel avatar Feb 26 '24 15:02 0xalexbel

ah that's a good point, which is super annoying because then it means that anything will always be in the very worst case...

also it will be super slow :|

IceTDrinker avatar Feb 26 '24 16:02 IceTDrinker

ah that's a good point, which is super annoying because then it means that anything will always be in the very worst case...

also it will be super slow :|

No ! probably not.

Downcasting looks super fast (last time I checked), it should be fine (because the server knows the type of the column). However, some pre-processing is required on the client-side before running the query, like throwing errors and check data overflow. It would be much too costly on the server.

Besides, those nice little optimisations I was counting on (which were the subjects of my first 2 questions) are gone ...

0xalexbel avatar Feb 26 '24 16:02 0xalexbel

if the server does not know what columns to look at, it will have to use FheUint256 anyways no ?

IceTDrinker avatar Feb 26 '24 16:02 IceTDrinker

Any idea of what could be the benchmark results on 'FheBool', 'FheUint2', 'FheUint4' using Bitwise ops on AMD EPYC 9R14 CPU @ 2.60GHz and 740GB of RAM ? There seem to be missing in your online doc ? https://docs.zama.ai/tfhe-rs/getting-started/benchmarks#boolean on I am missing something ?

0xalexbel avatar Feb 27 '24 13:02 0xalexbel

Boolean is the boolean API

For an FheBool and an FheUint2 you should have about 12 ms on our bench machine, FheUint4 should be only slighlty slower

IceTDrinker avatar Feb 27 '24 13:02 IceTDrinker