kuzu icon indicating copy to clipboard operation
kuzu copied to clipboard

Attach remote kuzu database design

Open acquamarin opened this issue 9 months ago • 1 comments

Attach remote database design

Design goal: Allow user to execute read-only queries on a remote kuzu database which can be stored on S3 and accessed via https. We are not going to allow users to attach multiple kuzu instances at a time, and users are not allowed to query the original local kuzu database after attaching. The reason why we set this restriction is to keep the feature simple and straightforward.

Prerequisite

The httpfs extension should be installed and loaded beforehand using:

INSTALL httpfs;
LOAD EXTENSION httpfs;

Feature:

1. User can attach/detach a remote database using attach/detach statement as follows:

Attach:

ATTACH 'S3://kuzu_data/db' as uw (type kuzu)

Detach:

DETACH uw

a. After attaching a remote database, users can no longer query/access the original local kuzu database.

b. Users can only query the original local kuzu database after issuing the detach statement.

c. Only one remote database can be attached at a time. meaning:

ATTACH 'S3://kuzu_data/db' AS uw (type kuzu)
ATTACH 'S3://kuzu_data/db1' AS uw1 (type kuzu)

The second attach statement will cause an exception.

d. The Kuzu database will be attached in read-only mode, meaning write queries will be rejected. If the database path is an invalid Kuzu path, an exception will also be raised.

Implementation

  • Attach: After the user issues the attach statement, we reconstruct the fields in the database class using the provided remote file path.
  • Detach: After the user issues the detach statement, we reconstruct the fields in the database class using the original file path (the one provided on startup).

2. Querying

We allow users to attach a remote kuzu database and query it the same way as they would a local kuzu database.

Example:

// Assume person is a valid table in uw database
ATTACH 'S3://kuzu_data/db' AS uw (type kuzu)
MATCH (p:person) RETURN p.*

[!NOTE] Write queries are not allowed.

3. Benchmark Result

I am calling the HTTPFileSystem::read() and LocalFileSystem() read APIs to peform a sequential read of ldbc100 comment.csv file (21GB) to evaluate the read peformance under different filesystems.(results measured in seconds) Block size means: the number of bytes to read for each FileSystem::read() call.

Block size http file system read time (sec) local file system read time (sec)
4KB 145 5
64KB 86 3
256KB 87 3
512KB 105 3
1MB 86 3
4MB 32 3

Downloading/Caching Optimization

Let us also have a version where users can choose to download the entire database from s3 with a single s3 call to a local temporary directory, so we cache the remote database locally. I think we should keep a map of which remote directory has been mapped to which local directory, so that if a user attaches and detaches multiple times during a session, we don't keep downloading these remote databases over and over again.

Manually deleting cached database: Users may want to refresh a cached database from time to time. We can provide a command to "clear the cached remote Kuzu db". That is we can have a function similar to clearning Postgres/DuckDB schema cache CALL clear_cached_remote_kuzudb(S3://kuzu_data/db) RETURN *; This function should be called after the user detaches from S3://kuzu_data/db. Otherwise it should fail. If the user has detached from this database, then this function can remove the local directory to which we cached the database.

When to delete the locally cached database automatically: The rule should be that if any behavior of the system leads to the extension to be uninstalled then we should delete the local temporary directory. For example, if crashing means that the remote Kuzu extension (or https extension) is uninstalled, then we must have deleted all data related to the extension during crashing or during recovery.

Warning/feedback when user attaches to a cached database: When user calls ATTACH 'S3://kuzu_data/db' AS uw, if this database has been cached, it would be good to give some feedback to the user: "Attaching to the cached version of the database S3://kuzu_data/db in /tmp/kuzu/attached-kuzu-dbs/uw/. So the user knows that they have attached to a local copy and can manually remove the cached version if they have to.

We should write tests to test these manual and automatic cache removals and write documentations for these cases.

acquamarin avatar May 09 '24 01:05 acquamarin