Two ducklakes, one catalog
I’d like to support more than one DuckLake from a single catalog database (provider agnostic).
Is there a way to do this? Perhaps by prefixing the table names with a unique identifier for each DuckLake?
E.g. tenant1_ducklake_metadata
Maybe different schemas, if you use Postgres as the metadata backend; check out https://github.com/duckdb/ducklake/issues/193
Yes, the idea is to do this using schemas.
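For example, with a Postgres catalog (the schema and path names below are placeholders):

-- two independent DuckLakes sharing one Postgres catalog database,
-- separated by metadata schema
ATTACH 'ducklake:postgres:dbname=catalog' AS tenant1
    (DATA_PATH 's3://bucket/tenant1/', METADATA_SCHEMA 'tenant1_meta');
ATTACH 'ducklake:postgres:dbname=catalog' AS tenant2
    (DATA_PATH 's3://bucket/tenant2/', METADATA_SCHEMA 'tenant2_meta');

Each lake's metadata tables live inside its own schema, so the two lakes stay independent.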
Thanks for the suggestions. Would this work regardless of provider? As I understand it, it's up to each storage backend whether it supports schemas in this sense.
Our use case is to have local development environments use DuckDB as the catalog storage, and then something else (probably Postgres) elsewhere.
If this isn't possible, would a feature request be considered to allow prefixing the name of the tables? Or maybe some other partitioning mechanism that would work regardless of backend?
I was wondering this same thing; a common pattern is to implement "layers" of a data lake (like raw, staging, bronze, silver, gold, dev, etc.). I'm curious whether this couldn't be another table in the metadata database rather than needing to create additional metadata databases per logical "database".
For context, I'm working on a studio that would use a pg cluster as the application layer and then be able to create multiple databases, each with many schemas. Ideally the studio application could bootstrap the additional databases by talking to pg directly; this would be easier if it meant adding a row rather than initializing a whole new metadata database.
It's not impossible with the current architecture, and I'm not sure of the implications for the extension.
As mentioned you can already store multiple DuckLakes in the same catalog database using the METADATA_SCHEMA parameter. This works with back-ends that support schemas - i.e. it works with DuckDB/Postgres, but does not work with MySQL/SQLite. We could consider adding prefix support to the metadata tables to allow this for all back-ends.
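For the local-development case mentioned above, the same option works with a DuckDB file as the catalog, e.g. (file and schema names are placeholders):

-- DuckDB file as the metadata catalog; a second lake can point at the
-- same file with a different METADATA_SCHEMA
ATTACH 'ducklake:dev_catalog.ducklake' AS lake_a
    (DATA_PATH 'lake_a_files/', METADATA_SCHEMA 'lake_a_meta');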
> I was wondering this same thing; a common pattern is to implement "layers" of a data lake (like raw, staging, bronze, silver, gold, dev, etc.). I'm curious whether this couldn't be another table in the metadata database rather than needing to create additional metadata databases per logical "database".
Can't you just create a different schema for bronze, silver, and gold? (A DuckLake schema, not a metadata catalog schema.)
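E.g. something like (names are placeholders):

-- one DuckLake, layered via ordinary schemas inside the lake
ATTACH 'ducklake:postgres:dbname=scratch' AS lake (DATA_PATH 'lake_files/');
CREATE SCHEMA lake.bronze;
CREATE SCHEMA lake.silver;
CREATE SCHEMA lake.gold;
CREATE TABLE lake.bronze.raw_events (id INTEGER, payload VARCHAR);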
Maybe I'm misunderstanding, but when I pass in the METADATA_SCHEMA with a different name, I end up with two metadata databases in pg.
What I thought might be a good design decision is to maintain a single metadata database which can handle multiple logical databases. Perhaps that's possible right now and I'm just "doing it wrong". It also might be a terrible decision for reasons I don't understand yet.
But yes, this is not a blocking problem, as I can spawn as many databases, with schemas, as I need; it's just that each one has its own __ducklake_metadata_{database} database that houses all the metadata for that database.
In my mind, this might look something like an additional table defined in the metadata catalog called databases, with some basic information (id, uuid, name, storage path, ...); the schemas table would then get an additional database_id field. I'm unsure if other tables would need a direct reference; probably not? (Rough sketch below.)
Then on ATTACH (or however other engines might parse this in the future), the client can just scan for the specific database it's looking to mount, then connect to the rest. I think it doesn't really simplify anything for the client, but it makes the catalog design slightly cleaner?
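To make that concrete, a purely hypothetical sketch (ducklake_schema is a real metadata table; everything else here is made up and does not exist in ducklake today):

-- hypothetical: one row per logical database in the shared catalog
CREATE TABLE ducklake_database (
    database_id   BIGINT PRIMARY KEY,
    database_uuid UUID NOT NULL,
    database_name VARCHAR NOT NULL,
    data_path     VARCHAR NOT NULL
);
-- hypothetical: schemas would then point at their owning database
ALTER TABLE ducklake_schema ADD COLUMN database_id BIGINT
    REFERENCES ducklake_database (database_id);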
Edited after I mulled over what I was imagining, to try to make it clearer.
@mitchelljphayes Perhaps you are confusing databases and schemas? In pg, you can have multiple databases, and each database can have multiple schemas.
Here's an example where I have created the database scratch and used the schema foo_meta.
ATTACH 'ducklake:postgres:dbname=scratch' (DATA_PATH 'foo.files', METADATA_SCHEMA 'foo_meta');
Attaching to the database scratch in pg, you can see the schema foo_meta:
mh@:~ $ psql postgresql://localhost/scratch
psql (14.15 (Homebrew))
Type "help" for help.
scratch=# \dn
  List of schemas
   Name   |    Owner
----------+--------------
 foo_meta | markharrison
 public   | markharrison
@marhar that wasn't quite my confusion, but it was related. So thank you!
I'm connecting to the catalog via a pg proxy, which is then passed to a duckdb client, so I had two psql instances running: one connected to the catalog pg directly, and one connected to the ducklake. I got the outputs confused and thought I was creating additional metadata databases in the catalog, when I was actually seeing the duckdb metadata databases.
Seems I needed some rest.
Thanks for the assistance and patience :)
I have a question. When we create a data lake, each user can create their own Delta.io table. If we manage everything uniformly through one database, and the data lake has many departments and individuals creating tables with different permission requirements, how can we manage permissions?
Not a maintainer, but a newcomer to ducklake with similar questions, and I think this is essentially considered out of scope for ducklake itself. You could use e.g. S3 bucket access policies, as well as making sure users don't access the catalog directly, since some actual data (not just metadata) is visible in the catalog itself, and, if using encryption, the catalog essentially functions as a KMS/secret manager as well. There is also the notion of "frozen lakes", i.e. you can make read-only lakes for ad hoc purposes, e.g. grabbing a subset of the main data lake to expose to some team in read-only mode. In that case as well, I think careful management of S3 access policies would be key to accomplishing this without copying data around excessively.
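For the read-only case, I believe you can also combine this with DuckDB's generic READ_ONLY attach flag (names below are placeholders), on top of whatever bucket policies you set:

-- read-only attach; writes through this connection are rejected
ATTACH 'ducklake:postgres:dbname=catalog' AS team_lake
    (DATA_PATH 's3://bucket/team/', READ_ONLY);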