duckdb_iceberg
Restrict reading client to AWS table namespace
Background
AWS S3 table buckets have a concept called 'namespaces'. Each table belongs to a namespace, and namespaces can be used to separate data for different departments or customers.
The namespace concept is part of the Iceberg REST Catalog spec, and not unique to AWS.
Here are examples of a few endpoints with namespaces:
- `GET /v1/{prefix}/namespaces/{namespace}/tables` (list tables)
- `POST /v1/{prefix}/namespaces/{namespace}/tables` (create table)
- `GET /v1/{prefix}/namespaces/{namespace}/tables/{table}` (load table)
- `POST /v1/{prefix}/namespaces/{namespace}/tables/{table}` (update table)
DuckDB today has mechanisms for restricting the file paths the SQL layer can read, for example:
-- https://duckdb.org/docs/stable/operations_manual/securing_duckdb/overview.html#the-allowed_directories-and-allowed_paths-options
SET allowed_directories = ['/tmp'];
-- With the setting applied, DuckDB will refuse to read files outside /tmp.
Problem/desire
It would be nice if the extension allowed us to specify a namespace outside of SQL, so that it provides a security scoping guarantee: no SQL statement can ever read outside the specified table namespace.
For example, the extension docs today include this snippet:
-- https://duckdb.org/docs/stable/extensions/iceberg/amazon_s3_tables
SELECT count(*)
FROM s3_tables.namespace_name.table_name;
But this would allow user-controlled SQL to read from any namespace.
Some kind of security setting like this would solve it:
-- https://duckdb.org/docs/stable/operations_manual/securing_duckdb/overview#locking-configurations
SET iceberg_allowed_namespaces = ['a', 'b'];
Such that e.g. `SHOW ALL TABLES;` would only list tables under the allowed namespaces, etc.
I guess we'd also need a way to lock down ATTACH and duckdb_secrets().
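A rough sketch of how this could compose with DuckDB's existing configuration locking (note: `iceberg_allowed_namespaces` is the hypothetical setting proposed here; `lock_configuration` is an existing DuckDB option):

```sql
-- Hypothetical setting: restrict the Iceberg extension to two namespaces.
SET iceberg_allowed_namespaces = ['a', 'b'];

-- Existing DuckDB option: prevent any further changes to settings.
SET lock_configuration = true;

-- After locking, user-controlled SQL could no longer widen the namespace
-- list, and e.g. a query against s3_tables.other_namespace.some_table
-- would be rejected.
```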
Benefits
This would greatly simplify authorization: a bucket could contain data belonging to different groups/companies/departments, and we could lock down the DuckDB engine (while it's running) to only allow access to permitted namespaces.
Resources
- https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-namespace.html
- https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-open-source.html
- https://iceberg.apache.org/terms/
Let me know if I should elaborate!
I don't exactly follow, I guess you're talking about nested namespaces which we don't currently support? I'm not sure how that relates to authorization and security, that lies in the hands of the REST Catalog server providing correctly scoped secrets (through vended credentials), no?
@Tishj Thanks for answering!
Not nested necessarily, this would go for single-layer namespaces too (afaik AWS doesn't even support nested namespaces).
I've updated the top comment with more details, I hope that better explains the background and problem!
Hmm okay I understand now, but that doesn't sound like it should be DuckDB's responsibility. This sounds like the S3 credentials created should have their scope limited to that namespace, so other namespaces can't be interacted with by those credentials.
It's not actually securing anything if all you're doing is preventing DuckDB from touching the other namespaces.
Yes, creating scoped credentials via AWS is another avenue (assuming that works; I haven't dug into the details).
I guess one way to look at it is: are these namespaces similar to DuckDB's file path restrictions (listed under security in the DuckDB docs), and if so, should this extension provide something similar?
Or is the view that this is inherently different and that it's not a DuckDB concern?
https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-resource-based-policies.html