
[Bug][AuthZ] Kyuubi has no permission to access the Iceberg metadata table after integrating Ranger

Open · MLikeWater opened this issue 2 years ago · 14 comments

Code of Conduct

Search before asking

  • [X] I have searched in the issues and found no similar issues.

Describe the bug

Environment

Spark version: 3.2.2
Kyuubi version: apache-kyuubi-1.7.0-SNAPSHOT-bin (master)

./build/dist --tgz --spark-provided --flink-provided -Pspark-3.2

Iceberg version: 0.14.1

wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/0.14.1/iceberg-spark-runtime-3.2_2.12-0.14.1.jar

Perform SQL operations

use testdb;
CREATE TABLE testdb.iceberg_tbl (id bigint, data string) USING iceberg;
INSERT INTO testdb.iceberg_tbl VALUES (1, 'a'), (2, 'b'), (3, 'c');
select * from testdb.iceberg_tbl;
+-----+-------+
| id  | data  |
+-----+-------+
| 1   | a     |
| 2   | b     |
| 3   | c     |
+-----+-------+

SELECT * FROM testdb.iceberg_tbl.history;

22/12/07 17:16:37 ERROR ExecuteStatement: Error operating ExecuteStatement: org.apache.kyuubi.plugin.spark.authz.AccessControlException: Permission denied: user [test_user] does not have [select] privilege on [testdb.iceberg_tbl/history/made_current_at]
        at org.apache.kyuubi.plugin.spark.authz.ranger.SparkRangerAdminPlugin$.verify(SparkRangerAdminPlugin.scala:128)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5(RuleAuthorization.scala:94)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5$adapted(RuleAuthorization.scala:93)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.checkPrivileges(RuleAuthorization.scala:93)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:36)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:33)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:211)
        at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
        at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
        at scala.collection.immutable.List.foldLeft(List.scala:91)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
        at scala.collection.immutable.List.foreach(List.scala:431)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:125)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:183)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:183)
        at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:121)
        at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:117)
        at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:135)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:153)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:150)
        at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201)
        at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:246)
        at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:215)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:98)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
        at org.apache.spark.sql.Dataset.toLocalIterator(Dataset.scala:3000)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$2.iterator(ExecuteStatement.scala:107)
        at org.apache.kyuubi.operation.IterableFetchIterator.<init>(FetchIterator.scala:78)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.$anonfun$executeStatement$1(ExecuteStatement.scala:106)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.kyuubi.engine.spark.operation.SparkOperation.withLocalProperties(SparkOperation.scala:98)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.org$apache$kyuubi$engine$spark$operation$ExecuteStatement$$executeStatement(ExecuteStatement.scala:90)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$3.run(ExecuteStatement.scala:149)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

For an Iceberg table, querying metadata information is a normal operation, for example:

# history
0: jdbc:hive2://xx.xx.xx.xx:10011/default> SELECT * FROM shdw.iceberg_tbl.history;
+--------------------------+----------------------+------------+----------------------+
|     made_current_at      |     snapshot_id      | parent_id  | is_current_ancestor  |
+--------------------------+----------------------+------------+----------------------+
| 2022-05-09 10:58:35.835  | 6955843267870447517  | NULL       | true                 |
+--------------------------+----------------------+------------+----------------------+

# snapshots
0: jdbc:hive2://xx.xx.xx.xx:10011/default> SELECT * FROM shdw.iceberg_tbl.snapshots;
+--------------------------+----------------------+------------+------------+----------------------------------------------------+----------------------------------------------------+
|       committed_at       |     snapshot_id      | parent_id  | operation  |                   manifest_list                    |                      summary                       |
+--------------------------+----------------------+------------+------------+----------------------------------------------------+----------------------------------------------------+
| 2022-05-09 10:58:35.835  | 6955843267870447517  | NULL       | append     | hdfs://cluster1/tgwarehouse/shdw.db/iceberg_tbl/metadata/snap-6955843267870447517-1-e8206624-fbc3-4cf5-b2cb-2db672393253.avro | {"added-data-files":"3","added-files-size":"1929","added-records":"3","changed-partition-count":"1","spark.app.id":"spark-application-1652065040852","total-data-files":"3","total-delete-files":"0","total-equality-deletes":"0","total-files-size":"1929","total-position-deletes":"0","total-records":"3"} |
+--------------------------+----------------------+------------+------------+----------------------------------------------------+----------------------------------------------------+

# history join snapshots
0: jdbc:hive2://xx.xx.xx.xx:10011/default> select
    h.made_current_at,
    s.operation,
    h.snapshot_id,
    h.is_current_ancestor,
    s.summary['spark.app.id']
from shdw.iceberg_tbl.history h
join shdw.iceberg_tbl.snapshots s
  on h.snapshot_id = s.snapshot_id
order by made_current_at;
+--------------------------+------------+----------------------+----------------------+----------------------------------+
|     made_current_at      | operation  |     snapshot_id      | is_current_ancestor  |      summary[spark.app.id]       |
+--------------------------+------------+----------------------+----------------------+----------------------------------+
| 2022-05-09 10:58:35.835  | append     | 6955843267870447517  | true                 | spark-application-1652065040852  |
+--------------------------+------------+----------------------+----------------------+----------------------------------+

Affects Version(s)

1.7.0 (master branch)

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

22/12/07 16:53:57 ERROR ExecuteStatement: Error operating ExecuteStatement: org.apache.kyuubi.plugin.spark.authz.AccessControlException: Permission denied: user [test_user] does not have [select] privilege on [testdb.foo/history/made_current_at]
        at org.apache.kyuubi.plugin.spark.authz.ranger.SparkRangerAdminPlugin$.verify(SparkRangerAdminPlugin.scala:128)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5(RuleAuthorization.scala:94)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5$adapted(RuleAuthorization.scala:93)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.checkPrivileges(RuleAuthorization.scala:93)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:36)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:33)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:211)
        at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
        at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
        at scala.collection.immutable.List.foldLeft(List.scala:91)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
        at scala.collection.immutable.List.foreach(List.scala:431)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:125)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:183)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:183)
        at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:121)
        at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:117)
        at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:135)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:153)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:150)
        at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201)
        at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:246)
        at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:215)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:98)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
        at org.apache.spark.sql.Dataset.toLocalIterator(Dataset.scala:3000)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$2.iterator(ExecuteStatement.scala:107)
        at org.apache.kyuubi.operation.IterableFetchIterator.<init>(FetchIterator.scala:78)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.$anonfun$executeStatement$1(ExecuteStatement.scala:106)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.kyuubi.engine.spark.operation.SparkOperation.withLocalProperties(SparkOperation.scala:98)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.org$apache$kyuubi$engine$spark$operation$ExecuteStatement$$executeStatement(ExecuteStatement.scala:90)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$3.run(ExecuteStatement.scala:149)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Kyuubi Server Configurations

spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension,org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type=hive

Kyuubi Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • [ ] Yes. I can submit a PR independently to fix.
  • [ ] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • [ ] No. I cannot submit a PR at this time.

MLikeWater · Dec 07 '22

cc @bowenliang123 @yaooqinn

bowenliang123 · Dec 07 '22

[screenshot: debug view showing the table identifier resolved from the query]

I don't have a clue how to exclude metadata tables like history/snapshots based on the table identifier. As shown in the screenshot above, the table identifier from select * from iceberg_ns.owner_variable.history is Some(iceberg_ns.owner_variable.history). Is there a way to check that the table is in an Iceberg catalog and then skip the metadata tables?

bowenliang123 · Dec 07 '22
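
For context, a minimal sketch (against the Spark 3.2 DataSourceV2 API) of how those identifiers can be pulled out of an analyzed plan; the v2Identifiers helper is hypothetical, not part of the AuthZ plugin:

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

// Collect the resolved identifiers of all V2 relations in a plan. For
// SELECT * FROM iceberg_ns.owner_variable.history this yields the full
// identifier including the history suffix, as in the screenshot above.
def v2Identifiers(plan: LogicalPlan): Seq[String] =
  plan.collect {
    case r: DataSourceV2Relation =>
      r.identifier.map(_.toString).getOrElse("<unresolved>")
  }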

The metadata tables are enumerable; maybe we can hard-code mapping the metadata tables' permission check onto the data table?

pan3793 · Dec 07 '22
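
A minimal sketch of the enumeration idea above, assuming plain string matching on the resolved identifier is acceptable. The helper and the table-name list are illustrative, not Kyuubi APIs, and the list may not be exhaustive across Iceberg versions:

object IcebergMetadataTables {
  // Metadata tables documented by Iceberg 0.14.x (illustrative list).
  private val names = Set(
    "history", "snapshots", "files", "manifests",
    "partitions", "all_data_files", "all_manifests")

  // If the identifier ends with a known metadata table, strip the suffix so
  // the privilege check targets the data table instead.
  def toBaseTable(parts: Seq[String]): Seq[String] =
    if (parts.length >= 2 && names.contains(parts.last.toLowerCase)) parts.dropRight(1)
    else parts
}

// IcebergMetadataTables.toBaseTable(Seq("testdb", "iceberg_tbl", "history"))
//   returns Seq("testdb", "iceberg_tbl")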

The metadata tables are enumerable; maybe we can hard-code mapping the metadata tables' permission check onto the data table?

Yes, but first, how do we check that the underlying table is an Iceberg one?

bowenliang123 · Dec 07 '22
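
One possible answer, sketched under the assumption that matching the implementation class name is sufficient: Iceberg's Spark runtime resolves tables to org.apache.iceberg.spark.source.SparkTable, so a name check avoids a compile-time Iceberg dependency.

// Reflective check: is the resolved V2 table provided by Iceberg?
def isIcebergTable(table: AnyRef): Boolean =
  table.getClass.getName.startsWith("org.apache.iceberg.spark.source.")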

@pan3793 @bowenliang123 Thanks for your support. Different data lake technologies may have different metadata tables. It is possible to tell whether a table is an Iceberg or Hudi one from the structure of the created table:

use testdb;
show create table iceberg_tbl;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE spark_catalog.testdb.iceberg_tbl (
  `id` BIGINT,
  `data` STRING)
USING iceberg
LOCATION 'hdfs://cluster1/tgwarehouse/testdb.db/iceberg_tbl'
TBLPROPERTIES(
  'current-snapshot-id' = '4900628243476923676',
  'format' = 'iceberg/parquet',
  'format-version' = '1')
 |
+----------------------------------------------------+

MLikeWater · Dec 07 '22
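
The same check can be done programmatically instead of parsing SHOW CREATE TABLE output. A sketch, assuming the table is registered in the HMS-backed session catalog; tableProvider is a hypothetical helper, and V2-only catalogs are not covered:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

// Look up the provider ("iceberg", "hudi", ...) recorded for a table
// created with a USING clause.
def tableProvider(spark: SparkSession, db: String, table: String): Option[String] =
  spark.sessionState.catalog
    .getTableMetadata(TableIdentifier(table, Some(db)))
    .provider // e.g. Some("iceberg")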

Why not just grant the SELECT privilege to the user who accesses testdb.iceberg_tbl.history?

yaooqinn · Dec 07 '22

Is this case equivalent to the one where you access a Hive table without having permission on the HMS table or record that stores its metadata?

In other words, if we have the ALTER privilege on the raw table, we can perform an ALTER operation on it, and the metadata changes accordingly. That does not mean we need the ALTER privilege on the metadata directly, which would grant the ability to falsify critical information.

yaooqinn · Dec 07 '22

Why not just grant the SELECT privilege to the user who accesses testdb.iceberg_tbl.history?

@yaooqinn The Iceberg metadata tables, such as history or snapshots, are not stored in the Hive Metastore, so they cannot be authorized by Ranger.

MLikeWater · Dec 07 '22

Why not just grant the SELECT privilege to the user who accesses testdb.iceberg_tbl.history?

This could be a workaround. But these tables are more like meta tables than metadata tables. For querying purposes, these derived tables could be treated as part of the source table itself, just like its columns.

bowenliang123 · Dec 07 '22

[screenshot: resolved plan showing a SparkTable wrapping a HistoryTable for the metadata query]

With further investigation, I think we could tell that the relation is a HistoryTable of an Iceberg table and resolve this that way. SparkTable and HistoryTable are classes from the Iceberg Spark plugin.

bowenliang123 · Dec 07 '22
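
A hedged sketch of that class-based detection: unwrap the Iceberg table from the Spark-facing wrapper and walk its class hierarchy. The table() accessor on SparkTable, and BaseMetadataTable as the superclass of HistoryTable, SnapshotsTable, etc., are assumptions based on the Iceberg 0.14.x source layout; reflection keeps Iceberg off the compile path.

import scala.util.control.NonFatal

def isIcebergMetadataTable(sparkTable: AnyRef): Boolean =
  try {
    // SparkTable#table() returns the wrapped org.apache.iceberg.Table,
    // e.g. HistoryTable or SnapshotsTable for metadata queries (assumption).
    val wrapped = sparkTable.getClass.getMethod("table").invoke(sparkTable)
    Iterator
      .iterate[Class[_]](wrapped.getClass)(_.getSuperclass)
      .takeWhile(_ != null)
      .exists(_.getName == "org.apache.iceberg.BaseMetadataTable")
  } catch {
    case NonFatal(_) => false
  }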

For querying purposes, these derived tables could be treated as part of the source table itself, just like its columns.

Yes, and this happens when you query the raw table, just like the role metadata plays when you query a Hive table, or the indexes, snapshots, etc. that other databases may have.

yaooqinn · Dec 07 '22

Personally, for the Iceberg and Hudi storage formats, permission checks should be simplified when accessing a table's metadata; that is, the permissions on the table metadata should be derived from the permissions on the table itself. If a user has access to the table, they should have access to its metadata. In addition, Ranger does not support the metadata of data lake storage technologies.

MLikeWater · Dec 08 '22

What's the behavior of Trino/Snowflake (or other popular products)?

pan3793 · Dec 07 '23

Personally, for the Iceberg and Hudi storage formats, permission checks should be simplified when accessing a table's metadata; that is, the permissions on the table metadata should be derived from the permissions on the table itself. If a user has access to the table, they should have access to its metadata. In addition, Ranger does not support the metadata of data lake storage technologies.

Agree, and we are facing this issue too. Maybe we can set up a configuration to decide whether to map the metadata tables' permission check to the data table or not, i.e. introduce this as a feature instead of fixing a bug. cc @yaooqinn @pan3793 @bowenliang123

liaoyt · Apr 22 '24
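
A sketch of the configuration-gated behavior proposed above. The config key and the PrivilegeObject shape are hypothetical, not existing Kyuubi options; the point is only where the switch would sit.

object MetadataTableFallback {
  private val metadataTables =
    Set("history", "snapshots", "files", "manifests", "partitions")

  case class PrivilegeObject(parts: Seq[String])

  // When the (hypothetical) flag is on, redirect a metadata-table privilege
  // check to the underlying data table; otherwise keep today's behavior.
  def authzTarget(obj: PrivilegeObject, conf: Map[String, String]): PrivilegeObject = {
    val fallback = conf
      .getOrElse("spark.sql.authz.iceberg.metadataTable.fallbackToTable", "false")
      .toBoolean
    if (fallback && obj.parts.length >= 2 &&
        metadataTables.contains(obj.parts.last.toLowerCase)) {
      PrivilegeObject(obj.parts.dropRight(1))
    } else obj
  }
}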