
[Bug][AuthZ] Kyuubi has no permission to access the Iceberg metadata table after integrating Ranger

Open · MLikeWater opened this issue 2 years ago · 14 comments

Code of Conduct

Search before asking

  • [X] I have searched in the issues and found no similar issues.

Describe the bug

Environment

Spark version: 3.2.2
Kyuubi version: apache-kyuubi-1.7.0-SNAPSHOT-bin (master)

./build/dist --tgz --spark-provided --flink-provided -Pspark-3.2

Iceberg version: 0.14.1

wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/0.14.1/iceberg-spark-runtime-3.2_2.12-0.14.1.jar

Perform SQL operations

use testdb;
CREATE TABLE testdb.iceberg_tbl (id bigint, data string) USING iceberg;
INSERT INTO testdb.iceberg_tbl VALUES (1, 'a'), (2, 'b'), (3, 'c');
select * from testdb.iceberg_tbl;
+-----+-------+
| id  | data  |
+-----+-------+
| 1   | a     |
| 2   | b     |
| 3   | c     |
+-----+-------+

SELECT * FROM testdb.iceberg_tbl.history;

22/12/07 17:16:37 ERROR ExecuteStatement: Error operating ExecuteStatement: org.apache.kyuubi.plugin.spark.authz.AccessControlException: Permission denied: user [test_user] does not have [select] privilege on [testdb.iceberg_tbl/history/made_current_at]
        at org.apache.kyuubi.plugin.spark.authz.ranger.SparkRangerAdminPlugin$.verify(SparkRangerAdminPlugin.scala:128)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5(RuleAuthorization.scala:94)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5$adapted(RuleAuthorization.scala:93)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.checkPrivileges(RuleAuthorization.scala:93)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:36)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:33)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:211)
        at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
        at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
        at scala.collection.immutable.List.foldLeft(List.scala:91)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
        at scala.collection.immutable.List.foreach(List.scala:431)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:125)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:183)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:183)
        at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:121)
        at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:117)
        at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:135)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:153)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:150)
        at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201)
        at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:246)
        at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:215)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:98)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
        at org.apache.spark.sql.Dataset.toLocalIterator(Dataset.scala:3000)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$2.iterator(ExecuteStatement.scala:107)
        at org.apache.kyuubi.operation.IterableFetchIterator.<init>(FetchIterator.scala:78)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.$anonfun$executeStatement$1(ExecuteStatement.scala:106)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.kyuubi.engine.spark.operation.SparkOperation.withLocalProperties(SparkOperation.scala:98)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.org$apache$kyuubi$engine$spark$operation$ExecuteStatement$$executeStatement(ExecuteStatement.scala:90)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$3.run(ExecuteStatement.scala:149)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

For an Iceberg table, querying metadata information is a normal operation, for example:

# history
0: jdbc:hive2://xx.xx.xx.xx:10011/default> SELECT * FROM shdw.iceberg_tbl.history;
+--------------------------+----------------------+------------+----------------------+
|     made_current_at      |     snapshot_id      | parent_id  | is_current_ancestor  |
+--------------------------+----------------------+------------+----------------------+
| 2022-05-09 10:58:35.835  | 6955843267870447517  | NULL       | true                 |
+--------------------------+----------------------+------------+----------------------+

# snapshots
0: jdbc:hive2://xx.xx.xx.xx:10011/default> SELECT * FROM shdw.iceberg_tbl.snapshots;
+--------------------------+----------------------+------------+------------+----------------------------------------------------+----------------------------------------------------+
|       committed_at       |     snapshot_id      | parent_id  | operation  |                   manifest_list                    |                      summary                       |
+--------------------------+----------------------+------------+------------+----------------------------------------------------+----------------------------------------------------+
| 2022-05-09 10:58:35.835  | 6955843267870447517  | NULL       | append     | hdfs://cluster1/tgwarehouse/shdw.db/iceberg_tbl/metadata/snap-6955843267870447517-1-e8206624-fbc3-4cf5-b2cb-2db672393253.avro | {"added-data-files":"3","added-files-size":"1929","added-records":"3","changed-partition-count":"1","spark.app.id":"spark-application-1652065040852","total-data-files":"3","total-delete-files":"0","total-equality-deletes":"0","total-files-size":"1929","total-position-deletes":"0","total-records":"3"} |
+--------------------------+----------------------+------------+------------+----------------------------------------------------+----------------------------------------------------+

# history join snapshots
0: jdbc:hive2://xx.xx.xx.xx:10011/default> select
    h.made_current_at,
    s.operation,
    h.snapshot_id,
    h.is_current_ancestor,
    s.summary['spark.app.id']
from shdw.iceberg_tbl.history h
join shdw.iceberg_tbl.snapshots s
  on h.snapshot_id = s.snapshot_id
order by made_current_at;
+--------------------------+------------+----------------------+----------------------+----------------------------------+
|     made_current_at      | operation  |     snapshot_id      | is_current_ancestor  |      summary[spark.app.id]       |
+--------------------------+------------+----------------------+----------------------+----------------------------------+
| 2022-05-09 10:58:35.835  | append     | 6955843267870447517  | true                 | spark-application-1652065040852  |
+--------------------------+------------+----------------------+----------------------+----------------------------------+

Affects Version(s)

1.7.0 (master branch)

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

22/12/07 16:53:57 ERROR ExecuteStatement: Error operating ExecuteStatement: org.apache.kyuubi.plugin.spark.authz.AccessControlException: Permission denied: user [test_user] does not have [select] privilege on [testdb.foo/history/made_current_at]
        at org.apache.kyuubi.plugin.spark.authz.ranger.SparkRangerAdminPlugin$.verify(SparkRangerAdminPlugin.scala:128)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5(RuleAuthorization.scala:94)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.$anonfun$checkPrivileges$5$adapted(RuleAuthorization.scala:93)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization$.checkPrivileges(RuleAuthorization.scala:93)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:36)
        at org.apache.kyuubi.plugin.spark.authz.ranger.RuleAuthorization.apply(RuleAuthorization.scala:33)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:211)
        at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
        at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
        at scala.collection.immutable.List.foldLeft(List.scala:91)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
        at scala.collection.immutable.List.foreach(List.scala:431)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:125)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
        at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:183)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:183)
        at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:121)
        at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:117)
        at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:135)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:153)
        at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:150)
        at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:201)
        at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:246)
        at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:215)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:98)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
        at org.apache.spark.sql.Dataset.toLocalIterator(Dataset.scala:3000)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$2.iterator(ExecuteStatement.scala:107)
        at org.apache.kyuubi.operation.IterableFetchIterator.<init>(FetchIterator.scala:78)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.$anonfun$executeStatement$1(ExecuteStatement.scala:106)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.kyuubi.engine.spark.operation.SparkOperation.withLocalProperties(SparkOperation.scala:98)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement.org$apache$kyuubi$engine$spark$operation$ExecuteStatement$$executeStatement(ExecuteStatement.scala:90)
        at org.apache.kyuubi.engine.spark.operation.ExecuteStatement$$anon$3.run(ExecuteStatement.scala:149)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Kyuubi Server Configurations

spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension,org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type=hive

Kyuubi Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • [ ] Yes. I can submit a PR independently to fix.
  • [ ] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • [ ] No. I cannot submit a PR at this time.

MLikeWater · Dec 07 '22

cc @bowenliang123 @yaooqinn

bowenliang123 · Dec 07 '22

[screenshot: debug view showing the table identifier resolved from the query]

I don't have a clue how to exclude metadata tables like history/snapshots based on the table identifier. As shown in the screenshot above, the table identifier from select * from iceberg_ns.owner_variable.history is Some(iceberg_ns.owner_variable.history). Is there a way to check that the table is in an Iceberg catalog and then skip the metadata tables?

bowenliang123 · Dec 07 '22
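
For context, a minimal sketch (against the Spark 3.2 DataSourceV2 API) of how those identifiers can be pulled out of an analyzed plan; the v2Identifiers helper is hypothetical, not part of the AuthZ plugin:

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation

// Collect the resolved identifiers of all V2 relations in a plan. For
// SELECT * FROM iceberg_ns.owner_variable.history this yields the full
// identifier including the history suffix, as in the screenshot above.
def v2Identifiers(plan: LogicalPlan): Seq[String] =
  plan.collect {
    case r: DataSourceV2Relation =>
      r.identifier.map(_.toString).getOrElse("<unresolved>")
  }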

The metadata tables are enumerable; maybe we can hard-code mapping the metadata tables' permission check onto the data table?

pan3793 · Dec 07 '22
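
A minimal sketch of the enumeration idea above, assuming plain string matching on the resolved identifier is acceptable. The helper and the table-name list are illustrative, not Kyuubi APIs, and the list may not be exhaustive across Iceberg versions:

object IcebergMetadataTables {
  // Metadata tables documented by Iceberg 0.14.x (illustrative list).
  private val names = Set(
    "history", "snapshots", "files", "manifests",
    "partitions", "all_data_files", "all_manifests")

  // If the identifier ends with a known metadata table, strip the suffix so
  // the privilege check targets the data table instead.
  def toBaseTable(parts: Seq[String]): Seq[String] =
    if (parts.length >= 2 && names.contains(parts.last.toLowerCase)) parts.dropRight(1)
    else parts
}

// IcebergMetadataTables.toBaseTable(Seq("testdb", "iceberg_tbl", "history"))
//   returns Seq("testdb", "iceberg_tbl")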

The metadata tables are enumerable; maybe we can hard-code mapping the metadata tables' permission check onto the data table?

Yes, but first, how do we check that the underlying table is an Iceberg one?

bowenliang123 · Dec 07 '22
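
One possible answer, sketched under the assumption that matching the implementation class name is sufficient: Iceberg's Spark runtime resolves tables to org.apache.iceberg.spark.source.SparkTable, so a name check avoids a compile-time Iceberg dependency.

// Reflective check: is the resolved V2 table provided by Iceberg?
def isIcebergTable(table: AnyRef): Boolean =
  table.getClass.getName.startsWith("org.apache.iceberg.spark.source.")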

@pan3793 @bowenliang123 Thanks for your support. Different data lake technologies may have different metadata tables. It is possible to tell whether a table is an Iceberg or Hudi one from the structure of the created table:

use testdb;
show create table iceberg_tbl;
+----------------------------------------------------+
|                   createtab_stmt                   |
+----------------------------------------------------+
| CREATE TABLE spark_catalog.testdb.iceberg_tbl (
  `id` BIGINT,
  `data` STRING)
USING iceberg
LOCATION 'hdfs://cluster1/tgwarehouse/testdb.db/iceberg_tbl'
TBLPROPERTIES(
  'current-snapshot-id' = '4900628243476923676',
  'format' = 'iceberg/parquet',
  'format-version' = '1')
 |
+----------------------------------------------------+

MLikeWater · Dec 07 '22
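
The same check can be done programmatically instead of parsing SHOW CREATE TABLE output. A sketch, assuming the table is registered in the HMS-backed session catalog; tableProvider is a hypothetical helper, and V2-only catalogs are not covered:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

// Look up the provider ("iceberg", "hudi", ...) recorded for a table
// created with a USING clause.
def tableProvider(spark: SparkSession, db: String, table: String): Option[String] =
  spark.sessionState.catalog
    .getTableMetadata(TableIdentifier(table, Some(db)))
    .provider // e.g. Some("iceberg")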

Why not just grant the SELECT privilege to the user who accesses testdb.iceberg_tbl.history?

yaooqinn · Dec 07 '22

Is this case equivalent to the one where you access a Hive table without having permission on the HMS table or record that stores its metadata?

In other words, if we have the ALTER privilege on the raw table, we can perform an ALTER operation on it, and the metadata changes accordingly. That does not mean we need the ALTER privilege on the metadata directly, which would grant the ability to falsify critical information.

yaooqinn · Dec 07 '22

Why not just grant the SELECT privilege to the user who accesses testdb.iceberg_tbl.history?

@yaooqinn The Iceberg metadata tables, such as history or snapshots, are not stored in the Hive Metastore, so they cannot be authorized by Ranger.

MLikeWater · Dec 07 '22

Why not just grant the SELECT privilege to the user who accesses testdb.iceberg_tbl.history?

This could be a workaround. But these tables are more like meta tables than metadata tables. For querying purposes, these derived tables could be treated as part of the source table itself, just like its columns.

bowenliang123 · Dec 07 '22

[screenshot: resolved plan showing a SparkTable wrapping a HistoryTable for the metadata query]

With further investigation, I think we could tell that the relation is a HistoryTable of an Iceberg table and resolve this that way. SparkTable and HistoryTable are classes from the Iceberg Spark plugin.

bowenliang123 · Dec 07 '22
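
A hedged sketch of that class-based detection: unwrap the Iceberg table from the Spark-facing wrapper and walk its class hierarchy. The table() accessor on SparkTable, and BaseMetadataTable as the superclass of HistoryTable, SnapshotsTable, etc., are assumptions based on the Iceberg 0.14.x source layout; reflection keeps Iceberg off the compile path.

import scala.util.control.NonFatal

def isIcebergMetadataTable(sparkTable: AnyRef): Boolean =
  try {
    // SparkTable#table() returns the wrapped org.apache.iceberg.Table,
    // e.g. HistoryTable or SnapshotsTable for metadata queries (assumption).
    val wrapped = sparkTable.getClass.getMethod("table").invoke(sparkTable)
    Iterator
      .iterate[Class[_]](wrapped.getClass)(_.getSuperclass)
      .takeWhile(_ != null)
      .exists(_.getName == "org.apache.iceberg.BaseMetadataTable")
  } catch {
    case NonFatal(_) => false
  }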

For querying purposes, these derived tables could be treated as part of the source table itself, just like its columns.

Yes, and this happens when you query the raw table, just like the role metadata plays when you query a Hive table, or the indexes, snapshots, etc. that other databases may have.

yaooqinn · Dec 07 '22

Personally, for the Iceberg and Hudi storage formats, permission checks should be simplified when accessing a table's metadata; that is, the permissions on the table metadata should be derived from the permissions on the table itself. If a user has access to the table, they should have access to its metadata. In addition, Ranger does not support the metadata of data lake storage technologies.

MLikeWater · Dec 08 '22

What's the behavior of Trino/Snowflake (or other popular products)?

pan3793 · Dec 07 '23

Personally, for the Iceberg and Hudi storage formats, permission checks should be simplified when accessing a table's metadata; that is, the permissions on the table metadata should be derived from the permissions on the table itself. If a user has access to the table, they should have access to its metadata. In addition, Ranger does not support the metadata of data lake storage technologies.

Agree, and we are facing this issue too. Maybe we can set up a configuration to decide whether to map the metadata tables' permission check to the data table or not, i.e. introduce this as a feature instead of fixing a bug. cc @yaooqinn @pan3793 @bowenliang123

liaoyt · Apr 22 '24
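
A sketch of the configuration-gated behavior proposed above. The config key and the PrivilegeObject shape are hypothetical, not existing Kyuubi options; the point is only where the switch would sit.

object MetadataTableFallback {
  private val metadataTables =
    Set("history", "snapshots", "files", "manifests", "partitions")

  case class PrivilegeObject(parts: Seq[String])

  // When the (hypothetical) flag is on, redirect a metadata-table privilege
  // check to the underlying data table; otherwise keep today's behavior.
  def authzTarget(obj: PrivilegeObject, conf: Map[String, String]): PrivilegeObject = {
    val fallback = conf
      .getOrElse("spark.sql.authz.iceberg.metadataTable.fallbackToTable", "false")
      .toBoolean
    if (fallback && obj.parts.length >= 2 &&
        metadataTables.contains(obj.parts.last.toLowerCase)) {
      PrivilegeObject(obj.parts.dropRight(1))
    } else obj
  }
}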