Spark: Add "Iceberg" prefix to SparkTable name string for better observability of Iceberg tables on SparkUI
SPARK-40067 added Table#name() to the BatchScan node on the Spark UI. With this improvement in place, for all DSv2 tables, the Spark UI shows "BatchScan <Table#name>" instead of an uninformative "BatchScan".
In its current state, for an Iceberg table, the Spark UI shows "BatchScan <catalog_name>.<namespace>.<table>".
This can be further improved by adding the data source name, i.e. "Iceberg", to the table name, yielding "BatchScan Iceberg <catalog_name>.<namespace>.<table>".
E.g. BatchScan Iceberg spark_catalog.default.ui_test_parquet
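As a rough, self-contained sketch of the idea (this is not the actual Iceberg source; the class and method names below are stand-ins), the change amounts to prefixing the string returned by the DSv2 Table#name() with "Iceberg ", which Spark then renders in the BatchScan node label:

```java
// Hypothetical sketch of the naming change described above.
// icebergTableName() stands in for the underlying Iceberg Table#name(),
// and sparkTableName() for the name the DSv2 SparkTable exposes to Spark.
public class SparkTableNameSketch {

    // e.g. "spark_catalog.default.ui_test_parquet"
    static String icebergTableName() {
        return "spark_catalog.default.ui_test_parquet";
    }

    // After this PR, the name exposed to Spark carries the "Iceberg" prefix.
    static String sparkTableName() {
        return "Iceberg " + icebergTableName();
    }

    public static void main(String[] args) {
        // SPARK-40067 renders the node label as "BatchScan <Table#name>".
        System.out.println("BatchScan " + sparkTableName());
    }
}
```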
Note: SPARK-40067 is available from Spark 3.4 onwards. Since it is not a bug fix, we cannot backport the improvement to OSS Spark 3.3. So if one is using Iceberg with OSS Spark 3.3, this improvement won't be visible, and the uninformative "BatchScan" label would still be shown on the Spark UI unless they backport SPARK-40067 in their fork.
Iceberg table in SparkUI with OSS Spark3.3
Iceberg table in SparkUI with SPARK-40067 backported to OSS Spark3.3
This is better, but we still cannot tell at a glance whether a given table is an Iceberg table.
Iceberg table in SparkUI with SPARK-40067 backported to OSS Spark3.3 and after this PR

cc: @rdblue @kbendick @aokolnychyi @danielcweeks @jackye1995
why are these necessary?
Hi @zinking, this change is required to quickly identify an Iceberg table on the Spark UI. Without this change, the user has to hover over each node on the Spark UI to check whether it is an Iceberg table.
Updated the PR to remove the default write file format, since BatchScan can read various formats, e.g. appended data files in Parquet and delete files in Avro.
Updated the PR to just add "Iceberg" prefix in the table name.
Gentle ping to the reviewers: @rdblue @kbendick @aokolnychyi @danielcweeks @jackye1995
Thanks for the comments @wypoon, addressed them in the latest commit.
@szehon-ho @RussellSpitzer can you please review this? The change is very small, but I think it is a nice usability improvement for the Spark UI.
I could be wrong, but it looks just a bit extra to me to add "BatchScan Iceberg catalog.db.table".
I feel slightly that the data source type doesn't seem the most vital information for debugging (I am guessing it will not vary all that much within an environment), but on the other hand it will always take up some real estate in the middle of the node header, which is already a bit too long and dominates the box.
Not very familiar, but could it be easily added to the node's content, like "DataSource: Iceberg", along with the other stats like number of output rows?
But it's a small subjective concern; if @RussellSpitzer or others are OK with it here, I'm fine.
For native data sources such as Parquet, Avro, ORC, etc., the Spark UI shows a label like "Scan parquet db.table" for the node, so the data source name/type is shown in the label. I think showing "Iceberg" in the node is helpful.
The stats come from SQLMetrics that Spark supports for the Scan (or custom SQLMetrics that are implemented for the data source). Adding something like "DataSource: Iceberg" to this part of the UI would require changes in Spark itself. (As a hack, it could be added as a "custom metric", but it really is not a metric.)
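To make the "hack" mentioned above concrete, here is a minimal sketch of what abusing the metrics API could look like. The CustomMetric interface is redeclared locally so this compiles without a Spark dependency; its method signatures follow org.apache.spark.sql.connector.metric.CustomMetric, and the class name is made up for illustration:

```java
// Local stand-in for org.apache.spark.sql.connector.metric.CustomMetric,
// redeclared here so the sketch compiles without Spark on the classpath.
interface CustomMetric {
    String name();
    String description();
    String aggregateTaskMetrics(long[] taskMetrics);
}

// The "hack": a pseudo-metric whose aggregated value is always the data
// source name. As noted above, this abuses the metrics API, because the
// value is a constant label rather than a real metric.
public class DataSourceLabelMetric implements CustomMetric {
    @Override public String name() { return "dataSource"; }
    @Override public String description() { return "DataSource"; }
    @Override public String aggregateTaskMetrics(long[] taskMetrics) {
        return "Iceberg"; // ignores the per-task values entirely
    }
}
```

This illustrates why it is only a hack: the label would appear in the stats section of the node, but a proper "DataSource" field there would need changes in Spark itself.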
Didn't know Spark already uses names like "orc table" and "parquet table" for those file formats. In that case, it's OK with me then, thanks.
Merged, thanks @sumeetgajjar , @wypoon
Thanks @wypoon and @szehon-ho for the reviews and for merging the PR.
This seems a bit odd to me, to be honest. Why not use scan.description() in Spark for this?
@aokolnychyi SparkScan#description() as implemented in Iceberg does not return a table name but a lot of other info:
https://github.com/apache/iceberg/blob/3efaee1790fe541d0969e12fa24afa4e1a3041c3/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L161-L171
Furthermore, scan#description() from FileScan in Spark also returns a lot more info, including the table name, thus it made sense to use Table#name():
https://github.com/apache/spark/blob/0b9ed26e48248aa58642b3626a02dd8c89a01afb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala#L110-L130