
Spark: Add "Iceberg" prefix to SparkTable name string for better observability of Iceberg tables on SparkUI

Open sumeetgajjar opened this issue 2 years ago • 7 comments

SPARK-40067 added Table#name() to the BatchScan node on the Spark UI. With this improvement in place, for all DSv2 tables the Spark UI shows BatchScan <Table#name> instead of an uninformative BatchScan.

In its current state, for an Iceberg table the Spark UI shows BatchScan <catalog_name>.<namespace>.<table>.

This can be further improved by adding the data source name, i.e. Iceberg, to the table name, yielding BatchScan iceberg <catalog_name>.<namespace>.<table>.

E.g. BatchScan Iceberg spark_catalog.default.ui_test_parquet
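
A minimal sketch of where such a prefix could be applied, assuming the change lives in Iceberg's Spark `Table` implementation (the class below is a simplified, hypothetical stand-in, not the actual patch):

```java
import java.util.Collections;
import java.util.Set;

import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCapability;
import org.apache.spark.sql.types.StructType;

// Hypothetical, simplified stand-in for Iceberg's SparkTable, only to show where
// the "iceberg " prefix would be applied; the real class implements many more
// mix-in interfaces (SupportsRead, SupportsWrite, ...).
public class PrefixedSparkTable implements Table {
  private final org.apache.iceberg.Table icebergTable;

  public PrefixedSparkTable(org.apache.iceberg.Table icebergTable) {
    this.icebergTable = icebergTable;
  }

  @Override
  public String name() {
    // Spark renders Table#name() in the BatchScan node header (SPARK-40067),
    // so prefixing here yields "BatchScan iceberg <catalog>.<namespace>.<table>".
    return "iceberg " + icebergTable.name();
  }

  @Override
  public StructType schema() {
    // Conversion from the Iceberg schema is elided in this sketch.
    return new StructType();
  }

  @Override
  public Set<TableCapability> capabilities() {
    return Collections.singleton(TableCapability.BATCH_READ);
  }
}
```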

Note: SPARK-40067 is available from Spark 3.4 onwards. Since it is not a bug fix, the improvement cannot be backported to OSS Spark 3.3. So if one is using Iceberg with OSS Spark 3.3, the improvement won't be visible and the uninformative BatchScan will still be shown in the Spark UI unless they backport it in their fork.

Iceberg table in the Spark UI with OSS Spark 3.3

(screenshot)

Iceberg table in the Spark UI with SPARK-40067 backported to OSS Spark 3.3

(screenshot) This is good, but we still cannot tell at a glance that the table is an Iceberg table.

Iceberg table in the Spark UI with SPARK-40067 backported to OSS Spark 3.3 and with this PR applied

(screenshot)

sumeetgajjar avatar Aug 25 '22 01:08 sumeetgajjar

cc: @rdblue @kbendick @aokolnychyi @danielcweeks @jackye1995

sumeetgajjar avatar Aug 25 '22 01:08 sumeetgajjar

why are these necessary?

zinking avatar Aug 29 '22 01:08 zinking

> why are these necessary?

Hi @zinking, this change makes it possible to quickly identify an Iceberg table in the Spark UI. Without it, the user has to hover over each node in the Spark UI to check whether it is an Iceberg table.

sumeetgajjar avatar Aug 29 '22 17:08 sumeetgajjar

Updated the PR to remove the default write file format, since a BatchScan can read files in various formats, e.g. data (append) files in Parquet and delete files in Avro.

sumeetgajjar avatar Sep 02 '22 02:09 sumeetgajjar

Updated the PR to just add "Iceberg" prefix in the table name.

sumeetgajjar avatar Sep 15 '22 22:09 sumeetgajjar

Gentle ping to the reviewers: @rdblue @kbendick @aokolnychyi @danielcweeks @jackye1995

sumeetgajjar avatar Sep 15 '22 22:09 sumeetgajjar

Thanks for the comments @wypoon, addressed them in the latest commit.

sumeetgajjar avatar Sep 16 '22 22:09 sumeetgajjar

@szehon-ho @RussellSpitzer can you please review this? The change is very small, but I think it is a nice usability improvement for the Spark UI.

wypoon avatar Jan 10 '23 01:01 wypoon

I could be wrong, but it looks a bit extra to me to add BatchScan Iceberg catalog.db.table. I feel that the data source type isn't the most vital information for debugging (I am guessing it will not vary all that much within an environment), but on the other hand it will always take up some real estate in the middle of the node header, which is already a bit too long and dominates the box.

I'm not very familiar with this area, but could it be easily added to the node's content, like DataSource: Iceberg, along with the other stats such as number of output rows?

But it's a small, subjective concern; if @RussellSpitzer or others are OK with it, I'm fine.

szehon-ho avatar Jan 10 '23 02:01 szehon-ho

For native data sources such as Parquet, Avro, ORC, etc., the Spark UI shows a label like "Scan parquet db.table" for the node, so the data source name/type is shown in the label. I think showing "Iceberg" in the node is helpful.

wypoon avatar Jan 10 '23 19:01 wypoon

The stats come from SQLMetrics that Spark supports for the Scan (or custom SQLMetrics that are implemented for the data source). Adding something like "DataSource: Iceberg" to this part of the UI would require changes in Spark itself. (As a hack, it could be added as a "custom metric", but it really is not a metric.)
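
For concreteness, such a hack would look roughly like the following, using Spark's DSv2 CustomMetric interface (the class name is made up for illustration; this is not something being proposed):

```java
import org.apache.spark.sql.connector.metric.CustomMetric;

// Illustrative only: a pseudo-metric whose aggregated value is always the
// constant string "Iceberg". The Scan would return it from
// supportedCustomMetrics(), and it would render under the node's metrics as
// "data source: Iceberg" -- which is exactly why it is a hack: it is not a metric.
public class DataSourceNameMetric implements CustomMetric {
  @Override
  public String name() {
    return "dataSourceName";
  }

  @Override
  public String description() {
    return "data source";
  }

  @Override
  public String aggregateTaskMetrics(long[] taskMetrics) {
    // Ignore the per-task values entirely and always report a constant label.
    return "Iceberg";
  }
}
```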

wypoon avatar Jan 10 '23 19:01 wypoon

I didn't know Spark already uses "orc table" and "parquet table" as table names for those file formats. In that case, it's OK with me then, thanks.

szehon-ho avatar Jan 10 '23 23:01 szehon-ho

Merged, thanks @sumeetgajjar, @wypoon

szehon-ho avatar Jan 11 '23 18:01 szehon-ho

Thanks @wypoon and @szehon-ho for the reviews and for merging the PR.

sumeetgajjar avatar Jan 11 '23 21:01 sumeetgajjar

This seems a bit odd to me, to be honest. Why not use scan.description() in Spark for this?

aokolnychyi avatar Feb 17 '23 22:02 aokolnychyi

> This seems a bit odd to me, to be honest. Why not use scan.description() in Spark for this?

@aokolnychyi SparkScan#description() as implemented in Iceberg does not return the table name but rather a lot of other info: https://github.com/apache/iceberg/blob/3efaee1790fe541d0969e12fa24afa4e1a3041c3/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java#L161-L171

Furthermore, Scan#description() from FileScan in Spark also returns a lot more info, including the table name; thus it made sense to use Table#name():

https://github.com/apache/spark/blob/0b9ed26e48248aa58642b3626a02dd8c89a01afb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala#L110-L130
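
To illustrate the distinction (the strings below are invented for the example, not taken from either linked file): name() gives a short, stable label suited to the node header, while a description()-style string carries verbose scan state better suited to the plan text.

```java
// Hypothetical contrast between the two labels discussed above; values are made up.
public class ScanLabelContrast {
  public static void main(String[] args) {
    // What a Table#name()-style label yields: short and header-friendly.
    String name = "iceberg spark_catalog.default.ui_test_parquet";
    System.out.println("BatchScan " + name);

    // What a Scan#description()-style string tends to look like: verbose scan
    // state (filters, projected columns, ...), too long for the node header.
    String description = name + " [filters=[id > 100], columns=[id, data]]";
    System.out.println(description);
  }
}
```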

sumeetgajjar avatar Feb 24 '23 00:02 sumeetgajjar