[Spark]: Adapting 'path' to Spark's 'location' in table props and supporting the customization of the table location when creating a table
Purpose
~~This relocates the 'path' property in the table options to 'location' for better presentation.~~
- Adapt 'path' to Spark's 'location' in table props (the 'path' is still preserved in table props).
- Support customizing the table location when creating a table (only allowed when creating an external table using the hive catalog).
Spark uses the reserved property 'location' to indicate the location of a table. In the output of DESC EXTENDED, the location is displayed under the "# Detailed Table Information" section for better visibility:
+----------------------------+-----------------------------------------------------------------------------------------------------+-------+
|col_name |data_type |comment|
+----------------------------+-----------------------------------------------------------------------------------------------------+-------+
|a |bigint |NULL |
|b |varchar(10) |NULL |
|c |char(10) |NULL |
| | | |
|# Metadata Columns | | |
|__paimon_file_path |string | |
|__paimon_row_index |bigint | |
| | | |
|# Detailed Table Information| | |
|Name |default.testTableAs | |
|Type |MANAGED | |
|Location |file:/var/folders/2r/v_2n6mbj41v7q14m8f3j9q4w0000gn/T/junit3802231298897594813/default.db/testTableAs| |
|Provider |paimon | |
|Owner |zhongyujiang | |
|Table Properties |[file.format=parquet] | |
+----------------------------+-----------------------------------------------------------------------------------------------------+-------+
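To make the first bullet under Purpose concrete, here is a minimal Python sketch of mirroring 'path' under 'location' while keeping 'path' in place. It is only an illustration: the function name `to_spark_properties` is hypothetical and this is not the code in this PR.

```python
from typing import Dict


def to_spark_properties(paimon_options: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the options with 'path' mirrored under 'location'.

    Hypothetical helper for illustration only; the original 'path' key
    is preserved and the input dict is not modified.
    """
    props = dict(paimon_options)
    if "path" in props:
        props["location"] = props["path"]
    return props
```

An engine that reads the reserved 'location' key can then show it in its own place, while anything that still reads 'path' keeps working.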
Tests
API and Format
Documentation
cc @Zouxxyy @YannByron Can you help review this? Thanks!
Hi @JingsongLi, this does not introduce a new option; it adapts Paimon better to the Spark engine, because Spark has always used 'location' to represent the location of a table.
> the interoperability between engines
Are you suggesting that you want users to retrieve the table location information from the path option within the options when using the DESC command in different engines?
However, I would like to point out that the meaning of the DESC command inherently varies across engines and is closely tied to each engine's functionality. For instance, the DESC command in Trino and Flink does not display table options at all, only column information. Yet Flink can display the primary key attribute of columns, because primary keys are part of the Flink specification (this is not the case in Spark).
Therefore, I believe we should also adapt this to Spark, which would be more in line with the usage habits of Spark users.
BTW, for Iceberg and Hive tables, Spark's DESC EXTENDED command displays the location information in the location field. For Paimon tables, however, the location is hidden within the options under the path field, which is less convenient for users. This is the motivation behind this PR.
Thanks @zhongyujiang for the update. Left some comments.
I made some updates. Summary:
- Support CREATE TABLE x LOCATION 'xxx' when using the hive catalog (for other catalogs, such as the file system catalog, an exception will be thrown). A table created this way is treated as an external table.
- When a managed table is dropped, its data files are deleted.
- When an external table is dropped, its data files are not deleted.
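The create and drop semantics summarized above can be modeled with a small toy catalog. This is a rough sketch under stated assumptions, not the implementation in this PR: `ToyCatalog`, its methods, and the placeholder file `data-0.parquet` are all hypothetical, and temporary directories on local disk stand in for real table locations.

```python
import shutil
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Optional


@dataclass
class ToyCatalog:
    """Hypothetical in-memory catalog for illustration; not Paimon's API."""

    kind: str  # "hive" or "filesystem"
    warehouse: Path
    tables: Dict[str, dict] = field(default_factory=dict)

    def create_table(self, name: str, location: Optional[str] = None) -> dict:
        # A custom location is only accepted by the hive catalog.
        if location is not None and self.kind != "hive":
            raise ValueError("custom LOCATION requires the hive catalog")
        # A table created with an explicit location is treated as external.
        external = location is not None
        path = Path(location) if external else self.warehouse / name
        path.mkdir(parents=True, exist_ok=True)
        (path / "data-0.parquet").touch()  # placeholder data file
        meta = {"location": str(path), "external": external}
        self.tables[name] = meta
        return meta

    def drop_table(self, name: str) -> None:
        meta = self.tables.pop(name)
        # Managed tables lose their data files; external tables keep them.
        if not meta["external"]:
            shutil.rmtree(meta["location"])
```

With this model, dropping a managed table removes its directory, dropping an external table only removes the catalog entry, and passing a location to a catalog whose kind is "filesystem" raises an error.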
CC @zhongyujiang @JingsongLi
@Zouxxyy @JingsongLi Thanks for updating this. I'm sorry I did not follow up on this PR in time.