[Spark]: Adapting 'path' to Spark's 'location' in table props and supporting the customization of the table location when creating a table

Open zhongyujiang opened this issue 1 year ago • 3 comments

Purpose

~~This relocates the 'path' property in the table options to 'location' for better presentation.~~

  • Adapt 'path' to Spark's 'location' in table props (the 'path' is still preserved in table props)
  • Support customizing the table location when creating a table (only allowed when creating an external table using the hive catalog)
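
A minimal sketch of the first bullet, assuming the table options are a plain string map. The helper name `to_spark_properties` and the example path are hypothetical; this is a toy model of the mapping, not Paimon's actual code.

```python
from typing import Dict


def to_spark_properties(options: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the options with 'path' mirrored to Spark's 'location'."""
    props = dict(options)  # copy so the caller's map is not mutated
    if "path" in props:
        # 'path' stays in place; 'location' is added alongside it.
        props["location"] = props["path"]
    return props


# Hypothetical options for illustration only.
options = {"path": "file:/tmp/warehouse/default.db/t", "file.format": "parquet"}
props = to_spark_properties(options)
print(props["location"])
```

Keeping 'path' while adding 'location' means readers of either key see the same value.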

Spark uses the reserved property location to indicate the location of the table, and in the output of DESC EXTENDED, the location information will be displayed under the "# Detailed Table Information" section for better visibility.

+----------------------------+-----------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                            |comment|
+----------------------------+-----------------------------------------------------------------------------------------------------+-------+
|a                           |bigint                                                                                               |NULL   |
|b                           |varchar(10)                                                                                          |NULL   |
|c                           |char(10)                                                                                             |NULL   |
|                            |                                                                                                     |       |
|# Metadata Columns          |                                                                                                     |       |
|__paimon_file_path          |string                                                                                               |       |
|__paimon_row_index          |bigint                                                                                               |       |
|                            |                                                                                                     |       |
|# Detailed Table Information|                                                                                                     |       |
|Name                        |default.testTableAs                                                                                  |       |
|Type                        |MANAGED                                                                                              |       |
|Location                    |file:/var/folders/2r/v_2n6mbj41v7q14m8f3j9q4w0000gn/T/junit3802231298897594813/default.db/testTableAs|       |
|Provider                    |paimon                                                                                               |       |
|Owner                       |zhongyujiang                                                                                         |       |
|Table Properties            |[file.format=parquet]                                                                                |       |
+----------------------------+-----------------------------------------------------------------------------------------------------+-------+

Tests

API and Format

Documentation

zhongyujiang avatar Jul 30 '24 04:07 zhongyujiang

cc @Zouxxyy @YannByron Can you help review this? Thanks!

zhongyujiang avatar Jul 30 '24 04:07 zhongyujiang

Hi @JingsongLi, this does not introduce a new option; it adapts Paimon better to the Spark engine, because Spark has always used location to represent the location of the table.

> the interoperability between engines

Are you suggesting that you want users to retrieve the table location information from the path option within the options when using the DESC command in different engines?

However, I would like to point out that the meaning of the DESC command inherently varies across engines and is closely tied to each engine's functionality. For instance, the DESC command in Trino and Flink does not display table options at all, only column information. Yet Flink can display the primary key attribute of columns, because primary keys are part of the Flink specification (this is not the case in Spark).

Therefore, I believe we should also adapt this to Spark, which would be more in line with the usage habits of Spark users.

BTW, for Iceberg and Hive tables, Spark's DESC EXTENDED command displays the location information through the location field. However, for Paimon tables, the location information is hidden within the options under the path field, making it less convenient for users. This is the motivation behind this PR.

zhongyujiang avatar Jul 30 '24 11:07 zhongyujiang

Thanks @zhongyujiang for the update. Left some comments.

JingsongLi avatar Aug 06 '24 04:08 JingsongLi

I made some updates. Summary:

  1. Support CREATE TABLE x LOCATION 'xxx' when using the hive catalog (for other catalogs, such as the file system catalog, an exception will be thrown). A table created this way is treated as an external table.
  2. When a managed table is dropped, its data files are deleted.
  3. When an external table is dropped, its data files are not deleted.
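
The three points above can be sketched as a toy model. `ToyCatalog`, its method names, and the "hive"/"filesystem" labels are hypothetical and chosen only for illustration; this is not Paimon's implementation, only a small simulation of the described behavior using temporary directories.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, Optional


class ToyCatalog:
    """Toy model of the summarized behavior; not Paimon's implementation."""

    def __init__(self, kind: str, warehouse: Path):
        self.kind = kind  # assumed labels: "hive" or "filesystem"
        self.warehouse = warehouse
        self.tables: Dict[str, dict] = {}

    def create_table(self, name: str, location: Optional[Path] = None) -> None:
        # A custom location is accepted only by the hive catalog.
        if location is not None and self.kind != "hive":
            raise ValueError("custom LOCATION is only supported with the hive catalog")
        external = location is not None  # custom location => external table
        path = location if external else self.warehouse / name
        path.mkdir(parents=True, exist_ok=True)
        self.tables[name] = {"path": path, "external": external}

    def drop_table(self, name: str) -> None:
        table = self.tables.pop(name)
        if not table["external"]:
            shutil.rmtree(table["path"])  # managed: data files are deleted
        # external: only the catalog entry is removed, files stay on disk


root = Path(tempfile.mkdtemp())
hive = ToyCatalog("hive", root / "warehouse")
hive.create_table("managed_t")
hive.create_table("external_t", location=root / "elsewhere" / "external_t")
managed_path = hive.tables["managed_t"]["path"]
external_path = hive.tables["external_t"]["path"]
hive.drop_table("managed_t")
hive.drop_table("external_t")
print(managed_path.exists(), external_path.exists())
```

In this model, whether a table is external is decided once at creation time by the presence of a custom location, so the drop path only needs to consult that flag.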

CC @zhongyujiang @JingsongLi

Zouxxyy avatar Nov 20 '24 02:11 Zouxxyy

@Zouxxyy @JingsongLi Thanks for updating this. I'm sorry I did not follow up on this PR in time.

zhongyujiang avatar Nov 26 '24 10:11 zhongyujiang