
[HUDI-7580] Fix order of fields when records inserted out of order

codope opened this pull request 1 year ago • 5 comments

Change Logs

Fix the order of fields when records are inserted out of order.

Impact

Bug fix, no public API change.

Risk level (write none, low, medium, or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of an existing config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • [ ] Read through contributor's guide
  • [ ] Change Logs and Impact were stated clearly
  • [ ] Adequate tests were added if applicable
  • [ ] CI passed

codope avatar Apr 15 '24 05:04 codope

Wouldn't it be better to fail CREATE TABLE if the partition columns are not at the end? This seems like it could lead to more confusion, since you would still need to do the INSERT INTO with the partition columns at the end.

Spark SQL has no such restriction, so I don't think it's a good idea to fail CREATE TABLE if the partition columns are not at the end. Also, note that this issue was fixed in Hudi 0.11.0 in https://github.com/apache/hudi/commit/ea1fbc71ecf9eec20815e2bc51bb8489e4ca1804 and unfortunately regressed in https://github.com/apache/hudi/commit/cfd0c1ee34460332053491fd1e68c2607c14e958. This PR merely restores the behavior of Hudi 0.11.0 while keeping the perf improvements made by the latter commit intact.

Also, how would this affect the other sql commands?

This change is only in InsertIntoHoodieTableCommand, and should not affect any sql command other than INSERT INTO and CREATE TABLE AS SELECT.
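
For intuition, here is a minimal sketch of the name-based alignment such a fix implies; alignToTableSchema and tableSchemaFieldNames are hypothetical names for illustration, not the actual code in InsertIntoHoodieTableCommand:

import org.apache.spark.sql.DataFrame

// Hypothetical sketch: reorder the incoming query's columns to match the
// table schema by name, so an out-of-order insert such as
// (id, name, ts, price) still lands in the declared order (id, name, price, ts).
def alignToTableSchema(incoming: DataFrame, tableSchemaFieldNames: Seq[String]): DataFrame = {
  // Select the columns in the order the table declares them, resolving by name.
  incoming.select(tableSchemaFieldNames.map(incoming.col): _*)
}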

codope avatar Apr 16 '24 06:04 codope

From your test

        spark.sql(
          s"""
             |create table $tableName (
             |  id int,
             |  name string,
             |  price int,
             |  ts long
             |) using hudi
             | options (
             |  primaryKey ='id',
             |  type = 'mor',
             |  preCombineField = 'ts'
             | )
             | partitioned by(price)
             | location '$basePath'
       """.stripMargin)
        spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)")
        spark.sql(s"insert into $tableName values(2, 'a2', 100, 200000)")
        checkAnswer(s"select id, name, ts, price from $tableName")(
          Seq(1, "a1", 10, 1000),
          Seq(2, "a2", 100, 200000)
        )

we say the schema is (id, name, price, ts). But then in the insert values(1, 'a1', 10, 1000), the 1000 corresponds to price. So the columns are not in the order they are defined in CREATE TABLE.
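
One quick way to see the order Spark actually uses for positional inserts is to describe the table; a sketch, assuming the behavior observed here of listing the partition column after the data columns:

// Sketch: inspect the column order Spark records for this table.
// If the partition column is moved to the end, the output lists
// (id, name, ts, price), which matches the positional mapping above.
spark.sql(s"describe table $tableName").show(truncate = false)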

jonvex avatar Apr 17 '24 01:04 jonvex

@jonvex Can you please take a look at the last commit? I have updated the test to make it clearer what is getting fixed here. Essentially, the order of fields in the table is (id, name, price, ts). The test inserts out of order, i.e. (id, name, ts, price), and then queries the table in the same order as defined when creating the table. The query order does not matter here; it is the out-of-order insert that is getting fixed. Here is the updated test:

        spark.sql(
          s"""
             |create table $tableName (
             |  id int,
             |  name string,
             |  price int,
             |  ts long
             |) using hudi
             | options (
             |  primaryKey ='id',
             |  type = 'mor',
             |  preCombineField = 'ts'
             | )
             | partitioned by(price)
             | location '$basePath'
       """.stripMargin)
        spark.sql(s"insert into $tableName (id, name, ts, price) values(1, 'a1', 10, 1000)")
        spark.sql(s"insert into $tableName (id, name, ts, price) values(2, 'a2', 100, 200000)")
        checkAnswer(s"select id, name, price, ts from $tableName")(
          Seq(1, "a1", 1000, 10),
          Seq(2, "a2", 200000, 100)
        )
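
As a sanity check independent of checkAnswer, one could also read the table back through the Hudi datasource; a sketch, reusing basePath from the test fixture:

// Sketch: read the table back via the Hudi datasource and confirm the
// out-of-order inserts landed in the declared field order.
val df = spark.read.format("hudi").load(basePath)
df.select("id", "name", "price", "ts").orderBy("id").show(truncate = false)
// Expected, per the fix: (1, a1, 1000, 10) and (2, a2, 200000, 100)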

codope avatar Apr 17 '24 12:04 codope

CI report:

  • 4e3719de12d6b1e66e2482a2cc35a574eff81d32 Azure: SUCCESS
Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

hudi-bot avatar Apr 17 '24 14:04 hudi-bot

@codope That makes sense if you specify the column names in the insert into, but what happens if you do it in a different order than the one you tried?

In the previous test iteration, it was

spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)")
spark.sql(s"insert into $tableName values(2, 'a2', 100, 200000)")

and didn't specify which value corresponds to which column. Why did it assume that the order was (id, name, ts, price) when we defined the table as (id, name, price, ts)? That is the part that is confusing me. The default order should match the table order, but from what I am seeing, the default order still has the partition column at the end.

Will the test still work if it is:

spark.sql(s"insert into $tableName (id, name, ts, price) values(1, 'a1', 10, 1000)")
spark.sql(s"insert into $tableName (name,  price, ts, id) values('a2', 200000, 100, 2)")

jonvex avatar Apr 17 '24 15:04 jonvex

@codope Is this PR still relevant?

yihua avatar Jul 26 '24 01:07 yihua

No, we have added validation and updated the docs.

codope avatar Jul 29 '24 09:07 codope