
[SUPPORT] BaseDatasetBulkInsertCommitActionExecutor.execute does not persist WriteStatus, causing data to be written to Hudi 4 times

Open dongtingting opened this issue 1 year ago • 7 comments

Describe the problem you faced

A job uses bulk insert to insert overwrite a COW table. We found that four stages run the bulk-insert write: the data is written four times, only the data from the last stage remains, and the data written by the other three stages is removed when the write is finalized.

All four stages outlined in red perform the bulk-insert write (see the attached Spark UI screenshots for details on the four stages).

This happens because all four stages consume the WriteStatus RDD. DatasetBulkInsertOverwriteCommitActionExecutor does not persist that RDD, so the upstream RDD (the bulk-insert write) is recomputed four times.

The four actions on the WriteStatus RDD are:

  • DatasetBulkInsertOverwriteCommitActionExecutor.getPartitionToReplacedFileIds: isEmpty

  • DatasetBulkInsertOverwriteCommitActionExecutor.getPartitionToReplacedFileIds: distinct

  • HoodieSparkSqlWriter.commitAndPerformPostOperations: count

  • HoodieSparkSqlWriter.commitAndPerformPostOperations: collect
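The mechanism behind the four actions above can be illustrated outside Spark with a plain-Python toy model of an uncached lazy pipeline (this is not Hudi code; the class and function names are hypothetical, and the model only mimics Spark's recompute-on-every-action semantics):

```python
# Toy stand-in for an UNCACHED Spark RDD: every action re-executes
# the expensive upstream computation (here, the bulk-insert write).
# Not Hudi code; names are illustrative only.

class LazyRDD:
    def __init__(self, compute):
        self._compute = compute  # upstream work, e.g. the bulk-insert write

    # Each "action" below re-runs the upstream computation from scratch.
    def is_empty(self):
        return len(self._compute()) == 0

    def distinct(self):
        return set(self._compute())

    def count(self):
        return len(self._compute())

    def collect(self):
        return list(self._compute())

writes = 0

def bulk_insert_write():
    global writes
    writes += 1  # each call simulates writing the data files again
    return ["status-1", "status-2"]

write_status = LazyRDD(bulk_insert_write)

# The four actions observed in the issue:
write_status.is_empty()   # getPartitionToReplacedFileIds
write_status.distinct()   # getPartitionToReplacedFileIds
write_status.count()      # commitAndPerformPostOperations
write_status.collect()    # commitAndPerformPostOperations

print(writes)  # -> 4: the upstream write ran once per action
```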

Upsert (which does not use bulk insert) does not have this problem, because it persists the WriteStatus RDD. But BaseDatasetBulkInsertCommitActionExecutor does not persist it. I think we should persist the RDD at the beginning of BaseDatasetBulkInsertCommitActionExecutor.buildHoodieWriteMetadata. Does anyone agree?
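The effect of persisting can be sketched with the same kind of toy model (again plain Python, not Hudi code; in actual Spark code this corresponds to calling `persist()` on the WriteStatus RDD before the downstream actions, and `unpersist()` after the commit):

```python
# Toy model of the proposed fix: cache ("persist") the result the
# first time it is computed, so later actions reuse it instead of
# re-running the upstream bulk-insert write. Names are illustrative.

writes = 0

def bulk_insert_write():
    global writes
    writes += 1  # simulates rewriting the data files
    return ["status-1", "status-2"]

class CachedRDD:
    def __init__(self, compute):
        self._compute = compute
        self._cache = None

    def _materialize(self):
        # Analogous to acting on a persisted RDD: the upstream
        # computation runs at most once.
        if self._cache is None:
            self._cache = self._compute()
        return self._cache

    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())

write_status = CachedRDD(bulk_insert_write)
write_status.count()
write_status.collect()

print(writes)  # -> 1: the bulk-insert write ran only once
```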


To Reproduce

Steps to reproduce the behavior:

  1. Create a COW table test_table using the simple index:

create table if not exists test_table
(
    id     string,
    name   string,
    p_date string comment 'partition date, yyyyMMdd'
) USING hudi
partitioned by (p_date)
options (
    type = 'cow'
);
  2. Insert overwrite the table using bulk insert:

set hoodie.datasource.write.operation=BULK_INSERT;
set hoodie.bulkinsert.shuffle.parallelism=200;

insert overwrite test_table partition (p_date = '20240806')
select id, name, p_date
from source_table;
  3. Check the Spark job task logs: the tasks of all four stages (isEmpty, distinct, count, and collect) contain "create marker" and "create handle" log entries.

Expected behavior


Environment Description

  • Hudi version : 0.14.0

  • Spark version : 2.4

  • Hadoop version : 2.6

dongtingting · Aug 08 '24 14:08

@danny0405 @beyond1920 can you help me to confirm?

dongtingting · Aug 08 '24 14:08

@KnightChess maybe you can give some insights here, also cc @nsivabalan for visibility.

danny0405 · Aug 09 '24 01:08

@dongtingting Good catch, and thanks for filing this issue. It seems the WriteStatus RDD is not persisted. I would like to look into this problem today and will reply later.

beyond1920 · Aug 09 '24 01:08

@dongtingting Nice catch. We can persist the WriteStatus to avoid this issue.

KnightChess · Aug 09 '24 06:08

> @dongtingting Nice catch. We can persist the WriteStatus to avoid this issue.

Thanks very much for your reply. I am glad to fix it; I will create a PR later.

dongtingting · Aug 09 '24 09:08

@dongtingting Thanks a lot. Let us know when you have the PR ready.

Creating a tracking JIRA for the same: https://issues.apache.org/jira/browse/HUDI-8078

ad1happy2go · Aug 14 '24 12:08

> @dongtingting Thanks a lot. Let us know when you have the PR ready.
>
> Creating a tracking JIRA for the same: https://issues.apache.org/jira/browse/HUDI-8078

Sorry for the late reply. I have created a PR with the fix: https://github.com/apache/hudi/pull/11811/files. cc @KnightChess

dongtingting · Aug 21 '24 13:08