
[SUPPORT] Duplicate data in Hudi partitioned COW table (Hudi 0.10.0, Flink 1.13.1)

Open zwj0110 opened this issue 3 years ago • 2 comments

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at [email protected].

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

To Reproduce

Steps to reproduce the behavior:

hudi sink config

    'connector' = 'hudi',
    'hoodie.table.name' = 'xxx',
    'table.type' = 'COPY_ON_WRITE',
    'path' = 'xxxx',
    'hoodie.datasource.write.keygenerator.type' = 'COMPLEX',
    'hoodie.datasource.write.recordkey.field' = 'id',
    'hoodie.cleaner.policy' = 'KEEP_LATEST_FILE_VERSIONS',
    'hoodie.cleaner.fileversions.retained' = '20',
    'hoodie.keep.min.commits' = '30',
    'hoodie.keep.max.commits' = '40',
    'hoodie.cleaner.commits.retained' = '20',
    'write.operation' = 'upsert',
    'write.commit.ack.timeout' = '60000000',
    'write.sort.memory' = '128',
    'write.task.max.size' = '1024',
    'write.merge.max_memory' = '100',
    'write.tasks' = '96',
    'write.precombine' = 'true',
    'write.precombine.field' = 'meta_es_offset',
    'index.state.ttl' = '0',
    'index.global.enabled' = 'false',
    'hive_sync.enable' = 'true',
    'hive_sync.table' = 'xxx',
    'hive_sync.auto_create_db' = 'true',
    'hive_sync.mode' = 'hms',
    'hive_sync.metastore.uris' = 'xxx',
    'hive_sync.db' = 'xxx',
    'hoodie.datasource.write.partitionpath.field' = 'year,month,day',
    'hoodie.datasource.write.hive_style_partitioning' = 'true',
    'hive_sync.partition_fields' = 'year,month,day',
    'hive_sync.partition_extractor_class' = 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'index.bootstrap.enabled' = 'true'
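
For context, a minimal sketch of the full Flink SQL DDL these options would sit in might look like the following. The table name and column schema are assumptions for illustration; only the WITH options come from the report, and the placeholder values ('xxx', 'xxxx') are kept as-is.

```sql
-- Hypothetical DDL wrapping the reported sink options; schema is assumed,
-- not taken from the original issue.
CREATE TABLE hudi_sink (
  id STRING,
  meta_es_offset BIGINT,
  `year` STRING,
  `month` STRING,
  `day` STRING,
  PRIMARY KEY (id) NOT ENFORCED
) PARTITIONED BY (`year`, `month`, `day`) WITH (
  'connector' = 'hudi',
  'hoodie.table.name' = 'xxx',
  'table.type' = 'COPY_ON_WRITE',
  'path' = 'xxxx'
  -- ... remaining options as listed above
);
```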

Description:

After initializing the historical data, we set scan.startup.mode to timestamp on the source and moved the start timestamp ahead; duplicate records then appeared. If we instead restart the job from a checkpoint, the data is correct.
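
The scan.startup.mode option mentioned above belongs to the Flink Kafka connector, so the source table was presumably defined along these lines. This is a hedged sketch: the topic, bootstrap servers, timestamp value, format, and schema are all placeholders, not details from the report.

```sql
-- Hypothetical Kafka source illustrating a timestamp start position;
-- all values below are placeholders.
CREATE TABLE kafka_source (
  id STRING,
  meta_es_offset BIGINT
) WITH (
  'connector' = 'kafka',
  'topic' = 'xxx',
  'properties.bootstrap.servers' = 'xxx',
  'scan.startup.mode' = 'timestamp',
  'scan.startup.timestamp-millis' = '1660000000000',
  'format' = 'json'
);
```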

Duplicate data result:

[screenshot of duplicated rows omitted]
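
Duplicates on the record key could be confirmed with a query along these lines against the synced Hive table. This is an illustrative sketch, not from the report; the record key ('id') and partition columns follow the config above, and the table name placeholder is kept as-is.

```sql
-- Count rows per record key within each partition; any cnt > 1 is a duplicate.
SELECT id, `year`, `month`, `day`, COUNT(*) AS cnt
FROM xxx                           -- placeholder table name from the config
GROUP BY id, `year`, `month`, `day`
HAVING COUNT(*) > 1;
```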


Environment Description

  • Hudi version : 0.10.0

  • flink version : 1.13.1

  • Hive version : 3.1.2

  • Hadoop version : 3.1.0

  • Storage (HDFS/S3/GCS..) : s3

  • Running on Docker? (yes/no) : no


zwj0110 · Sep 09 '22

mark

Zhifeiyu · Sep 09 '22

@yuzhaojing @danny0405 Could any one of you chime in here?

yihua · Sep 21 '22

@yuzhaojing @danny0405: gentle ping. @zwj0110: feel free to close if the issue is resolved.

nsivabalan · Oct 23 '22

There are not enough details here. Can you try 0.12.1 and see if the duplicates still happen? I will close the issue here for now.

danny0405 · Nov 08 '22