jjtjiang
this is the config, thanks:

```scala
.option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
.option(DataSourceWriteOptions.STREAMING_IGNORE_FAILED_BATCH.key(), "false")
.option(HoodieWriteConfig.INSERT_PARALLELISM_VALUE.key, "40")
.option(HoodieWriteConfig.BULKINSERT_PARALLELISM_VALUE.key, "40")
.option(HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key, "40")
.option(HoodieWriteConfig.DELETE_PARALLELISM_VALUE.key, "40")
// index options
.option(HoodieIndexConfig.INDEX_TYPE.key(), HoodieIndex.IndexType.BLOOM.name())
.option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH_ENABLE.key(), "true")
// merge / compaction
.option(DataSourceWriteOptions.HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE.key(), "true")
.option(HoodieMemoryConfig.MAX_MEMORY_FRACTION_FOR_MERGE.key(), "0.8")
.option(HoodieMemoryConfig.MAX_MEMORY_FRACTION_FOR_COMPACTION.key(), "0.8")
// metadata...
```
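For context, here is a minimal sketch of how options like these are usually attached to a Spark Structured Streaming write into Hudi. The source DataFrame, table name, record key / partition fields, and paths below are assumptions for illustration, not taken from this thread:

```scala
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.streaming.Trigger

// Assumed: `sourceDf` is the incoming streaming DataFrame; paths and
// key/partition fields are placeholders chosen to match the record key
// shown later in the thread (order_no, order_type).
val query = sourceDf.writeStream
  .format("hudi")
  .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD.key(), "order_no,order_type")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "month")
  .option(HoodieWriteConfig.TBL_NAME.key(), "hudi_trips")
  .option("checkpointLocation", "/tmp/hudi_checkpoint")
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .outputMode("append")
  .start("/tmp/hudi_trips")
```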
In this test, we did not change the index; we only used the bloom index. Through the test, I saw a strange phenomenon: at the beginning, the data was duplicated,...
> > > In this test, we did not change the index; we only used the bloom index. Through the test, I saw a strange phenomenon. At the beginning, the...
> what sql?

```sql
select _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, _hoodie_file_name
from hudi_trips_snapshot
where order_no = 128920931 and order_type = 1
```

```
+-------------------+----------------------------+-------------------------------+----------------------+--------------------------------------+
|_hoodie_commit_time|        _hoodie_commit_seqno|             _hoodie_record_key|_hoodie_partition_path|                     _hoodie_file_name|
+-------------------+----------------------------+-------------------------------+----------------------+--------------------------------------+
|  20220602081230746| 20220602081230746_81_327042|order_no:128920931,order_type:1|               2022-02|1a02bc31-cb63-4d7d-af81-b34eae295eef-0|
|  20220602081230746| 20220602081230746_81_327042|order_no:128920931,order_type:1|               2022-02|1a02bc31-cb63-4d7d-af81-b34eae295eef-0|
|  20220602081230746| 20220602081230746_81_327042|order_no:128920931,order_type:1|               2022-02|1a02bc31-cb63-4d7d-af81-b34eae295eef-0|
|...
```
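As a side note, a quick way to confirm duplicates at the key level is to count rows per record key in the same snapshot view. This is a sketch, assuming `spark` is the active SparkSession and `hudi_trips_snapshot` is already registered as a view:

```scala
// Count how many rows share each record key; any count > 1 is a duplicate.
spark.sql(
  """select _hoodie_record_key, count(*) as cnt
    |from hudi_trips_snapshot
    |where order_no = 128920931 and order_type = 1
    |group by _hoodie_record_key
    |having count(*) > 1""".stripMargin).show(false)
```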
> And those records will be merged in the compaction process, which could justify the result you see, i.e., no duplication after a while (after the compaction). Without deduplication, this...
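If the duplicates only disappear once compaction runs, one thing to experiment with (a sketch, not something recommended in this thread) is making inline compaction fire more often on the MOR table. The config keys below are standard Hudi keys; the values are purely illustrative:

```scala
// Illustrative extra writer options: compact inline after every 2 delta commits.
val compactionOpts = Map(
  "hoodie.compact.inline" -> "true",
  "hoodie.compact.inline.max.delta.commits" -> "2"
)
```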
> hoodie.datasource.merge.type

My engine is Spark, and I use the default payload class: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload. Spark version is 3.0, Hudi version is 0.10.1. So when we read, it should deduplicate...
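For reference, a sketch of a snapshot read that sets hoodie.datasource.merge.type explicitly (the base path is a placeholder, not from this thread); with the default OverwriteWithLatestAvroPayload, payload_combine should merge log records against the base file on read:

```scala
import org.apache.hudi.DataSourceReadOptions

// Snapshot query on the MOR table, forcing payload-based merging on read.
val snapshotDf = spark.read
  .format("hudi")
  .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
  .option(DataSourceReadOptions.REALTIME_MERGE.key(), DataSourceReadOptions.REALTIME_PAYLOAD_COMBINE_OPT_VAL)
  .load("/tmp/hudi_trips")

snapshotDf.createOrReplaceTempView("hudi_trips_snapshot")
```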
> Yes, it should do dedup for log files. I'll test the default behavior when reading, to see if there is any potential bug.

This is my code and some...
@ad1happy2go I also face this problem. Version: Hudi 0.12.3. How to reproduce the issue: just use the insert overwrite SQL when inserting a big table. Here is...
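For completeness, a minimal sketch of the kind of statement described above; the table and source names are placeholders, not from this report:

```scala
// Hypothetical repro shape: overwrite a large Hudi target table from a big source table.
spark.sql(
  """insert overwrite table hudi_target_table
    |select * from big_source_table""".stripMargin)
```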