
[SUPPORT] RO table did not get updated while RT table did

Open satishmalladi-m opened this issue 3 years ago • 6 comments

Hi

We are facing an issue with a MOR table. We first ran a bulk_insert batch load of about 9 million records, which produced two tables, RT and RO. After the bulk_insert everything looks good, but when we upsert delta records, only the RT table gets updated; the RO table does not. Could you please help with this issue?

satishmalladi-m avatar Jul 14 '22 11:07 satishmalladi-m

Looks like a meta sync issue where the RO table is not getting synced. Please provide scripts and configs for reproducing, then we can help from there.

xushiyan avatar Jul 17 '22 04:07 xushiyan

For the RO table, delta commits only become visible after a compaction action runs.

KnightChess avatar Jul 18 '22 04:07 KnightChess
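A quick way to check whether compaction has actually happened is to look at the table's `.hoodie` timeline: upserts on a MOR table produce completed `.deltacommit` files, while a finished compaction produces a completed `.commit` file. A minimal sketch, assuming the default timeline layout (`table_path` is a hypothetical base path):

```python
import os

def compaction_status(table_path):
    """Scan a Hudi table's timeline directory.

    Upserts on a MOR table leave completed `.deltacommit` files;
    only a finished compaction leaves a completed `.commit` file,
    which is what makes new data visible through the RO view.
    Returns (has_completed_compaction, num_delta_commits).
    """
    timeline = os.path.join(table_path, ".hoodie")
    files = os.listdir(timeline)
    delta_commits = [f for f in files if f.endswith(".deltacommit")]
    commits = [f for f in files if f.endswith(".commit")]
    return len(commits) > 0, len(delta_commits)
```

If this reports delta commits but no completed compaction, the RO view will keep serving only the pre-upsert Parquet data.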

Please find below the configuration we are currently using:

    hudi_options = {
        'hoodie.datasource.write.table.type': self._write_table_type,
        'hoodie.table.name': self._table_name,
        'hoodie.datasource.write.recordkey.field': self._record_key,
        'hoodie.datasource.write.partitionpath.field': self._partition_field,
        'hoodie.datasource.write.precombine.field': self._combine_key,
        'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
        'hoodie.parquet.max.file.size': "20971520",
        'hoodie.datasource.hive_sync.enable': 'true',
        'hoodie.datasource.hive_sync.table': self._table_name.lower(),
        'hoodie.datasource.hive_sync.partition_fields': self._partition_field,
        'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
        'hoodie.datasource.hive_sync.database': self._hive_database.lower(),
        'hoodie.datasource.write.hive_style_partitioning': 'true',
        'hoodie.datasource.hive_sync.mode': 'hms',
        'hoodie.datasource.hive_sync.support_timestamp': 'true'
    }

satishmalladi-m avatar Jul 18 '22 05:07 satishmalladi-m

As mentioned by @KnightChess, the RT and RO views are synced when you run compaction on a Hudi MOR table, i.e. when the delta (Avro) log files are merged into the base Parquet files. In Hudi:

    COW tables == data in Parquet files
    MOR tables == data in Avro log files + Parquet files
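To make the two views concrete, a MOR table can be queried in either mode from Spark via `hoodie.datasource.query.type` (a sketch; `spark` and `base_path` are assumed to exist in your session):

```python
# Query-type options for a Hudi MOR table.
RT_OPTIONS = {"hoodie.datasource.query.type": "snapshot"}        # RT view: Parquet + Avro logs
RO_OPTIONS = {"hoodie.datasource.query.type": "read_optimized"}  # RO view: compacted Parquet only

# Usage with a SparkSession:
#   rt_df = spark.read.format("hudi").options(**RT_OPTIONS).load(base_path)
#   ro_df = spark.read.format("hudi").options(**RO_OPTIONS).load(base_path)
```

The snapshot query merges log files at read time, so upserts show up immediately; the read-optimized query skips the merge, which is why it lags behind until compaction runs.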

You can read the configs from the docs here

Some sample configs you could try:

## Compaction
    'hoodie.compact.inline.max.delta.seconds' : 60,
    'hoodie.compact.inline.max.delta.commits' : 4,
    'hoodie.compact.inline.trigger.strategy' : 'NUM_OR_TIME',
    'hoodie.compact.inline' : True,
    'hoodie.datasource.compaction.async.enable' : True,

This will trigger compaction every 60 seconds or after every 4 delta commits, whichever comes first, for a streaming job. Read more about compaction in Hudi here
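Putting it together, these compaction settings would be merged into the writer options from the earlier comment before the upsert. A sketch using the inline-compaction settings (the `hudi_options` stub here abbreviates the full dict posted above; names and values are illustrative):

```python
hudi_options = {  # abbreviated stand-in for the full options dict above
    'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
    'hoodie.table.name': 'my_table',
}

compaction_options = {
    'hoodie.compact.inline.max.delta.seconds': 60,
    'hoodie.compact.inline.max.delta.commits': 4,
    'hoodie.compact.inline.trigger.strategy': 'NUM_OR_TIME',
    'hoodie.compact.inline': 'true',
}

# Later dicts win on key collisions, so compaction settings
# override any same-named keys in hudi_options.
write_options = {**hudi_options, **compaction_options}

# Usage (sketch):
#   df.write.format('hudi').options(**write_options).mode('append').save(base_path)
```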

srehman420 avatar Jul 26 '22 15:07 srehman420

@satishmalladi-m as mentioned by @KnightChess and @glory9211, it's possible that compaction has not run, which would explain the RO table not being updated. Can you confirm whether the sync succeeds after compaction?

xushiyan avatar Aug 09 '22 23:08 xushiyan

@satishmalladi-m @KnightChess @glory9211 : any updates on this?

nsivabalan avatar Aug 16 '22 07:08 nsivabalan

@satishmalladi-m : any updates, please?

nsivabalan avatar Aug 28 '22 00:08 nsivabalan

Analysis and suggestions were provided above. Closing due to inactivity.

xushiyan avatar Oct 30 '22 17:10 xushiyan