hudi
[SUPPORT] hive-sync
To Reproduce
Is there a hive_sync parameter that controls this behavior? I would like each synchronization to sync only the partitions touched by the incremental write, and to stop re-adding partitions that are missing from the Hive metastore. The reason is that I clean up historical Hive partitions so that the number of partitions in Hive stays stable instead of growing all the time.
Expected behavior
A clear and concise description of what you expected to happen.
Environment Description
- Hudi version : 0.14.1
- Spark version : spark3.3
- Hive version : 3.1.3
- Hadoop version : 3.3.6
- Storage (HDFS/S3/GCS..) : GCS
- Running on Docker? (yes/no) : no
This is actually caused by an inconsistency between the partition metadata in Hive and the partition metadata in Hudi.
In my opinion, we should change our approach: for example, when cleaning up a Hive partition, should we also delete the corresponding Hudi partition metadata?
Thank you for your reply, it is a good idea. Can you provide a method to safely delete the Hudi metadata? It feels like a dangerous operation.
> Because I will clean up the historical hive partition data to ensure that there is a stable amount of partition data in hive instead of growing all the time.
That's a pragmatic idea. Would you mind contributing it? It should be a minor piece of work.
@BruceKellan How are you planning to clean up your partitions?
Shouldn't we try to leverage the partition TTL support in Hudi to delete older partitions?
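For readers on a version without partition TTL, one way to keep Hive and Hudi consistent is to drop old partitions through Hudi itself, using the `delete_partition` write operation, rather than removing them only from Hive. The following is a minimal sketch, not a recommendation from the thread: the `dt=YYYY-MM-DD` partition layout, the helper names, and the table name are all hypothetical. Only the option keys `hoodie.table.name`, `hoodie.datasource.write.operation`, and `hoodie.datasource.write.partitions.to.delete` are existing Hudi write options.

```python
from datetime import date, timedelta


def expired_partitions(partition_paths, today, retain_days):
    """Return the partition paths that fall outside the retention window.

    Assumption: each path has the hypothetical form `dt=YYYY-MM-DD`.
    """
    cutoff = today - timedelta(days=retain_days)
    expired = []
    for path in partition_paths:
        # Take the value after the first '=' and parse it as an ISO date.
        value = path.split("=", 1)[1]
        if date.fromisoformat(value) < cutoff:
            expired.append(path)
    return sorted(expired)


def delete_partition_options(table_name, partitions):
    """Build Hudi write options for a `delete_partition` commit."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "delete_partition",
        # Comma-separated list of partition paths to drop.
        "hoodie.datasource.write.partitions.to.delete": ",".join(partitions),
    }


if __name__ == "__main__":
    known = ["dt=2024-01-01", "dt=2024-01-20", "dt=2024-02-01"]
    to_drop = expired_partitions(known, date(2024, 2, 10), retain_days=30)
    opts = delete_partition_options("my_table", to_drop)  # hypothetical table
    print(opts)
    # With a SparkSession and a DataFrame `df` matching the table schema
    # (it can be empty), the options would typically be applied like this:
    #   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

Because the drop is recorded as a commit on the Hudi timeline, Hudi's own partition metadata reflects the change, which avoids the mismatch described above. It is worth trying this on a test table first, since the physical data files are expected to be removed later by Hudi's cleaning rather than at commit time.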