[AMORO-3252] Fix primary key duplicate exception when concurrently inserting table optimization process entries into database
Why are the changes needed?
Currently the primary keys of the tables `table_optimizing_process`, `task_runtime`, and `optimizing_task_quota` contain only `process_id` and not `table_id`, where `process_id` is generated from the current timestamp, e.g. 1744702084683. When multiple tables are optimized concurrently, different tables can easily be assigned the same `process_id` at the same trigger time, causing a duplicate primary key exception (`org.apache.ibatis.exceptions.PersistenceException`) during persistence.
Close #3252.
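For illustration, here is a minimal Java sketch (a hypothetical standalone class, not Amoro code) of why a bare millisecond timestamp is not a safe ID under concurrency: threads triggered in the same millisecond all observe the same `System.currentTimeMillis()` value.

```java
// Sketch: several "tables" requesting a process_id at the same instant.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class TimestampIdCollision {
  public static void main(String[] args) throws InterruptedException {
    int tables = 8;
    Set<Long> ids = ConcurrentHashMap.newKeySet();
    CountDownLatch start = new CountDownLatch(1);
    CountDownLatch done = new CountDownLatch(tables);
    for (int i = 0; i < tables; i++) {
      new Thread(() -> {
        try {
          start.await();                       // release all threads at once
          ids.add(System.currentTimeMillis()); // the timestamp-as-ID rule
        } catch (InterruptedException ignored) {
        } finally {
          done.countDown();
        }
      }).start();
    }
    start.countDown();
    done.await();
    // With only process_id in the primary key, fewer distinct IDs than tables
    // means a duplicate-key error on insert.
    System.out.printf("tables=%d, distinct ids=%d%n", tables, ids.size());
  }
}
```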
Brief change log
- Add `table_id` to the primary keys of `table_optimizing_process`, `task_runtime`, and `optimizing_task_quota` respectively.
How was this patch tested?
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
- [ ] Add screenshots for manual tests if appropriate
- [x] Run test locally before making a pull request
Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
Hi, thanks for driving this.
After I listed all SQL related to `table_optimizing_process`, I found that the condition of some SQL does not contain `table_name`, which may cause poor performance, like: https://github.com/apache/amoro/blob/master/amoro-ams/src/main/java/org/apache/amoro/server/persistence/mapper/OptimizingMapper.java#L147
I think `table_optimizing_process` might only need a globally unique ID as the primary key, without necessarily including `table_name`. However, the current ID generation rules do indeed carry a risk of duplication.
Currently, we use the current-timestamp generation rule, which I understand is for easier cleanup operations. However, perhaps we should optimize the current cleanup logic, for example by partitioning this table and using DROP PARTITION to perform cleanup more efficiently.
#3445 is following the cleaning improvement issue.
@zhoujinsong Thanks for the viewpoint. I think expiring all tables during cleanup optimization can greatly improve optimization performance, as discussed in #3445. On the other hand, I think it is still necessary to keep the `table_id` in the query SQL related to the tables `table_optimizing_process`/`task_runtime`/`optimizing_task_quota`, because we may still need to process the optimization information of a single table.
For example, as mentioned in #3445, we'd better clean up all the optimization information of a table immediately when it is dropped, and in #3554, when we clear orphan table information during table service initialization, we still need `table_id` to filter out the optimizing entries related to the invalid tables.
> I think `table_optimizing_process` might only need a globally unique ID as the primary key, without necessarily including `table_name`. However, the current ID generation rules do indeed carry a risk of duplication.
Thanks for the tip about the `process_id` unique key; I have since dug into the use of the `table_optimizing_process` table in depth. I found that `process_id` is critical as a primary key, for operations such as canceling the optimizing process (from the front end), updating the taskRuntime, etc.
I agree that using timestamps as `process_id` is better for cleanup performance, and I think there is a clearer and more concise solution to the current duplicate-timestamp problem: the duplicate `process_id` values occur because of the highly concurrent execution of `TableOptimizingProcess#planInternal()`.
In the latest commit, I generate the `process_id` with a timestamp before the asynchronous planning step, to ensure the uniqueness of the `process_id`. PTAL.
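As a rough illustration of that idea (class and method names below are simplified stand-ins, not the actual Amoro code), the ID is allocated on the single scheduling thread before the asynchronous hop; the monotonic bump for same-millisecond calls is an extra assumption of this sketch:

```java
// Sketch: allocate process_id before handing planning off to a thread pool,
// so concurrent planInternal() calls can never observe the same timestamp.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PlanScheduler {
  private final ExecutorService planExecutor = Executors.newFixedThreadPool(4);
  private long lastProcessId = 0L;

  // Called from the single scheduler thread: timestamp-based IDs stay unique here.
  private synchronized long nextProcessId() {
    long id = System.currentTimeMillis();
    if (id <= lastProcessId) {
      id = lastProcessId + 1; // same millisecond: bump to keep IDs strictly increasing
    }
    lastProcessId = id;
    return id;
  }

  void schedulePlan(String tableId) {
    long processId = nextProcessId(); // generated before the async hop
    planExecutor.submit(() -> planInternal(tableId, processId));
  }

  private void planInternal(String tableId, long processId) {
    // ... persist table_optimizing_process with the pre-allocated processId ...
  }
}
```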
@Jzjsnow Hi, after resolving the conflict, we can continue with the review.
@czy006 Sure, I've resolved the conflict and rebased the changes. PTAL
LGTM for the new change. It only changes the planTime generation logic, and since planTime is now generated in a single thread, the change avoids the earlier duplication of planTime across different tables.
I'm not sure if there's a chance of duplication of planTime in a multi-node scenario, but that can be addressed in a separate issue if it exists.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 28.28%. Comparing base (`e69710a`) to head (`e950979`). Report is 4 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #3520 +/- ##
============================================
+ Coverage 21.76% 28.28% +6.51%
- Complexity 2391 3711 +1320
============================================
Files 436 617 +181
Lines 40498 49794 +9296
Branches 5743 6434 +691
============================================
+ Hits 8816 14085 +5269
- Misses 30935 34698 +3763
- Partials 747 1011 +264
| Flag | Coverage Δ | |
|---|---|---|
| core | 28.28% <100.00%> (?) | |
| trino | ? | |
Flags with carried forward coverage won't be shown.
:umbrella: View full report in Codecov by Sentry.
I am not sure the new change can resolve the conflicts in process ID generation: different groups will generate processes at the same time, which may produce the same process ID.
HDYT?
@xxubai @zhoujinsong Thanks a lot for the feedback!
We've made the change to use Snowflake IDs instead of timestamps. This should fix the processId uniqueness issue between different groups mentioned above, especially when processes are generated at the same time.
Hope this clears things up. Please take a look when you are free.
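For reference, here is a minimal Snowflake-style generator sketch in Java (an illustration of the general technique, not necessarily the generator adopted in this PR; the epoch and bit layout below are assumptions): 41 bits of milliseconds since a custom epoch, 10 bits of node ID, and 12 bits of per-millisecond sequence, so IDs stay unique across groups even within the same millisecond.

```java
// Sketch of a Snowflake-style unique ID generator.
class SnowflakeIdGenerator {
  private static final long EPOCH = 1704067200000L; // 2024-01-01T00:00:00Z, assumed
  private final long nodeId;                        // 0..1023, must be unique per node
  private long lastTimestamp = -1L;
  private long sequence = 0L;

  SnowflakeIdGenerator(long nodeId) {
    this.nodeId = nodeId & 0x3FF;
  }

  synchronized long nextId() {
    long now = System.currentTimeMillis();
    if (now == lastTimestamp) {
      sequence = (sequence + 1) & 0xFFF;   // up to 4096 IDs per millisecond
      if (sequence == 0) {
        while (now <= lastTimestamp) {     // sequence exhausted: wait for next tick
          now = System.currentTimeMillis();
        }
      }
    } else {
      sequence = 0L;
    }
    lastTimestamp = now;
    // timestamp | node | sequence, packed into one 63-bit positive long
    return ((now - EPOCH) << 22) | (nodeId << 12) | sequence;
  }
}
```

Unlike a raw timestamp, IDs generated this way still sort roughly by time, so a cleanup strategy based on ID ranges remains workable.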