
[AMORO-3252] Fix primary key duplicate exception when concurrently inserting table optimization process entries into database

Open Jzjsnow opened this issue 7 months ago • 8 comments

Why are the changes needed?

Currently the primary keys of the tables table_optimizing_process, task_runtime, and optimizing_task_quota contain only process_id and not table_id, and process_id is derived from the current timestamp, e.g. 1744702084683. When multiple tables are optimized concurrently, different tables can easily get the same process_id at the same trigger time, so persistence fails with a primary key duplication exception (org.apache.ibatis.exceptions.PersistenceException).
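As a minimal sketch (illustrative only, not Amoro code; all names here are made up), the collision can be reproduced by letting several "optimizing triggers" derive their id from System.currentTimeMillis() in the same instant:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TimestampIdCollision {

  // Stand-in for the timestamp-based process_id generation described above.
  static long newProcessId() {
    return System.currentTimeMillis(); // e.g. 1744702084683
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(8);
    Set<Long> seen = ConcurrentHashMap.newKeySet();
    CountDownLatch start = new CountDownLatch(1);
    for (int i = 0; i < 8; i++) {
      pool.submit(() -> {
        start.await(); // release all "table optimizing" triggers at once
        long id = newProcessId();
        if (!seen.add(id)) {
          // Inserting a second row with the same process_id primary key is what
          // would surface as a PersistenceException during persistence.
          System.out.println("duplicate process_id: " + id);
        }
        return null;
      });
    }
    start.countDown();
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.SECONDS);
  }
}
```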

Close #3252.

Brief change log

  • Add table_id to the primary keys of table_optimizing_process, task_runtime, and optimizing_task_quota respectively.

How was this patch tested?

  • [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • [ ] Add screenshots for manual tests if appropriate

  • [x] Run test locally before making a pull request

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Jzjsnow avatar Apr 15 '25 13:04 Jzjsnow

Hi, Thanks for driving this.

After listing all SQL related to table_optimizing_process, I found that some statements do not include table_name in their conditions, which may lead to poor performance, for example: https://github.com/apache/amoro/blob/master/amoro-ams/src/main/java/org/apache/amoro/server/persistence/mapper/OptimizingMapper.java#L147

I think the table_optimizing_process might only need a globally unique ID as the primary key, without necessarily including table_name. However, the current ID generation rules do indeed carry a risk of duplication.

Currently, we use the current-timestamp generation rule, which I understand makes cleanup operations easier. However, perhaps we should optimize the current cleanup logic instead, for example by partitioning this table and using DROP PARTITION to perform cleanup more efficiently. #3445 tracks this cleanup improvement.

zhoujinsong avatar Apr 28 '25 08:04 zhoujinsong

@zhoujinsong Thanks for the viewpoint. I think expiring all tables during cleanup, as discussed in #3445, can greatly improve performance. On the other hand, I still think it is necessary to keep table_id in the query SQL for table_optimizing_process / task_runtime / optimizing_task_quota, because we may still need to process the optimization information of a single table.

For example, as mentioned in #3445, we had better clean up all the optimization information of a table immediately when it is dropped, and in #3554, when we clear orphan table information during table service initialization, we still need table_id to filter out the optimizing entries related to invalid tables.

Jzjsnow avatar May 12 '25 07:05 Jzjsnow

I think the table_optimizing_process might only need a globally unique ID as the primary key, without necessarily including table_name. However, the current ID generation rules do indeed carry a risk of duplication.

Thanks for the tip about process_id as the unique key. I have since dug into how the table_optimizing_process table is used and found that process_id is critical as a primary key, for example for canceling an optimizing process (from the front end) and for updating the taskRuntime.

I agree that using timestamps as process_id is better for cleanup performance, and I think there is a clearer and more concise solution to the current duplicate-timestamp problem: duplicate process_id values occur because TableOptimizingProcess#planInternal() is executed with high concurrency.

In the latest commit, I generate the process_id from a timestamp before the asynchronous planning starts, to ensure the uniqueness of the process_id. PTAL.
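For illustration, here is a rough sketch of that idea with hypothetical names (newProcessId, planExecutor, planInternal, and the monotonic bump are mine, not the actual code in the commit): the id is taken on the single scheduling thread before the asynchronous planning is submitted, so plans for different tables cannot end up with the same id.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class ProcessIdBeforeAsyncPlan {
  private final AtomicLong lastId = new AtomicLong();
  private final ExecutorService planExecutor = Executors.newFixedThreadPool(4);

  // Called only from the single scheduling thread, so ids are handed out serially.
  long newProcessId() {
    // One possible way to keep the timestamp-style id strictly increasing even if
    // two triggers land in the same millisecond (assumption, not the actual code).
    return lastId.updateAndGet(prev -> Math.max(prev + 1, System.currentTimeMillis()));
  }

  void triggerOptimizing(String tableName) {
    long processId = newProcessId(); // generated BEFORE the asynchronous planning
    planExecutor.submit(() -> planInternal(tableName, processId)); // concurrent part
  }

  private void planInternal(String tableName, long processId) {
    System.out.println(tableName + " -> process " + processId);
  }

  public static void main(String[] args) {
    ProcessIdBeforeAsyncPlan scheduler = new ProcessIdBeforeAsyncPlan();
    scheduler.triggerOptimizing("db.table_a");
    scheduler.triggerOptimizing("db.table_b");
    scheduler.planExecutor.shutdown();
  }
}
```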

Jzjsnow avatar May 19 '25 10:05 Jzjsnow

@Jzjsnow Hi, after resolving the conflict, we can continue with the review.

czy006 avatar Jun 04 '25 10:06 czy006

@Jzjsnow Hi, after resolving the conflict, we can continue with the review.

@czy006 Sure, I've resolved the conflict and rebased the changes. PTAL

Jzjsnow avatar Jun 04 '25 11:06 Jzjsnow

LGTM for the new change: it only changes the planTime generation logic so that planTime is generated in a single thread, which avoids the previous duplication of planTime across different tables.

I'm not sure whether planTime could still be duplicated in a multi-node scenario, but that can be addressed in a separate issue if the problem exists.

klion26 avatar Jun 05 '25 09:06 klion26

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 28.28%. Comparing base (e69710a) to head (e950979). Report is 4 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3520      +/-   ##
============================================
+ Coverage     21.76%   28.28%   +6.51%     
- Complexity     2391     3711    +1320     
============================================
  Files           436      617     +181     
  Lines         40498    49794    +9296     
  Branches       5743     6434     +691     
============================================
+ Hits           8816    14085    +5269     
- Misses        30935    34698    +3763     
- Partials        747     1011     +264     
Flag Coverage Δ
core 28.28% <100.00%> (?)
trino ?

Flags with carried forward coverage won't be shown.

:umbrella: View full report in Codecov by Sentry.

codecov-commenter avatar Jun 05 '25 12:06 codecov-commenter

I am not sure the new change fully resolves the process ID generation conflicts: different groups will generate processes at the same time, which may still produce the same process ID.

HDYT?

zhoujinsong avatar Jun 16 '25 03:06 zhoujinsong

@xxubai @zhoujinsong Thanks a lot for the feedback!

We've made the change to use Snowflake IDs instead of timestamps. This should fix the processId uniqueness issue between different groups mentioned above, especially when processes are generated at the same time.
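For context, a minimal illustrative Snowflake-style generator (not the generator used in this PR; the epoch and bit layout below are assumptions) packs a millisecond timestamp, a node/worker id, and a per-millisecond sequence into one long, so ids stay unique across nodes and groups even within the same millisecond:

```java
public class SnowflakeIdSketch {
  private static final long EPOCH = 1700000000000L; // custom epoch (assumption)
  private final long nodeId;   // distinguishes AMS nodes / optimizer groups
  private long lastTimestamp = -1L;
  private long sequence = 0L;

  SnowflakeIdSketch(long nodeId) {
    this.nodeId = nodeId & 0x3FF; // 10 bits
  }

  synchronized long nextId() {
    long now = System.currentTimeMillis();
    if (now == lastTimestamp) {
      sequence = (sequence + 1) & 0xFFF; // 12-bit sequence within one millisecond
      if (sequence == 0) {
        // Sequence exhausted for this millisecond: busy-wait for the next one.
        while ((now = System.currentTimeMillis()) <= lastTimestamp) { }
      }
    } else {
      sequence = 0; // note: a clock moving backwards is not handled in this sketch
    }
    lastTimestamp = now;
    // 41-bit timestamp | 10-bit node id | 12-bit sequence
    return ((now - EPOCH) << 22) | (nodeId << 12) | sequence;
  }

  public static void main(String[] args) {
    SnowflakeIdSketch gen = new SnowflakeIdSketch(1);
    System.out.println(gen.nextId());
    System.out.println(gen.nextId()); // distinct even within the same millisecond
  }
}
```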

Hope this clears things up. Please take a look when you are free.

Jzjsnow avatar Jul 15 '25 13:07 Jzjsnow