[Spark][Improvement]: Use an ordered query for the duplicate key check
Search before asking
- [X] I have searched in the issues and found no similar issues.
What would you like to be improved?
Currently, Arctic does the duplicate key check by executing an additional group by + having count query. This is unnecessary, since Arctic already applies distribution and ordering before writing. If the input data has already been sorted by primary key, Arctic could instead do the duplicate key check during the write itself. A rough sketch of the current extra-pass check is shown below.
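The following is a minimal illustration of what such a pre-write check looks like; the helper name and API are hypothetical, not Arctic's actual code:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.count

// Hypothetical sketch of the current approach: a separate aggregation pass
// over the input before the actual write, which costs an extra shuffle/scan.
def assertNoDuplicateKeys(input: DataFrame, primaryKeys: Seq[String]): Unit = {
  val duplicated = input
    .groupBy(primaryKeys.map(input(_)): _*)
    .agg(count("*").as("cnt"))
    .where("cnt > 1")
  if (!duplicated.isEmpty) {
    throw new IllegalStateException("duplicate primary keys found in input data")
  }
}
```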
How should we improve?
Force a Sort(global=false) node to be added before writing, rewrite the WriteExec, and perform the duplicate key check by comparing the key of the current row with the key of the previous row while processing partition writes. A sketch of this per-row comparison follows.
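A minimal sketch of the proposed in-write check, assuming rows arrive sorted by primary key within each write task (which the Sort(global=false) node guarantees). The class name SortedDuplicateChecker and the keyOf extractor are hypothetical:

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Assumes keyOf copies the key values out of the row, since Spark may reuse
// the same InternalRow instance across iterations.
class SortedDuplicateChecker(keyOf: InternalRow => Seq[Any]) {
  private var prevKey: Option[Seq[Any]] = None

  // Called once per row by the partition writer; because input is sorted by
  // primary key, duplicates must appear on adjacent rows.
  def check(row: InternalRow): Unit = {
    val key = keyOf(row)
    if (prevKey.contains(key)) {
      throw new IllegalStateException(s"duplicate primary key: $key")
    }
    prevKey = Some(key)
  }
}
```

Because the input is locally sorted, this replaces a full extra aggregation pass with an O(1) comparison per row during the write.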
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
Subtasks
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct