ClickHouse Less locking in generateUUIDv7 #64507

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Performance improvement while generating UUIDv7

Jul 31 '24 22:07 JosuGZ

All committers have signed the CLA.

Jul 31 '24 22:07 CLAassistant

This is an automated comment for commit f6c2966e1643c292075f2e9bd20455876a6ee041 with description of existing statuses. It's updated for the latest CI running

✅ Click here to open a full report in a separate page

Successful checks

Check name	Description	Status
Docs check	Builds and tests the documentation	✅ success
Fast test	Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here	✅ success
Style check	Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report	✅ success

Aug 01 '24 22:08 robot-ch-test-poll2

@JosuGZ Before I proceed with a review, please sign the CLA first, thanks.

Aug 05 '24 10:08 rschu1ze

@JosuGZ Before I proceed with a review, please sign the CLA first, thanks.

Done!

Aug 05 '24 11:08 JosuGZ

@rschu1ze do you need some comments to clarify anything?

The main difference with the snowflakes is that the following overflow can't be done on UUIDv7 as increasing the timestamp means generating a new random counter. That means the range would have a random end.

Instead, I add "capacity" to the range, which shows how many UUIDs are can be tenerated with the range. That way, new ranges can be reserved if capacity is smaller than input_rows_count.

Aug 12 '24 12:08 JosuGZ

Everyone is still focussing on fixing the CI (stabilizing it), please expect some delay, sorry.

Aug 12 '24 19:08 rschu1ze

Everyone is still focussing on fixing the CI (stabilizing it), please expect some delay, sorry.

No problem, good luck with that!

Aug 13 '24 11:08 JosuGZ

Dear @rschu1ze, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

Sep 17 '24 13:09 clickhouse-gh[bot]

Hi @rschu1ze, I have rebased the branch on top of master, can you take a look?

Oct 02 '24 05:10 JosuGZ

@JosuGZ Sorry the code might be faster now but it became so difficult to understand that I can't verify it. If you like to continue, feel free to but the PR would need to be simplified somehow.

Oct 14 '24 20:10 rschu1ze

I would appreciate knowing the part that you consider difficult to understand before I try to change anything. I tried to keep the code as close as possible to the snowflake code. Is it the while loop?

On snowflakes, the range looks like this:

However, on UUIDv7, following what the original code does (and the recommendations about UUIDv7), the range has holes, as updating the timestamp means initializing the counter to a random value. It looks like this:

That means that we can't just iterate from begin to end, so we have 3 options:

Store the information about the holes on the range. This seems complicated.
Assume that we will never have an overflow. With 42 bits this can be done, for example setting the first bit to 0 when generating the random value. This could break should the counter length be updated.
Reserve a range that may have less capacity than input_rows_count, and keep reserving ranges until we process all the rows.

My approach is the last one, and I have tested the code forcing the counter length to be 5 bits (to test the overflow). This approach also has the benefit of removing the overflow check inside the inner loop.

Oct 15 '24 10:10 JosuGZ