Less locking in generateUUIDv7 #64507
Changelog category (leave one):
- Performance Improvement
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Performance improvement while generating UUIDv7
This is an automated comment for commit f6c2966e1643c292075f2e9bd20455876a6ee041 with description of existing statuses. It's updated for the latest CI running
✅ Click here to open a full report in a separate page
Successful checks
| Check name | Description | Status |
|---|---|---|
| Docs check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ✅ success |
@JosuGZ Before I proceed with a review, please sign the CLA first, thanks.
@JosuGZ Before I proceed with a review, please sign the CLA first, thanks.
Done!
@rschu1ze do you need some comments to clarify anything?
The main difference with the snowflakes is that the following overflow can't be done on UUIDv7 as increasing the timestamp means generating a new random counter. That means the range would have a random end.
Instead, I add "capacity" to the range, which shows how many UUIDs are can be tenerated with the range. That way, new ranges can be reserved if capacity is smaller than input_rows_count.
Everyone is still focussing on fixing the CI (stabilizing it), please expect some delay, sorry.
Everyone is still focussing on fixing the CI (stabilizing it), please expect some delay, sorry.
No problem, good luck with that!
Dear @rschu1ze, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.
Hi @rschu1ze, I have rebased the branch on top of master, can you take a look?
@JosuGZ Sorry the code might be faster now but it became so difficult to understand that I can't verify it. If you like to continue, feel free to but the PR would need to be simplified somehow.
I would appreciate knowing the part that you consider difficult to understand before I try to change anything. I tried to keep the code as close as possible to the snowflake code. Is it the while loop?
On snowflakes, the range looks like this:
However, on UUIDv7, following what the original code does (and the recommendations about UUIDv7), the range has holes, as updating the timestamp means initializing the counter to a random value. It looks like this:
That means that we can't just iterate from begin to end, so we have 3 options:
- Store the information about the holes on the range. This seems complicated.
- Assume that we will never have an overflow. With 42 bits this can be done, for example setting the first bit to 0 when generating the random value. This could break should the counter length be updated.
- Reserve a range that may have less capacity than
input_rows_count, and keep reserving ranges until we process all the rows.
My approach is the last one, and I have tested the code forcing the counter length to be 5 bits (to test the overflow). This approach also has the benefit of removing the overflow check inside the inner loop.