ksuid
ksuid copied to clipboard
KSUIDs do not sort properly.
On the Linux command line and in PostgreSQL, ksuids do not sort properly.
-
2SmasRGkif9qETAHwssB4GGYwdi
can be converted to the timestamp2023-07-19 04:13:36 -0400 EDT
-
2STcKMQ7MGgoDxpX9A5WuaJlx8V
can be converted to the timestamp2023-07-12 10:59:06 -0400 EDT
However, PostgreSQL and the Linux sort
command both sort 2SmasRGkif9qETAHwssB4GGYwdi
lower than 2STcKMQ7MGgoDxpX9A5WuaJlx8V
, because evidently, all lower-case letters sort before capital letters.
I'm assuming that the ksuid algorithm uses the logic that capital letters have a lower ASCII value:
Python 3.11.2 (main, May 30 2023, 17:45:26) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> ord('A')
65
>>> ord('a')
97
>>>
I love the project and use ksuid
s as much as possible, but this issue has bitten me at work and been the cause of some subtle bugs.
Is it impossible to fix in a backwards-compatible way?
There is an interesting Gist document related the problem: PostgreSQL collation is a massive footgun.
Also check this Gist comment: Functions for generating Segment's KSUIDs on PostgreSQL. They had to change the collation on their databases from en_US.UTF-8
to C.UTF-8
.
The README says that "running a set of KSUIDs through the UNIX sort
command will result in a list ordered by generation time". However, the UNIX sort
behaviour is affected by the locale variable, according to the sort
man page:
*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
I think adding a sentence in the README that specifies the correct collation can avoid this pitfall. Perhaps a warning is more effective, like the one above from the sort
manual page.