Hash collisions regularly occur between strings containing Chinese characters
Here’s how I can reproduce this:
- I kill
clipmenudand remove/run/user/1000/clipmenu.7.1000. - I run
env CM_SELECTIONS=clipboard CM_OWN_CLIPBOARD=1 CM_DEBUG=1 clipmenud. - I run
printf 身 | xsel --clipboard. The character ‘身’ (0xE8 0xBA 0xAB in UTF-8) is in the clipboard. - I run
printf aaa | xsel --clipboard. Now ‘aaa’ is in the clipboard and at the top of clipmenu history. - I run
printf 車 | xsel --clipboard. The character ‘車‘ (0xE8 0xBB 0x8A in UTF-8) is not in the clipboard nor at the top of clipmenu history, but ‘身’ is.
I can’t see any significant difference in the debugging output for the third and fifth steps.
If I switch steps 3 and 5, I get the same behaviour, but with the rôle of the two characters switched: it’s now ‘車’ that is always in the clipboard/history and ‘身’ is unreachable.
It looks like there is a hash collision, since in both cases the clippings are stored in the same subdirectory 000000000B8AA672 of /run/user/1000/clipmenu.7.1000.
I’ve encountered the same kind of collision with other pairs of individual characters, but I didn’t record them. (But you can derive new collisions from the one above: ‘身車’ and ‘車身’, ‘aaaaaaaaaaaaaaaaaaaaaaaaa身aaaaaaaaaaaaaaaaaaaa’ and ‘aaaaaaaaaaaaaaaaaaaaaaaaa車aaaaaaaaaaaaaaaaaaaa’, etc.)
This is a carryover from old clipmenu logic (in the old one we use cksum, not djb2, which has different properties). It's one of the blockers for clipmenu 8, and will be improved before release.
I thought there would have been some decoding problem with the Chinese characters, but according to https://dmytry.blogspot.com/2009/11/horrible-hashes.html, even bA and ab collide, as I just checked.
The decoding is fine, we just read them as uint. The choice to use a simple hash function is intentional: we need to handle collisions anyway, and this one is simple and good enough :-) It's just that there's no collision handling yet.
3ddf819cc4951dfc6ae8df16bd6cb79498cdd694 will help with collisions. We should still do collision detection, but at least this avoids the immediate problem.
Thank you!
Should we close this issue, since as it is literally titled, it should be solved, with collisions now happening only exceptionally?
Thanks for reporting! Now both real world collisions are highly unlikely, and they will be rejected.