clipmenu icon indicating copy to clipboard operation
clipmenu copied to clipboard

Hash collisions regularly occur between strings containing Chinese characters

Open vejkse opened this issue 7 months ago • 5 comments

Here’s how I can reproduce this:

  1. I kill clipmenud and remove /run/user/1000/clipmenu.7.1000.
  2. I run env CM_SELECTIONS=clipboard CM_OWN_CLIPBOARD=1 CM_DEBUG=1 clipmenud.
  3. I run printf 身 | xsel --clipboard. The character ‘身’ (0xE8 0xBA 0xAB in UTF-8) is in the clipboard.
  4. I run printf aaa | xsel --clipboard. Now ‘aaa’ is in the clipboard and at the top of clipmenu history.
  5. I run printf 車 | xsel --clipboard. The character ‘車‘ (0xE8 0xBB 0x8A in UTF-8) is not in the clipboard nor at the top of clipmenu history, but ‘身’ is.

I can’t see any significant difference in the debugging output for the third and fifth steps.

If I switch steps 3 and 5, I get the same behaviour, but with the rôle of the two characters switched: it’s now ‘車’ that is always in the clipboard/history and ‘身’ is unreachable.

It looks like there is a hash collision, since in both cases the clippings are stored in the same subdirectory 000000000B8AA672 of /run/user/1000/clipmenu.7.1000.

I’ve encountered the same kind of collision with other pairs of individual characters, but I didn’t record them. (But you can derive new collisions from the one above: ‘身車’ and ‘車身’, ‘aaaaaaaaaaaaaaaaaaaaaaaaa身aaaaaaaaaaaaaaaaaaaa’ and ‘aaaaaaaaaaaaaaaaaaaaaaaaa車aaaaaaaaaaaaaaaaaaaa’, etc.)

vejkse avatar May 26 '25 11:05 vejkse

This is a carryover from old clipmenu logic (in the old one we use cksum, not djb2, which has different properties). It's one of the blockers for clipmenu 8, and will be improved before release.

cdown avatar May 26 '25 14:05 cdown

I thought there would have been some decoding problem with the Chinese characters, but according to https://dmytry.blogspot.com/2009/11/horrible-hashes.html, even bA and ab collide, as I just checked.

vejkse avatar May 26 '25 16:05 vejkse

The decoding is fine, we just read them as uint. The choice to use a simple hash function is intentional: we need to handle collisions anyway, and this one is simple and good enough :-) It's just that there's no collision handling yet.

cdown avatar May 26 '25 16:05 cdown

3ddf819cc4951dfc6ae8df16bd6cb79498cdd694 will help with collisions. We should still do collision detection, but at least this avoids the immediate problem.

cdown avatar May 28 '25 18:05 cdown

Thank you!

Should we close this issue, since as it is literally titled, it should be solved, with collisions now happening only exceptionally?

vejkse avatar May 31 '25 09:05 vejkse

Thanks for reporting! Now both real world collisions are highly unlikely, and they will be rejected.

cdown avatar Jun 10 '25 23:06 cdown