woodstox icon indicating copy to clipboard operation
woodstox copied to clipboard

Improve hashing implementation used for symbol tables for element and attribute names

Open cowtowncoder opened this issue 8 years ago • 0 comments

Existing symbol-table implementation uses hashing implementation similar to default JDK String.hashCode(). While this has benefits like simplicity, and comparable performance to standard String hashing, there are some well-documented issues regarding hash-collisions. It would be good to improve these aspects, to avoid most obvious problems.

There are many possible ways to improve handling; one of first things to check could be to see if and how changes to Jackson handling could be adopted. Jackson's handling differs a bit (since it uses straight-from-bytes to char[] approach, like Aalto) but there may be things to take. Other sources should be considered too, including hash algorithms that do not use simple multiply-then-append, and are not as easy to generate collisions against.

One idea not used by Jackson, but potentially useful here could be consideration of JDK8 style overflow areas, where secondary sorting is used for "too big" buckets. Such buckets could use simple binary search. That would be a bigger change, but would eliminate the problem.

cowtowncoder avatar Apr 05 '16 20:04 cowtowncoder