large-hashable
Fix bug #25 for GHC 9.4 and text-2
@sternenseemann @tvh we should merge this PR. Who has the rights to do so?
Lazy text also uses UTF-8 internally (it's just a list of strict texts). I don't see a way to retain backwards compatibility without introducing a performance overhead, e.g. by converting UTF-8 to UTF-16 before hashing.
I agree there is no way to avoid that overhead. The reason I mentioned lazy text is that it would avoid having to keep the whole text in memory at the same time.
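This is the traversal shape I have in mind, as a minimal sketch: FNV-1a stands in for the library's real hash state, and the UTF-16LE re-encoding is the backwards-compatibility path mentioned above.

```haskell
import Data.Bits (xor)
import Data.Word (Word64)
import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Lazy as TL

-- FNV-1a as a stand-in for the library's incremental hash state.
fnv1a :: Word64 -> B.ByteString -> Word64
fnv1a = B.foldl' (\h b -> (h `xor` fromIntegral b) * 0x100000001b3)

-- Hash a lazy Text chunk by chunk, never forcing it into one strict value.
-- Re-encoding each chunk as UTF-16LE reproduces the pre-text-2 byte stream
-- (chunks never split a code point, so per-chunk encoding is safe); using
-- TE.encodeUtf8 instead gives the fast native path.
hashLazyCompat :: TL.Text -> Word64
hashLazyCompat =
    TL.foldlChunks (\h c -> fnv1a h (TE.encodeUtf16LE c)) 0xcbf29ce484222325
```

Because the hash consumes bytes sequentially with no per-chunk framing, chunk boundaries cannot influence the result.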
Here are some benchmark numbers for a 1 MB text. I report only the mean as measured with criterion.
- Hash native UTF-8 encoding: 1.412 ms
- Hash UTF-16 encoding (for backwards compatibility): 16.53 ms
- Convert text to string, then hash the chars: 19.46 ms
- Directly hash the string: 14.02 ms
So hashing the native UTF-8 encoding is much faster (as expected).
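For reference, the setup was roughly the following; this is a self-contained sketch where `byteSum` stands in for the actual hash function so the example runs as-is.

```haskell
import Criterion.Main
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Cheap stand-in for the real hash function.
byteSum :: B.ByteString -> Int
byteSum = B.foldl' (\acc w -> acc + fromIntegral w) 0

main :: IO ()
main = defaultMain
    [ env (pure (T.replicate 1000000 (T.pack "a"))) $ \t ->
        bgroup "hash 1MB text"
            [ bench "native UTF-8 bytes" $ nf (byteSum . TE.encodeUtf8) t
            , bench "via UTF-16 bytes"   $ nf (byteSum . TE.encodeUtf16LE) t
            , bench "via String"         $ nf (sum . map fromEnum . T.unpack) t
            ]
    ]
```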
@skogsbaer I think you will need to talk to @loisch about whether this change is okay. His assumption is that the hashes coming out of this lib are stable.
For the lazy text, I think I just wanted to make sure that
- We don't convert the text into strict first
- The hash is the same as for strict text (see the property sketch below)
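Something along the lines of this property, again with FNV-1a as a stand-in hash: any chunking of the same text must produce the same hash as the strict version.

```haskell
import Data.Bits (xor)
import Data.Word (Word64)
import Test.QuickCheck
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Lazy as TL

-- Stand-in hash, as before.
fnv1a :: Word64 -> B.ByteString -> Word64
fnv1a = B.foldl' (\h b -> (h `xor` fromIntegral b) * 0x100000001b3)

hashStrict :: T.Text -> Word64
hashStrict = fnv1a 0xcbf29ce484222325 . TE.encodeUtf8

hashLazy :: TL.Text -> Word64
hashLazy = TL.foldlChunks (\h c -> fnv1a h (TE.encodeUtf8 c)) 0xcbf29ce484222325

-- Chunk boundaries must not influence the hash.
prop_lazyMatchesStrict :: String -> Positive Int -> Bool
prop_lazyMatchesStrict s (Positive n) =
    let t = T.pack s
    in hashLazy (TL.fromChunks (T.chunksOf n t)) == hashStrict t

main :: IO ()
main = quickCheck prop_lazyMatchesStrict
```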
@tvh There are only a few places where meqo requires hash stability. My plan is to add a new (optional) method largeHashStable and to rewrite the meqo code to use largeHashStable where needed.
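Roughly this shape; a toy sketch only, since the real class in this library is monadic, and everything here except the name largeHashStable is made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

class LargeHashable' a where
    -- Bytes fed to the hash in the fast, native layout.
    largeHashBytes :: a -> B.ByteString
    -- Optional: bytes in the old layout, for callers that need stable
    -- hashes. The default means only types whose layout changed between
    -- text-1 and text-2 have to override it.
    largeHashStable :: a -> B.ByteString
    largeHashStable = largeHashBytes

instance LargeHashable' T.Text where
    largeHashBytes  = TE.encodeUtf8     -- text-2 native UTF-8
    largeHashStable = TE.encodeUtf16LE  -- pre-text-2 UTF-16 byte layout
```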
@tvh There is now largeHashStable with tests
Great!
Merge?
@skogsbaer I don't have permissions on this repo anymore.