string-interner
Fallible APIs for StringInterner for #70
Hi @Robbepop!
This is my take on issue #70.
This pull request is NOT READY to merge. I haven't added any tests nor did I check whether new code breaks anything in the existing test suites.
Let's iterate on it here instead.
First, I haven't found any issues adding a fallible API to the Backend implementations (except for the capacity issue, below).
The most problematic change involves the dedup HashMap:
- to implement `try_reserve` I needed to directly access `HashTable`, thus a `raw` feature needed to be enabled for `hashbrown`
- there's a substantial addition here: https://github.com/royaltm/string-interner/blob/e8a7616160f8834f53927d72532aab12ec27bf2a/src/interner.rs#L163 which might beg for a bigger refactor of this function
I'm also a bit concerned about the sanity of the `with_capacity` and the new `try_reserve` APIs, since the implementation relies on guesswork: the average word size was estimated via Google, and like everything "on average" it will be of little use in most particular cases.
What if instead we gave the user a way to estimate the "average" string size? Or perhaps even a separate additional capacity for the string data.
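A hypothetical shape of such a hint (names invented for illustration, not part of the crate): the caller supplies both the expected number of strings and their expected total byte length, so the interner never has to guess an average word size.

```rust
// Hypothetical capacity hint (names invented, not part of the crate).
struct CapacityHint {
    symbols: usize,   // expected number of strings to intern
    str_bytes: usize, // expected total byte length of those strings
}

/// Worst-case byte capacity a buffer backend would need for `hint`:
/// the string bytes themselves plus a var7 length marker per entry
/// (at most 10 marker bytes each on 64-bit targets).
fn buffer_capacity(hint: &CapacityHint) -> usize {
    hint.str_bytes + hint.symbols * 10
}
```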
@royaltm thank you for the PR!
I think this PR generally goes into the right direction.
The RawMap API should have been implemented and used by the StringInterner long ago, but I never got to it. Maybe it would be nice to separate this change into another PR and base this PR on it. The idea is to no longer use hashbrown's RawEntry API and instead use its RawMap API for everything.
CI is currently unhappy with docs and formatting. Also a memory consumption test is broken because the number of expected allocations decreased. Can you explain why?
I am a bit skeptical about the usefulness of the reserve API for StringInterner in general after seeing the mess that it introduces. The reserve API was once useful when we still had the SimpleBackend. However, the SimpleBackend was removed because it was too inefficient and thus served no real purpose. Without it the reserve API depends too much on heuristics; it should probably be removed, or redesigned to be based not on the expected number of strings to intern but on the expected total bytes of the strings to be interned, though even then it wouldn't be ideal imo.
The different number of allocations probably has something to do with my changes to the buffer backend, where I introduced the calculate_var7_size function so that I can reserve space for the next entry, including its var7 size marker, ahead of pushing to the backend buffer.
Now the buffer backend is resized at most once, before encode_var_usize is called and the string itself is pushed.
Previously the var7 size marker was pushed separately from the string itself, and each operation could possibly reallocate the backend buffer. But that's just my guess.
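To make the single-reservation idea concrete, here is a minimal sketch (the PR's actual `calculate_var7_size` and `encode_var_usize` may differ; this uses plain LEB128-style var7 encoding for illustration):

```rust
/// Number of bytes needed to encode `value` in a 7-bit var-int
/// ("var7") format: one byte per 7 significant bits, at least one byte.
/// (Sketch only; the PR's actual function may differ.)
fn calculate_var7_size(value: usize) -> usize {
    let bits = usize::BITS - value.leading_zeros();
    core::cmp::max(1, ((bits + 6) / 7) as usize)
}

/// Push the length marker and the string with a single reservation,
/// so at most one reallocation can happen per interned string.
fn push_entry(buffer: &mut Vec<u8>, s: &str) {
    let len = s.len();
    buffer.reserve(calculate_var7_size(len) + len);
    // Inlined var7 encoding of `len` (continuation bit in 0x80):
    let mut v = len;
    loop {
        let byte = (v & 0x7F) as u8;
        v >>= 7;
        if v == 0 {
            buffer.push(byte);
            break;
        }
        buffer.push(byte | 0x80);
    }
    buffer.extend_from_slice(s.as_bytes());
}
```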
As for the RawTable API, I'm not quite sure what advantage it would have over the current implementation. Perhaps just fewer intermediary function calls?
I have read somewhere that the RawEntry API of the hashbrown crate was to be phased out in favor of the newer and more versatile RawMap type. Since you are now making use of the raw crate feature I thought it would make sense to go all in on the change. Not necessary though. We can go ahead without it.
I suppose moving to RawMap interface should be only an internal change. I propose we can tackle this issue after the one at hand.
Regarding the try_reserve and with_capacity interface, I agree that its usefulness is pretty questionable. One thing is that it's really hard to make predictions ahead of time, especially with the BucketBackend; what makes the problem even worse is that we can't shrink_to_fit the head String due to the known problems with reallocation.
Perhaps I'll remove the try_reserve calls from the API or make them internal where they are still required, and we'll see where that gets us.
The try_reserve probably does not make sense for the Backend to implement since it is too backend specific. At the moment no backend has a proper and well defined way to do it. For the StringInterner itself try_reserve could still be useful for its hashmap to deduplicate strings, however, it is probably not much help in practice.
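For the dedup map alone, a fallible reserve could be a thin wrapper over the hash map's own `try_reserve`. A sketch with simplified, hypothetical types (the crate's real dedup map stores more than a plain hash-to-symbol mapping):

```rust
use std::collections::{HashMap, TryReserveError};

// Simplified stand-in for the interner; field and type are hypothetical.
struct Interner {
    dedup: HashMap<u64, u32>, // hash -> symbol (greatly simplified)
}

impl Interner {
    /// Try to reserve space for `additional` more entries in the
    /// deduplication map, propagating allocation failure to the caller
    /// instead of aborting on out-of-memory.
    fn try_reserve(&mut self, additional: usize) -> Result<(), TryReserveError> {
        self.dedup.try_reserve(additional)
    }
}
```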
Hi @Robbepop!
Sorry for the delay; I've had a whirlwind of obligations to fulfill in the past weeks, and I'd rather not contribute at all than contribute haphazardly during my lunch break, as "good" and "quickly" seldom meet.
- I removed `Backend::try_reserve` from the trait and backend implementations,
- I left `StringInterner::try_reserve` with a doc note about what it does exactly,
- fixed docs issues,
- added a mention of `try_get_or_intern` in the root doc of the `StringInterner`,
- testing of `calculate_var7_size` together with each `encode_var_usize`,
- updated the number of (de)allocations in `buffer_backend::test_memory_consumption`.
If this meets your criteria, I don't think there's anything left to do within the scope of this PR.
Hi @Robbepop,
I've fixed the formatting errors; I'm not sure, however, what I can do to fix the tarpaulin issue. Is there anything else on my end that needs to be done?
I just ran benchmarks for this branch locally and compared them to the current main branch.
Unfortunately I saw quite some big performance regressions with this PR:
- `get_or_intern/fill-empty/new/`: ~10% regression
- `get_or_intern/fill-empty/with_capacity/`: ~20-30% regression
- `get_or_intern/already-filled/`: ~5% regression
- `get_or_intern_static`: `BucketBackend` improved by 5%, other backends regressed by 10%
All in all this looks good. The benchmark regressions need to be fixed before we can merge though.
I'm beginning to think that without stabilization of push_within_capacity we can't do much to improve the backends' fill performance, as we ended up with an algorithm that checks the vector's capacity multiple times for a single push operation. Please see the comment above https://github.com/Robbepop/string-interner/pull/71#discussion_r1648808767 for more on this.
If I'm not mistaken, another area of improvement might be the raw hash-map department: with the latest changes I have also introduced redundant capacity checks in the try_get_or_intern_using function.
Also, while pondering the var-int buffer encoding issue: if we are going to address the "two-phase encoding step" and decide to try the single-step array encoding approach, I'd also suggest changing the var-int algorithm to a more efficient one:
I've created a gist with more info on the topic and playground examples: https://gist.github.com/royaltm/f31ce5b1c8878e7c10ff7d335f3826e2
Please feel free to check it out at your leisure.
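For clarity, the single-step array encoding idea can be sketched like this (standard LEB128-style var7 shown here, not the adjusted algorithm from the gist): encode the var-int into a small stack buffer first, then append it with a single `extend_from_slice`, so the vector's capacity is checked once per marker instead of once per byte.

```rust
/// Encode `value` as a var7 int via a stack scratch buffer and append
/// it to `buffer` in one call; returns the number of bytes written.
fn encode_var_usize_array(buffer: &mut Vec<u8>, mut value: usize) -> usize {
    let mut scratch = [0u8; 10]; // a 64-bit usize needs at most 10 var7 bytes
    let mut len = 0;
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            scratch[len] = byte;
            len += 1;
            break;
        }
        scratch[len] = byte | 0x80; // set the continuation bit
        len += 1;
    }
    // Single capacity check / possible reallocation happens here:
    buffer.extend_from_slice(&scratch[..len]);
    len
}
```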
Perhaps a better approach is to use https://doc.rust-lang.org/alloc/vec/struct.Vec.html#method.spare_capacity_mut.
Something along the lines of: https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=bd41840d6595d2787c2f60b86a67e84b
Although a lot of additional unsafety is introduced, the generated assembly looks neat:
There's a very tight loop at: .LBB10_5 and a single capacity pre-check that branches to .LBB10_2 to make more room.
playground::encode_var_usize:
push r14
push rbx
push rax
mov rbx, rsi
lea rdx, [rsi + 64]
mov rax, qword ptr [rdi]
mov rsi, qword ptr [rdi + 16]
sub rax, rsi
cmp rax, rdx
jb .LBB10_2
mov rcx, qword ptr [rdi + 8]
add rcx, rsi
xor eax, eax
cmp rbx, 128
jb .LBB10_6
.LBB10_4:
mov rdx, rbx
.LBB10_5:
mov r8d, edx
or r8b, -128
mov byte ptr [rcx + rax], r8b
inc rax
add rdx, -128
mov rbx, rdx
shr rbx, 7
cmp rdx, 16383
mov rdx, rbx
ja .LBB10_5
.LBB10_6:
mov byte ptr [rcx + rax], bl
lea rcx, [rsi + rax]
inc rcx
inc rax
mov qword ptr [rdi + 16], rcx
add rsp, 8
pop rbx
pop r14
ret
.LBB10_2:
mov r14, rdi
call alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
mov rdi, r14
mov rsi, qword ptr [r14 + 16]
mov rcx, qword ptr [r14 + 8]
add rcx, rsi
xor eax, eax
cmp rbx, 128
jae .LBB10_4
jmp .LBB10_6
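A minimal sketch of the `spare_capacity_mut` approach, simplified from the playground snippet linked above (it uses standard LEB128-style var7 rather than the adjusted algorithm, and a fixed worst-case reserve): one `reserve` up front is the only capacity check, bytes are written into uninitialized spare capacity, and a single `set_len` commits them.

```rust
use core::mem::MaybeUninit;

/// Append the var7 encoding of `value` to `buffer` using spare
/// capacity; returns the number of bytes written.
fn encode_var_usize(buffer: &mut Vec<u8>, mut value: usize) -> usize {
    const MAX_LEN: usize = 10; // a 64-bit usize needs at most 10 var7 bytes
    buffer.reserve(MAX_LEN); // the only capacity check / branch
    let start = buffer.len();
    let spare: &mut [MaybeUninit<u8>] = buffer.spare_capacity_mut();
    let mut len = 0;
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            spare[len].write(byte);
            len += 1;
            break;
        }
        spare[len].write(byte | 0x80); // set the continuation bit
        len += 1;
    }
    // SAFETY: the first `len` spare bytes were just initialized above,
    // and `start + len <= capacity` is guaranteed by the `reserve`.
    unsafe { buffer.set_len(start + len) };
    len
}
```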