GH-46677: [C++] Expose a BinaryViewBuilder interface for appending a binary buffer and multiple subslices
Rationale for this change
See https://github.com/apache/arrow/issues/46677.
What changes are included in this PR?
See https://github.com/apache/arrow/issues/46677.
Are these changes tested?
Yes
Are there any user-facing changes?
No
- GitHub Issue: #46677
@mapleFU Is the currently implemented interface what you expected?
Should we rename AppendBuffer or redesign the interface's semantics? Currently we don't really append a buffer/block: if the last block still has enough remaining space, we append directly into it. I think this may introduce confusion.
Maybe aligning with arrow-rs's implementation is fine.
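To illustrate the semantics in question, here is a minimal self-contained sketch; all names and the block size are simplified stand-ins, not the actual Arrow code:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Simplified stand-in for the builder's heap; not the real implementation.
struct HeapSketch {
  static constexpr int64_t kBlockSize = 32 * 1024;
  std::vector<std::vector<uint8_t>> blocks;
  int64_t remaining = 0;  // free bytes at the tail of the last block

  // Despite the name, this only starts a new block when the last one is
  // too full; otherwise the bytes land in the existing last block.
  std::string_view AppendBuffer(const uint8_t* data, int64_t length) {
    if (length > remaining) {
      blocks.emplace_back();
      const int64_t capacity = std::max(kBlockSize, length);
      blocks.back().reserve(capacity);  // fixed capacity: no reallocation
      remaining = capacity;
    }
    auto& block = blocks.back();
    const size_t offset = block.size();
    block.insert(block.end(), data, data + length);  // stays within capacity
    remaining -= length;
    return {reinterpret_cast<const char*>(block.data()) + offset,
            static_cast<size_t>(length)};
  }
};
```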
Some personal thoughts:
- AppendBuffer (and the related methods) returning a StringView is a bit weird; a block is not a StringView
- Aligning with arrow-rs is also a good way to go
Not sure why these 2 CI jobs keep failing.
To implement an interface aligned with what arrow-rs did, I had to change some behavior of StringHeapBuilder::FinishLastBlock() and make it public.
More specifically, FinishLastBlock() previously just resized the last block. Now it also resets internal state, including current_offset_, current_out_buffer_ and current_remaining_bytes_. I believe this is safer and more reasonable.
The interface was private before, so external usage is not a concern, and the existing internal callers of FinishLastBlock() always reset or change all of this state anyway.
If this change is unacceptable, please let me know and I'll try to find another way.
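To make the change concrete, a minimal sketch assuming a simplified builder (the real StringHeapBuilder differs in details):

```cpp
#include <cstdint>
#include <vector>

// Simplified builder used only to illustrate the change; not the real class.
struct BuilderSketch {
  std::vector<std::vector<uint8_t>> blocks_;
  int64_t current_offset_ = 0;           // bytes used in the last block
  uint8_t* current_out_buffer_ = nullptr;
  int64_t current_remaining_bytes_ = 0;

  void FinishLastBlock() {
    if (!blocks_.empty()) {
      // Previous behavior: only trim the last block to its used size.
      blocks_.back().resize(current_offset_);
    }
    // Behavior added by this PR: also reset the write cursor, so a later
    // append can never write into a block that was already handed out.
    current_offset_ = 0;
    current_out_buffer_ = nullptr;
    current_remaining_bytes_ = 0;
  }
};
```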
Gentle ping @pitrou
Isn't this approach wasteful? If you have lots of strings <= 12 bytes, you will still store their contents in a data buffer, while they're inlined in the string views.
> Isn't this approach wasteful? If you have lots of strings <= 12 bytes, you will still store their contents in a data buffer, while they're inlined in the string views.
@pitrou I suppose this would be used to append a whole Parquet page and add a buffer for it.
Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.
> Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.
A nice question. I agree that when most Parquet strings are <= 12 bytes, memory would be wasted because a huge memcpy is applied. But reading large binary data would benefit a lot from this. I think a single large memcpy is usually much faster than many small non-contiguous memcpys.
Maybe we could also pick this path only when the average length is large enough?
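To make the two strategies concrete, a small sketch; the helpers and the offsets layout (n + 1 entries for n values) are hypothetical, not Arrow APIs:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// One large contiguous copy: views would then point into `out`.
void CopyWholePage(const uint8_t* page, int64_t page_size, uint8_t* out) {
  std::memcpy(out, page, page_size);
}

// Many small copies, one per value; each may touch a different cache line
// and the loop itself adds per-value overhead.
void CopyPerValue(const uint8_t* page, const std::vector<int64_t>& offsets,
                  uint8_t* out) {
  for (size_t i = 0; i + 1 < offsets.size(); ++i) {
    const int64_t begin = offsets[i];
    const int64_t length = offsets[i + 1] - begin;
    std::memcpy(out + begin, page + begin, length);
  }
}
```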
> A nice question. I agree that when most Parquet strings are <= 12 bytes, memory would be wasted because a huge memcpy is applied. But reading large binary data would benefit a lot from this. I think a single large memcpy is usually much faster than many small non-contiguous memcpys.
That's true, but another cost is creating the views themselves. It would be nice if a prototype could tell us what speedup we can expect.
> Maybe we could also pick this path only when the average length is large enough?
Yes, that's definitely a possibility.
So can we start to review this? We could set a threshold, e.g. take this path when the average length is > 12 or > 20.
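For instance, a hypothetical heuristic along these lines (the exact cut-off would need tuning):

```cpp
#include <cstdint>

// Views inline up to 12 bytes directly, so buffer sharing only pays off
// when values are on average longer than that. The cut-off is a guess.
constexpr int64_t kAvgLenThreshold = 20;

bool ShouldShareBuffer(int64_t total_data_bytes, int64_t num_values) {
  if (num_values == 0) return false;
  return total_data_bytes / num_values > kAvgLenThreshold;
}
```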
@mapleFU @pitrou I believe this pull request is related to several other PRs I've submitted. Here's a summary:
1- API and Handling of the Last Buffer: In this pull request, I demonstrated that it’s possible to share buffers without copying or finalizing the last buffer. This avoids relocating the buffer to remove blank space, which can be a costly operation when the unused space exceeds 64 bytes.
2-
> Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.

In this pull request, I proposed a method that could help avoid memory bloat when buffers are shared. Additionally, in this issue, I think this metadata could help determine when CompactArray should be called.
Overall, my suggestion is to either modify this pull request or create a new API to support buffer sharing. It is possible to decide whether a created array should be compacted based on some metadata, in order to avoid memory bloat.
It looks like (1) would work with a buffer-style API, but the Parquet reader might append buffers one by one.
(2) is a good way for compute, but here I don't know the best way to handle it: whether to read adaptively, or just hand it off to "cast" or "compact". I prefer handling this in the reader; later stages can "compact" the data when outputting it or handing it to compute.
Besides the Parquet-related issue, there is NO buffer sharing mechanism in the current API:
- for large strings, sharing through this interface will save a lot of memory
- for inlined strings, it costs as much as a direct Append (see the layout sketch below)
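The 16-byte view layout explains why; a sketch of that layout, with illustrative field names:

```cpp
#include <cstdint>

// Sketch of the 16-byte binary-view layout. Values of <= 12 bytes are
// stored entirely inline, so a shared data buffer gains nothing for them;
// longer values keep only a 4-byte prefix plus a (buffer_index, offset)
// reference into a data buffer, so sharing that buffer avoids copying
// their bytes altogether.
union BinaryViewSketch {
  struct {
    int32_t size;      // <= 12: bytes live inline in `data`
    char data[12];
  } inlined;
  struct {
    int32_t size;      // > 12: bytes live in a shared data buffer
    char prefix[4];    // first 4 bytes, for fast comparisons
    int32_t buffer_index;
    int32_t offset;
  } ref;
};
static_assert(sizeof(BinaryViewSketch) == 16, "views are 16 bytes");
```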
In my view, these new interfaces won't introduce any extra cost for the current interfaces. I understand the Parquet tradeoff discussed above, but I believe this kind of buffer sharing interface is missing from the current implementation, and there is definitely demand for this API beyond Parquet.
Maybe we can open another issue/PR to discuss specifically whether/how Parquet should use this API for appending a whole page and then appending views from it?
> Maybe we can open another issue/PR to discuss specifically whether/how Parquet should use this API for appending a whole page and then appending views from it?
That sounds fair to me.
> 1- API and Handling of the Last Buffer: In this pull request, I demonstrated that it’s possible to share buffers without copying or finalizing the last buffer. This avoids relocating the buffer to remove blank space, which can be a costly operation when the unused space exceeds 64 bytes.
>
> 2-
> > Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.
>
> In this pull request, I proposed a method that could help avoid memory bloat when buffers are shared. Additionally, in this issue, I think this metadata could help determine when CompactArray should be called.
Thanks for the reminder, and sorry that this is taking a long time :) I propose that we review these PRs one by one. I've started with the CompactArray one and, once that is done, I'd like to move on to the AppendArraySlice improvement.
This PR is slightly more contentious, so I think we should tackle it only after the semantics of the other APIs have settled.