GH-46677: [C++] Expose a BinaryViewBuilder interface for appending a binary buffer and multiple subslices
Rationale for this change
See https://github.com/apache/arrow/issues/46677.
What changes are included in this PR?
See https://github.com/apache/arrow/issues/46677.
Are these changes tested?
Yes
Are there any user-facing changes?
No
- GitHub Issue: #46677
@mapleFU Is the currently implemented interface what you expected?
Should we rename AppendBuffer or redesign the interface's semantics? Currently we don't really append a buffer/block: if the last block still has enough remaining space, we append directly into it. I think this may introduce confusion.
Maybe aligning with arrow-rs's implementation is fine.
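To illustrate the semantics in question, here is a minimal self-contained sketch; all names and the block size are simplified stand-ins, not the actual Arrow code:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Simplified stand-in for the builder's heap; not the real implementation.
struct HeapSketch {
  static constexpr int64_t kBlockSize = 32 * 1024;
  std::vector<std::vector<uint8_t>> blocks;
  int64_t remaining = 0;  // free bytes at the tail of the last block

  // Despite the name, this only starts a new block when the last one is
  // too full; otherwise the bytes land in the existing last block.
  std::string_view AppendBuffer(const uint8_t* data, int64_t length) {
    if (length > remaining) {
      blocks.emplace_back();
      const int64_t capacity = std::max(kBlockSize, length);
      blocks.back().reserve(capacity);  // fixed capacity: no reallocation
      remaining = capacity;
    }
    auto& block = blocks.back();
    const size_t offset = block.size();
    block.insert(block.end(), data, data + length);  // stays within capacity
    remaining -= length;
    return {reinterpret_cast<const char*>(block.data()) + offset,
            static_cast<size_t>(length)};
  }
};
```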
Some personal thoughts:
- AppendBuffer (and the related methods) returning a StringView is a bit weird; a block is not a StringView
- Aligning with arrow-rs is also a good way to go
Not sure why these 2 CI jobs keep failing.
To implement an interface aligned with what arrow-rs did, I had to change some behavior of StringHeapBuilder::FinishLastBlock() and make it public.
More specifically, FinishLastBlock() previously just resized the last block. Now it also resets internal state, including current_offset_, current_out_buffer_ and current_remaining_bytes_. I believe this is safer and more reasonable.
The interface was private before, so external usage is not a concern, and the existing internal callers of FinishLastBlock() always reset or change all of this state anyway.
If this change is unacceptable, please let me know and I'll try to find another way.
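To make the change concrete, a minimal sketch assuming a simplified builder (the real StringHeapBuilder differs in details):

```cpp
#include <cstdint>
#include <vector>

// Simplified builder used only to illustrate the change; not the real class.
struct BuilderSketch {
  std::vector<std::vector<uint8_t>> blocks_;
  int64_t current_offset_ = 0;           // bytes used in the last block
  uint8_t* current_out_buffer_ = nullptr;
  int64_t current_remaining_bytes_ = 0;

  void FinishLastBlock() {
    if (!blocks_.empty()) {
      // Previous behavior: only trim the last block to its used size.
      blocks_.back().resize(current_offset_);
    }
    // Behavior added by this PR: also reset the write cursor, so a later
    // append can never write into a block that was already handed out.
    current_offset_ = 0;
    current_out_buffer_ = nullptr;
    current_remaining_bytes_ = 0;
  }
};
```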
Gentle ping @pitrou
Isn't this approach wasteful? If you have lots of strings <= 12 bytes, you will still store their contents in a data buffer, while they're inlined in the string views.
> Isn't this approach wasteful? If you have lots of strings <= 12 bytes, you will still store their contents in a data buffer, while they're inlined in the string views.
@pitrou I suppose this would be used to append a whole Parquet page and add a buffer for it.
Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.
> Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.
A nice question. I agree that when most Parquet strings are <= 12 bytes, memory would be wasted because a huge memcpy is applied. But reading large binary data would benefit a lot from this. I think a single large memcpy is usually much faster than many small non-contiguous memcpys.
Maybe we could also pick this path only when the average length is large enough?
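To make the two strategies concrete, a small sketch; the helpers and the offsets layout (n + 1 entries for n values) are hypothetical, not Arrow APIs:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// One large contiguous copy: views would then point into `out`.
void CopyWholePage(const uint8_t* page, int64_t page_size, uint8_t* out) {
  std::memcpy(out, page, page_size);
}

// Many small copies, one per value; each may touch a different cache line
// and the loop itself adds per-value overhead.
void CopyPerValue(const uint8_t* page, const std::vector<int64_t>& offsets,
                  uint8_t* out) {
  for (size_t i = 0; i + 1 < offsets.size(); ++i) {
    const int64_t begin = offsets[i];
    const int64_t length = offsets[i + 1] - begin;
    std::memcpy(out + begin, page + begin, length);
  }
}
```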
> A nice question. I agree that when most Parquet strings are <= 12 bytes, memory would be wasted because a huge memcpy is applied. But reading large binary data would benefit a lot from this. I think a single large memcpy is usually much faster than many small non-contiguous memcpys.
That's true, but another cost is creating the views themselves. It would be nice if a prototype could tell us what speedup we can expect.
> Maybe we could also pick this path only when the average length is large enough?
Yes, that's definitely a possibility.
So can we start to review this? We could set a threshold, e.g. take this path when the average length is > 12 or > 20.
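For instance, a hypothetical heuristic along these lines (the exact cut-off would need tuning):

```cpp
#include <cstdint>

// Views inline up to 12 bytes directly, so buffer sharing only pays off
// when values are on average longer than that. The cut-off is a guess.
constexpr int64_t kAvgLenThreshold = 20;

bool ShouldShareBuffer(int64_t total_data_bytes, int64_t num_values) {
  if (num_values == 0) return false;
  return total_data_bytes / num_values > kAvgLenThreshold;
}
```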
@mapleFU @pitrou I believe this pull request is related to several other PRs I've submitted. Here's a summary:
1- API and Handling of the Last Buffer: In this pull request, I demonstrated that it’s possible to share buffers without copying or finalizing the last buffer. This avoids relocating the buffer to remove blank space, which can be a costly operation when the unused space exceeds 64 bytes.
2-
> Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.

In this pull request, I proposed a method that could help avoid memory bloat when buffers are shared. Additionally, in this issue, I think this metadata could help determine when CompactArray should be called.
Overall, my suggestion is to either modify this pull request or create a new API to support buffer sharing. It is possible to decide whether a created array should be compacted based on some metadata, in order to avoid memory bloat.
It looks like (1) would work with a buffer-style API, but the Parquet reader might append buffers one by one.
(2) is a good way for compute, but here I don't know the best way to handle it: whether to read adaptively, or just hand it off to "cast" or "compact". I prefer handling this in the reader; later stages can "compact" the data when outputting it or handing it to compute.
Besides the Parquet-related issue, there is NO buffer sharing mechanism in the current API:
- for large strings, sharing through this interface will save a lot of memory
- for inlined strings, it costs as much as a direct Append (see the layout sketch below)
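The 16-byte view layout explains why; a sketch of that layout, with illustrative field names:

```cpp
#include <cstdint>

// Sketch of the 16-byte binary-view layout. Values of <= 12 bytes are
// stored entirely inline, so a shared data buffer gains nothing for them;
// longer values keep only a 4-byte prefix plus a (buffer_index, offset)
// reference into a data buffer, so sharing that buffer avoids copying
// their bytes altogether.
union BinaryViewSketch {
  struct {
    int32_t size;      // <= 12: bytes live inline in `data`
    char data[12];
  } inlined;
  struct {
    int32_t size;      // > 12: bytes live in a shared data buffer
    char prefix[4];    // first 4 bytes, for fast comparisons
    int32_t buffer_index;
    int32_t offset;
  } ref;
};
static_assert(sizeof(BinaryViewSketch) == 16, "views are 16 bytes");
```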
In my view, these new interfaces won't introduce any extra cost for the current interfaces. I understand the Parquet tradeoff discussed above, but I believe this kind of buffer sharing interface is missing from the current implementation, and there is definitely demand for this API beyond Parquet.
Maybe we can open another issue/PR to discuss specifically whether/how Parquet should use this API for appending a whole page and then appending views from it?
> Maybe we can open another issue/PR to discuss specifically whether/how Parquet should use this API for appending a whole page and then appending views from it?
That sounds fair to me.
> 1- API and Handling of the Last Buffer: In this pull request, I demonstrated that it’s possible to share buffers without copying or finalizing the last buffer. This avoids relocating the buffer to remove blank space, which can be a costly operation when the unused space exceeds 64 bytes.
>
> 2-
> > Is it a win, though? If most Parquet strings are <= 12 bytes we would pointlessly waste space and CPU time.
>
> In this pull request, I proposed a method that could help avoid memory bloat when buffers are shared. Additionally, in this issue, I think this metadata could help determine when CompactArray should be called.
Thanks for the reminder, and sorry that this is taking a long time :) I propose that we review these PRs one by one. I've started with the CompactArray one and, once that is done, I'd like to move on to the AppendArraySlice improvement.
This PR is slightly more contentious, so I think we should tackle it only after the semantics of the other APIs have settled.