Expose `Encoding` attributes via the buffer protocol interface

Open mariosasko opened this issue 6 months ago • 3 comments

This PR enables access to the underlying buffers of an Encoding object via the buffer protocol interface, allowing for efficient conversion from Rust to Python for types that support that interface (e.g., NumPy, PyTorch, PyArrow).

Based on my benchmarks, this can save more than 20% of the time when tokenizing datasets with longer sequences.
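To illustrate what the buffer protocol provides, here is a minimal sketch using only the standard library. It does not use the API added in this PR: `array.array` is a stand-in for an `Encoding` attribute such as the token ids, and the values are made up. The point is that a consumer gets a view over the producer's memory rather than a copy.

```python
from array import array

# Stand-in for an Encoding's token ids (hypothetical values).
# array.array implements the buffer protocol, like the attributes in this PR would.
ids = array("I", [101, 7592, 2088, 102])

# memoryview is the generic consumer of the buffer protocol: no copy is made here.
view = memoryview(ids)

# The view describes the underlying memory: element format and size.
fmt, itemsize = view.format, view.itemsize

# Writes to the producer are visible through the view, showing shared memory.
ids[1] = 42
shared = view[1] == 42

# Converting to Python objects is an explicit, separate step.
as_list = view.tolist()
```

Libraries such as NumPy consume the same interface (for example via `np.frombuffer`), which is where the saving over building Python lists element by element would come from.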

mariosasko avatar Jun 04 '25 00:06 mariosasko

Hey, thanks for the PR. However, as you may have noticed, you removed the `abi3-py38` flag, which makes this code non-portable.

Buffers were only stabilized (added to the stable ABI) in Python 3.11 (https://docs.python.org/3.11/c-api/buffer.html#bufferobjects), so we will most likely have to wait for older versions to reach end of life before we can get this rolling: https://devguide.python.org/versions/

I have tried in safetensors to get something sound using feature flags to use the buffers only on those versions, but honestly it's super messy to distribute various ABIs, keep the code clean, and still provide those features.

If anyone has suggestions on how to get the best of all worlds, we're all ears.

Narsil avatar Jun 16 '25 13:06 Narsil

Hi! This API is slightly advanced, so I guess it can wait 🙂

> I have tried in safetensors to get something sound using feature flags to use the buffers only on those versions, but honestly it's super messy to distribute various ABIs, keep the code clean, and still provide those features.

For instance, pyca/cryptography uses feature flags to support the buffer interface, but this does add some complexity, so it's probably not worth it.

mariosasko avatar Jun 17 '25 15:06 mariosasko

I checked cryptography, and it doesn't seem like they are using the abi3 features: https://github.com/pyca/cryptography/blob/fe5ba4dafaf927be60066e7b6b4763524934faf3/src/rust/src/buf.rs#L31

https://github.com/pyca/cryptography/blob/fe5ba4dafaf927be60066e7b6b4763524934faf3/src/rust/Cargo.toml#L32-L34

That's where I ran into issues when I did something similar in safetensors. The problem is that there is no nice way to keep a simple build system (`pip install -e .`, for instance) that detects the current Python version from the CLI while still keeping distributed builds relatively simple.
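To make the trade-off concrete, here is a rough sketch of the kind of version detection a local build would need. It is not what safetensors, cryptography, or tokenizers actually do. `pick_pyo3_features` is a hypothetical helper and `buffer-protocol` is a hypothetical crate feature; `abi3-py38` and `abi3-py311` are PyO3 feature names.

```python
import sys


def pick_pyo3_features(version=None):
    """Hypothetical helper: choose Cargo features for the running interpreter."""
    # Default to the interpreter that is running the build.
    major, minor = (version or sys.version_info)[:2]
    if (major, minor) >= (3, 11):
        # The buffer protocol is in the stable ABI from 3.11 on, so an abi3 wheel
        # can expose it. "buffer-protocol" is an assumed crate feature name.
        return ["abi3-py311", "buffer-protocol"]
    # Older interpreters: keep the portable abi3 build without buffer support.
    return ["abi3-py38"]


features = pick_pyo3_features()
```

The logic itself is short. The messy part is that every released wheel then has to be built and tagged once per feature set, which is the distribution complexity described above.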

Narsil avatar Jun 18 '25 08:06 Narsil