tokenizers
Encoding.pad/truncate return () but could return the Encoding to chain calls
Encoding provides two functions, pad and truncate, that modify the current encoding in place.
Both return (), so you can't chain calls on the Encoding structure, for example:
encoding = tokenizer.encode(...)
encoding.truncate(...).pad(...).ids
If pad/truncate returned Result<Encoding>, we could chain calls.
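To make the proposal concrete, here is a minimal sketch of chainable pad/truncate methods. It uses a toy Encoding struct that models only the ids; the struct, its constructor, and the parameter lists are assumptions for illustration, not the library's actual types or signatures. It also returns &mut Self rather than the Result<Encoding> suggested above, which is just one possible way to enable chaining.

```rust
// Hypothetical stand-in for the library's Encoding; only the ids are modeled.
#[derive(Debug, Clone, PartialEq)]
struct Encoding {
    ids: Vec<u32>,
}

impl Encoding {
    fn new(ids: Vec<u32>) -> Self {
        Encoding { ids }
    }

    /// Keep at most `max_len` ids. Returns `&mut Self` so calls can be chained.
    fn truncate(&mut self, max_len: usize) -> &mut Self {
        self.ids.truncate(max_len);
        self
    }

    /// Right-pad with `pad_id` up to `target_len`; does nothing if the
    /// encoding is already at least that long.
    fn pad(&mut self, target_len: usize, pad_id: u32) -> &mut Self {
        if self.ids.len() < target_len {
            self.ids.resize(target_len, pad_id);
        }
        self
    }
}

fn main() {
    let mut encoding = Encoding::new(vec![101, 7592, 2088, 102]);
    // Both methods return the encoding itself, so the calls chain.
    let ids = encoding.truncate(3).pad(5, 0).ids.clone();
    println!("{:?}", ids); // prints [101, 7592, 2088, 0, 0]
}
```

Returning &mut Self keeps the in-place semantics of the current methods; returning Result would additionally let the methods report errors, at the cost of a ? or unwrap between calls.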
Is there a reason for this design choice, or does my proposal not make sense (which is possible)?
I'd say the main reason is that the main path we imagined for truncation and padding is to use with_truncation and with_padding on the Tokenizer directly. By doing so, the Encoding returned by encode is already truncated and padded, and you don't have to do it individually.
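As a rough illustration of that configure-once path, the sketch below uses toy Tokenizer and Encoding types. Everything in it is an assumption for illustration: the field names, the simplified with_truncation and with_padding parameters, and the toy encode that maps each word to its length. The real library's signatures differ.

```rust
// Toy types for illustration only; not the library's Tokenizer or Encoding.
#[derive(Debug)]
struct Encoding {
    ids: Vec<u32>,
}

#[derive(Default)]
struct Tokenizer {
    max_len: Option<usize>,       // truncation setting, if any
    pad_to: Option<(usize, u32)>, // (target length, pad id), if any
}

impl Tokenizer {
    /// Store a truncation length to apply on every encode (hypothetical signature).
    fn with_truncation(mut self, max_len: usize) -> Self {
        self.max_len = Some(max_len);
        self
    }

    /// Store a padding target and pad id to apply on every encode (hypothetical signature).
    fn with_padding(mut self, target_len: usize, pad_id: u32) -> Self {
        self.pad_to = Some((target_len, pad_id));
        self
    }

    /// Toy encode: each whitespace-separated word becomes its length as an id,
    /// then the stored truncation and padding settings are applied.
    fn encode(&self, text: &str) -> Encoding {
        let mut ids: Vec<u32> = text
            .split_whitespace()
            .map(|word| word.len() as u32)
            .collect();
        if let Some(max_len) = self.max_len {
            ids.truncate(max_len);
        }
        if let Some((target_len, pad_id)) = self.pad_to {
            if ids.len() < target_len {
                ids.resize(target_len, pad_id);
            }
        }
        Encoding { ids }
    }
}

fn main() {
    let tokenizer = Tokenizer::default().with_truncation(4).with_padding(4, 0);
    // The returned encoding is already truncated and padded.
    let encoding = tokenizer.encode("hello world");
    println!("{:?}", encoding.ids); // prints [5, 5, 0, 0]
}
```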
That being said, I don't see any reason that would prevent us from changing this. Feel free to make a PR!
I see! That wasn't clear when I first looked at the API. I'll see if I can make a PR for that, though I need to refresh my Rust first ;)