emacs-module-rs

Enable conversion for Vec<u8>

Open cireu opened this issue 5 years ago • 8 comments

A "String" in the Elisp VM can represent a byte slice (Vec<u8>), not only a regular string. We could enable this conversion for convenience. We could then also deprecate the utf-8-validation flag, because users should use Vec<u8> to access a byte slice, and a string should always be valid utf-8.

cireu, Apr 02 '20 03:04

Do you have a particular use case in mind?

Emacs guarantees to give dynamic modules utf-8 encoded byte sequences. utf-8-validation is mainly useful when you think you've encountered an Emacs bug. You can call as_bytes() to get a Rust string's raw bytes.
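As a minimal sketch of the as_bytes() point, using only the Rust standard library (the helper names `utf8_bytes` and `owned_bytes` are made up for illustration and are not part of emacs-module-rs):

```rust
/// Borrow the UTF-8 bytes of a Rust string without copying.
/// (Hypothetical helper; it only wraps `str::as_bytes`.)
fn utf8_bytes(s: &str) -> &[u8] {
    s.as_bytes()
}

/// Take ownership of the underlying buffer; again no copy is made.
/// (Hypothetical helper; it only wraps `String::into_bytes`.)
fn owned_bytes(s: String) -> Vec<u8> {
    s.into_bytes()
}
```

For "你好", both helpers yield the six bytes E4 BD A0 E5 A5 BD, the same utf-8 encoding that Emacs prints in octal as "\344\275\240\345\245\275".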

Vec<u8>'s Lisp equivalents that dynamic modules can access are lists and vectors, not strings. Wholesale conversions for these are wasteful and inefficient, so I'd prefer modules to be explicit about it, in the spirit of "zero-cost abstractions".

ubolonton, Apr 02 '20 09:04

Emacs doesn't guarantee that a "Lisp String" is a utf-8 string. A string can also be used to represent a byte slice.

(encode-coding-string "你好" 'binary) ;; => "\344\275\240\345\245\275"
(stringp (encode-coding-string "你好" 'binary)) ;; => t

So it's better to separate a Bytes type from String.

cireu, Apr 03 '20 05:04

And here's an example that Rust will fail to decode.

(seq-into [128 46 46 4194303] 'string)
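On the Rust side, the failure can be sketched with the standard library alone. The byte values used with this helper are a hypothetical stand-in for "bytes that are not valid utf-8"; they are not necessarily what Emacs would hand a module for the string above, and `decode_utf8` is an illustrative name, not an emacs-module-rs function.

```rust
use std::string::FromUtf8Error;

/// Validate a byte buffer as utf-8 and turn it into a String.
/// Returns an error (and gives the buffer back inside it) when
/// the bytes are not valid utf-8.
fn decode_utf8(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}
```

Calling `decode_utf8(vec![0x80, b'.', b'.', 0xFF])` returns `Err`, because neither 0x80 nor 0xFF can start a utf-8 sequence; the original buffer can be recovered from the error with `into_bytes()`.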

cireu, Apr 03 '20 05:04

Some data serialization formats, such as Msgpack, use binary data. Traditionally this binary data is represented as a "Lisp String" (used as a byte slice); there's bindat.el to handle it. If we lack support for this type, we have to disable utf-8-validation and use into_bytes to extract the raw bytes. This is unsafe because we have to manually ensure that each Rust String is used properly.
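To illustrate why binary payloads do not fit in a Rust String, here is a small standard-library sketch. The `Bytes` newtype and the `survives_string_round_trip` helper are hypothetical names for illustration only; they are not an existing emacs-module-rs API.

```rust
/// Hypothetical newtype for raw binary data, kept apart from `String`
/// so that no utf-8 invariant is ever claimed for it.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Bytes(Vec<u8>);

/// Check whether a payload comes back unchanged after being forced
/// through a (lossy) utf-8 string conversion. Invalid sequences are
/// replaced with U+FFFD, so arbitrary binary data usually does not.
fn survives_string_round_trip(payload: &[u8]) -> bool {
    String::from_utf8_lossy(payload).as_bytes() == payload
}
```

For instance, the three bytes 92 01 FF (in MessagePack, a two-element array holding 1 and -1) do not survive the round trip, while plain ASCII such as b"hello" does. Keeping such data in a `Bytes`/`Vec<u8>` value avoids both the corruption and the need for unchecked conversions.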

cireu, Apr 03 '20 05:04

I'm aware that a Lisp string is not necessarily utf-8. It can be either a "unibyte" string (byte sequence) or a "multibyte" string (internally encoded in a superset of utf-8).

What I meant was, emacs-module.h and the documentation both say that copy_string_contents always returns a utf-8 byte sequence. I assumed Emacs would signal an error if the dynamic module calls copy_string_contents on a Lisp string that cannot be encoded in utf-8.

However, I did a quick test, and that assumption was wrong, so either it's an Emacs bug, or the documentation is wrong. I hope it's the latter (otherwise dynamic modules don't have access to raw bytes, by design). I'll have to investigate further.

ubolonton, Apr 03 '20 15:04

Eli's answer: if a string contains raw bytes, calling copy_string_contents will signal an error.

https://lists.gnu.org/archive/html/emacs-devel/2020-10/msg00380.html

cireu, Oct 08 '20 10:10

That means the assumption of the Rust binding was correct, but Emacs has (or used to have, I haven't checked this recently) a bug where copy_string_contents doesn't signal an error when the string contains raw bytes.

ubolonton, Nov 08 '20 13:11

Since version 28, Emacs exposes a make_unibyte_string function. Unibyte strings are roughly arrays of bytes and would correspond well to Vec<u8> or perhaps better to &mut [u8]. Could you add support for unibyte strings or should I post a patch?

ellerh, Sep 19 '23 07:09