nit icon indicating copy to clipboard operation
nit copied to clipboard

`NativeString::to_s` services should offer the safest behavior first

Open xymus opened this issue 9 years ago • 6 comments

The to_s services on NativeString have evolved gradually in the last few year, as a result the API has gotten inconsistent and hard to grasp. Another root problem is that there are 2 kinds of users of these services, FFI users and string implementers. Right off the bat, I would argue that the FFI users are in greater numbers and usually more novice than the string implementers who tend write a thesis on the subject.

The goal of this issue is to improve the API for it to offer a more progressive learning curve for new users.

Notable problems:

  1. There's an ambiguity on what is "safe", as there are 2 things to do to be safe: copy data into Nit memory and clean UTF-8. I propose to avoid "safe" and use instead "copy" vs "in place" and "clean" vs "dirty" (or "raw"?). The expressions "in place" and "dirty" should discourage the use of services made available for optimisation only.

  2. to_s_full is also imprecise (full of what? for non-string implementers it's hard to say), I propose to_s_in_place_with_unicode_length (@R4PaSs it could as well be ``to_s_in_place_with_bytelen` if it makes more sense to you). The name gets long but it should be used almost only internally. As of now, it is called 4 times in the Nit repo, 3 of which in generated code.

  3. As of now, NativeString::to_s is not documented and it has an unpredictible behavior. If the string is UTF-8 clean, the data is used in place, if it is dirty it is copied into Nit memory and cleaned.

    I propose that by default to_s services copy into memory and clean the UTF-8. This is the noob safe behavior which protect well other string services. It is also the desired behavior of 99% FFI use cases. Note that all of the FFI code in the Nit lib use the copy and clean services, or should use them and may crash at any moment.

    We keep the other to_s services for optimized manipulations. These can be used by experts when they know that they are safe to use, this includes in the compiler or in performance critical FFI modules.

~~4. We don't need so much with_length as only to_s doesn't have a length, it is already in the signature, and there are details on each method that are more important.~~

So here is the proposed new API, the old names have been moved to the comments at the end of each signatures.

# Get a `String` from the data at `self` copied into Nit memory
#
# Require: `self` is a null-terminated string.
fun to_s: String # Old: to_s_with_copy

# Get a `String` from `length` bytes at `self` copied into Nit memory
fun to_s_with_length(byte_length: Int): String # Old: to_s_with_copy_and_length

# Get a `String` from the null-terminated string at `self`
#
# Require: `self` is a null-terminated string.
#
# The result may point to the data at `self` or
# it may make a copy in Nit controlled memory.
# This method should only be used when `self` was allocated by the Nit GC,
# or when manually controlling the deallocation of `self`.
fun to_s_in_place: String # Old: to_s

# Get a `String` from `length` bytes at `self`
#
# The result may point to the data at `self` or
# it may make a copy in Nit controlled memory.
# This method should only be used when `self` was allocated by the Nit GC,
# or when manually controlling the deallocation of `self`.
fun to_s_in_place_with_length(byte_length: Int): String # Old: to_s_with_length

# Get a `String` from the raw `length` bytes at `self`
#
# The default value of `length` is the number of bytes before
# the first null character.
#
# The created `String` points to the data at `self`.
# This method should be used when `self` was allocated by the Nit GC,
# or when manually controlling the deallocation of `self`.
#
# /!\: This service does not clean the items for compliance with UTF-8,
# use only when the data has already been verified as valid UTF-8.
fun to_s_in_place_dirty(byte_length: nullable Int): String # Old: to_s_unsafe

# Get a `String` from the raw `bytelen` bytes at `self` with `unilen` Unicode characters
#
# The created `String` points to the data at `self`.
# This method should be used when `self` was allocated by the Nit GC,
# or when manually controlling the deallocation of `self`.
#
# /!\: This service does not clean the items for compliance with UTF-8,
# use only when the data has already been verified as valid UTF-8.
#
# SEE: `abstract_text::Text` for more info on the difference
# between `Text::bytelen` and `Text::length`.
fun to_s_in_place_with_unicode_length(byte_length, unicode_len: Int): String # Old: to_s_full
fun to_s_in_place_with_bytelen(byte_length, unicode_len: Int): String # I'm not convinced about this one, so here's an alternative
fun to_s_in_place_dirty_with_bytelen(byte_length, unicode_len: Int): String # Just to really discourage its use

ping @R4PaSs and @privat

xymus avatar May 06 '16 13:05 xymus

I updated the description to keep the behavior of the current to_s. It invalidates partially my point about with_length which remains useful is some cases.

xymus avatar May 06 '16 13:05 xymus

WTF? Are you trying to make some core lib of Nit POLA and KISS? +1

Some remarks:

  • what is the expected behavior of in_place if the string is not UTF8 clean?
  • Sometimes length is a nullable parameter, sometimes there is two functions. Is there a reason?
  • I'm starting to be confused. length in String is the unilen. strlen in C is the bytelen. Whene there is only a single length parameter in the C->Nit, eg. to_s_with_length, it might be not clear which one it is? Maybe use unilen or bytelen even if there only one argument?

privat avatar May 06 '16 13:05 privat

  • The string is copied if not UTF8 clean, unless the method is _dirty. It could be more of a to_s_try_in_place but I'm not sure it's necessary as the doc should describe it.

    I'm not a fan of these "we don't know what will happen" services, but I'm told that they are used internally for optimization. I imagine when passing from raw bytes already in Nit memory to strings. For FFI users, we could also expose a check service NativeString::is_clean_utf8 which would make it safe to call to_s_in_place_dirty afterwards, but it's just moving the problem on the client's side.

  • I didn't notice that nullable parameter... The first 4 methods should not use nullable parameters for compatibility with the FFI, but we could use them for the more complex/optimized/internal/dirty services.

    fun to_s_optimized(clean, copy: nullable Bool, byte_length, chars_length: nullable Int)

  • We could very well standardize the names, the parameter length is really Text::bytelen, and unilen is the real Text::length. Is this right @R4PaSs ?

xymus avatar May 06 '16 14:05 xymus

The proposed API suits me well.

Agreed, names can be standardized, when talking about NativeString, the length is effectively its bytelen, and its unilen is the length of a Text object.

I'm not a huge fan of the is_clean_utf8 approach, since the service can do it by its own, it would just cost more to first check and re-allocate/clean afterwards. The actual clean_utf8 was designed in the current way with this in mind.

The nullable/optional approach was considered, but yeah, it'd make all calls within FFI code heavier so I dropped it. Though Nit code using the to_s variants on NativeString could benefit from it.

lbajolet avatar May 06 '16 17:05 lbajolet

I've updated the issue description with better parameter names.

Another question, what do you think about replacing all the unsafe/optimized alternatives (to_s_in_place, to_s_in_place_with_length, to_s_in_place_dirty, to_s_in_place_with_unicode_length) with the single fun to_s_optimized(clean, copy: nullable Bool, byte_length, chars_length: nullable Int) (or similar)?

So there would only be 3 methods, where only the first 2 are recommended to use from FFI code:

fun to_s: String
fun to_s_with_length(byte_length: Int): String
fun to_s_optimized(clean, copy: nullable Bool, byte_length, unicode_length: nullable Int) # which default to the behavior of `to_s`

xymus avatar May 12 '16 16:05 xymus

Or, instead of to_s_optimized, we keep the name of the existing to_s_unsafe. So the method sounds like "choose your dangerous way to play with strings".

xymus avatar May 12 '16 16:05 xymus