Nim icon indicating copy to clipboard operation
Nim copied to clipboard

Add byte-array-to-string conversion to system module

Open snej opened this issue 5 years ago • 10 comments

Summary

The standard library could use a function to convert an openarray[char] or openarray[byte] to a string.

Description

Converting between byte arrays and strings is pretty common, in my experience. It's possible to cast from a seq[char] or seq[byte] to a string using cast[string](...), and the compiler allows you to do the same with an openarray, but at runtime you get a garbage string object. (It appears that the initial bytes of the array get reinterpreted as the string's length and address.) Here's an example.

This seems rather dangerous: an idiom that works in one case fails in another, but subtly enough that it might be overlooked at first. Results I've observed have been either a several-megabyte string of garbage, or an out-of-memory exception (when Nim tries to allocate a copy of a terabytes-long string.) I'm guessing that mutating the string might lead to memory corruption ... and it might be possible to craft a byte array that overwrites specific memory addresses, making this a potential security exploit.

I realize the cast operator is clearly documented as dangerous! The problem here isn't that using it causes undefined behavior, rather that (a) it's the only simple way to convert an array to a string, and worse, (b) it works for some types of arrays but not others.

Proposal

Add some toString(...) procs to the system module:

proc toString(chars: openarray[char]): string
proc toString(bytes: openarray[byte]): string

($ would be a better name, but that operator is already defined in dollars.nim, and has a different purpose.)

Alternatives

The only way I've found is to create an empty string and copy the bytes:

proc toString(bytes: openarray[byte]): string =
  result = newString(bytes.len)
  copyMem(result[0].addr, bytes[0].unsafeAddr, bytes.len)

It's a fine solution, but we shouldn't make developers use highly unsafe language features just to perform a common operation! Instead, that should be the implementation of the standard library function.

snej avatar Jun 25 '20 16:06 snej

I agree with the general idea (common cases like this should not require a cast for reasons you've already mentioned); but I'd do it a bit differently than what you're suggesting.

  • not in system but in some low level module with no imports that can be imported by other modules; it could be either system/dollar.nim (yuck since system) so I'm leaning towards a new std/strings low-level module that other modules (including user code) can import
  • it should be compatible with https://github.com/nim-lang/RFCs/issues/191 in particular would take var result
  • see also https://github.com/nim-lang/Nim/pull/13792 which has some relevance to this
  • compiler/strutils2.nim could be merged with it

timotheecour avatar Jun 25 '20 21:06 timotheecour

This would be really useful.

Ideally, it should be possible to get an immutable string from a openArray[byte] (and a way to get an immutable openArray[byte] from a string) without having to copyMem the datas.

Clonkk avatar Oct 22 '20 09:10 Clonkk

I propose this solution in std or fusion:

func toByteSeq*(str: string): seq[byte] {.inline.} =
  ## Converts a string to the corresponding byte sequence.
  @(str.toOpenArrayByte(0, str.high))

func toString*(bytes: openArray[byte]): string {.inline.} =
  ## Converts a byte sequence to the corresponding string.
  let length = bytes.len
  if length > 0:
    result = newString(length)
    copyMem(result.cstring, bytes[0].unsafeAddr, length)

ringabout avatar Nov 04 '20 15:11 ringabout

Your inline here is cute, it's not like that the calling overhead of X nano seconds makes a difference for a memory allocation plus copy operation... The real problem here is insufficient design, if you used string for binary data just stick with it, don't convert it to seq of bytes back and forth...

Araq avatar Nov 04 '20 15:11 Araq

Yes, sometimes I need this. For examples, nimcrypto only provides openArray[byte] interface, I need to convert this to string.

proc randomString*(size = DefaultEntropy): string {.inline.} =
  ## Generates a new system random strings.
  result = size.randomBytesSeq.fromByteSeq

And sometimes I need to debug openArray[byte] or for tests, I want to see the pretty format for openArray[byte] and I need stream interface.

    let 
      serialize = [0'u8, 0, 2, 0, 8, 1, 71, 174, 20, 1, 99, 99, 0]

    var str = fromByteSeq(serialize)
    let 
      strm = newStringStream(str)

    discard strm.readFrameHeaders

ringabout avatar Nov 04 '20 15:11 ringabout

IMO, the functionality is needed.

When you use multiple external API, there are cases where you need one string and other where you need seq[byte].

My issue with the proposed solution :

  • It duplicates datas.
  • Having to allocate + copyMem every time is costly.
  • Ideally, there could very well be some APIs that store binary 16bit datas as seq[uint16] (not sure about this since casting seq between type of different sizeof is tricky).

Something like this (or whatever syntax is best) :

var buffer = newBuffer(bufsize)

with buffer as seq[uint16]:
  # ...
with buffer as seq[byte]:
  # ...
with buffer as string:
  # ...

Clonkk avatar Nov 04 '20 16:11 Clonkk

relevant but doesn't close this issue: https://github.com/nim-lang/Nim/pull/15951

timotheecour avatar Nov 13 '20 16:11 timotheecour

I don't think this is a "good first issue" at all because it's absolutely unclear what will be accepted and where to actually put it, even though the solution is pretty obvious.

Can't we just stick

template whenJSorVM(isTrue, isFalse: untyped) = # TODO: replace with a macro
  when nimvm:
    isTrue
  else:
    when (defined(js) or defined(nimdoc) or defined(nimscript)):
      isTrue
    else:
      isFalse

func toString[T: char|byte](src: openArray[T]): string {.noinit.} =
  ## Returns a new string with contents copied from `src`.
  result = newStringUninit(src.len)
  whenJSorVM:
    for i in 0..high(src):
      result[i] = src[i]
  do:
    if src.len > 0:
      copyMem(result[0].addr, src[0].unsafeAddr, src.len)

anywhere and mark it as unstable API and not wait for various tangential RFC (such as https://github.com/nim-lang/RFCs/issues/191)? This pattern is certainly very frequent in the wild so probably its existence is justified by now. This could be held under unstable API until we get string views or something else which is deemed "sufficiently designed".

ZoomRmc avatar Apr 16 '22 18:04 ZoomRmc

Ok, so this definitely needs to be in the system.nim because it also has a substr which has to be tweaked to take openArray[char] as well and it begs to use toString then.

Code block in my previous comment updated.

ZoomRmc avatar May 12 '22 17:05 ZoomRmc

I agree with @xflywind : Interopability between different APIs in the same codebase would be improved with this feature.

saemideluxe avatar Jun 03 '22 15:06 saemideluxe

This can be closed as this was sneaked into system as a substr overload with ea0e45e (#20527).

ZoomRmc avatar Sep 28 '25 23:09 ZoomRmc