Add byte-array-to-string conversion to system module
Summary
The standard library could use a function to convert an openarray[char] or openarray[byte] to a string.
Description
Converting between byte arrays and strings is pretty common, in my experience. It's possible to cast from a seq[char] or seq[byte] to a string using cast[string](...), and the compiler allows you to do the same with an openarray, but at runtime you get a garbage string object. (It appears that the initial bytes of the array get reinterpreted as the string's length and address.) Here's an example.
This seems rather dangerous: an idiom that works in one case fails in another, but subtly enough that it might be overlooked at first. Results I've observed have been either a several-megabyte string of garbage, or an out-of-memory exception (when Nim tries to allocate a copy of a terabytes-long string.) I'm guessing that mutating the string might lead to memory corruption ... and it might be possible to craft a byte array that overwrites specific memory addresses, making this a potential security exploit.
I realize the cast operator is clearly documented as dangerous! The problem here isn't that using it causes undefined behavior, rather that (a) it's the only simple way to convert an array to a string, and worse, (b) it works for some types of arrays but not others.
Proposal
Add some toString(...) procs to the system module:
proc toString(chars: openarray[char]): string
proc toString(bytes: openarray[byte]): string
($ would be a better name, but that operator is already defined in dollars.nim, and has a different purpose.)
Alternatives
The only way I've found is to create an empty string and copy the bytes:
proc toString(bytes: openarray[byte]): string =
result = newString(bytes.len)
copyMem(result[0].addr, bytes[0].unsafeAddr, bytes.len)
It's a fine solution, but we shouldn't make developers use highly unsafe language features just to perform a common operation! Instead, that should be the implementation of the standard library function.
I agree with the general idea (common cases like this should not require a cast for reasons you've already mentioned); but I'd do it a bit differently than what you're suggesting.
- not in system but in some low level module with no imports that can be imported by other modules; it could be either
system/dollar.nim(yuck sincesystem) so I'm leaning towards a newstd/stringslow-level module that other modules (including user code) can import - it should be compatible with https://github.com/nim-lang/RFCs/issues/191 in particular would take
var result - see also https://github.com/nim-lang/Nim/pull/13792 which has some relevance to this
compiler/strutils2.nimcould be merged with it
This would be really useful.
Ideally, it should be possible to get an immutable string from a openArray[byte] (and a way to get an immutable openArray[byte] from a string) without having to copyMem the datas.
I propose this solution in std or fusion:
func toByteSeq*(str: string): seq[byte] {.inline.} =
## Converts a string to the corresponding byte sequence.
@(str.toOpenArrayByte(0, str.high))
func toString*(bytes: openArray[byte]): string {.inline.} =
## Converts a byte sequence to the corresponding string.
let length = bytes.len
if length > 0:
result = newString(length)
copyMem(result.cstring, bytes[0].unsafeAddr, length)
Your inline here is cute, it's not like that the calling overhead of X nano seconds makes a difference for a memory allocation plus copy operation... The real problem here is insufficient design, if you used string for binary data just stick with it, don't convert it to seq of bytes back and forth...
Yes, sometimes I need this.
For examples, nimcrypto only provides openArray[byte] interface, I need to convert this to string.
proc randomString*(size = DefaultEntropy): string {.inline.} =
## Generates a new system random strings.
result = size.randomBytesSeq.fromByteSeq
And sometimes I need to debug openArray[byte] or for tests, I want to see the pretty format for openArray[byte] and I need stream interface.
let
serialize = [0'u8, 0, 2, 0, 8, 1, 71, 174, 20, 1, 99, 99, 0]
var str = fromByteSeq(serialize)
let
strm = newStringStream(str)
discard strm.readFrameHeaders
IMO, the functionality is needed.
When you use multiple external API, there are cases where you need one string and other where you need seq[byte].
My issue with the proposed solution :
- It duplicates datas.
- Having to allocate +
copyMemevery time is costly. - Ideally, there could very well be some APIs that store binary 16bit datas as
seq[uint16](not sure about this since casting seq between type of differentsizeofis tricky).
Something like this (or whatever syntax is best) :
var buffer = newBuffer(bufsize)
with buffer as seq[uint16]:
# ...
with buffer as seq[byte]:
# ...
with buffer as string:
# ...
relevant but doesn't close this issue: https://github.com/nim-lang/Nim/pull/15951
I don't think this is a "good first issue" at all because it's absolutely unclear what will be accepted and where to actually put it, even though the solution is pretty obvious.
Can't we just stick
template whenJSorVM(isTrue, isFalse: untyped) = # TODO: replace with a macro
when nimvm:
isTrue
else:
when (defined(js) or defined(nimdoc) or defined(nimscript)):
isTrue
else:
isFalse
func toString[T: char|byte](src: openArray[T]): string {.noinit.} =
## Returns a new string with contents copied from `src`.
result = newStringUninit(src.len)
whenJSorVM:
for i in 0..high(src):
result[i] = src[i]
do:
if src.len > 0:
copyMem(result[0].addr, src[0].unsafeAddr, src.len)
anywhere and mark it as unstable API and not wait for various tangential RFC (such as https://github.com/nim-lang/RFCs/issues/191)? This pattern is certainly very frequent in the wild so probably its existence is justified by now. This could be held under unstable API until we get string views or something else which is deemed "sufficiently designed".
Ok, so this definitely needs to be in the system.nim because it also has a substr which has to be tweaked to take openArray[char] as well and it begs to use toString then.
Code block in my previous comment updated.
I agree with @xflywind : Interopability between different APIs in the same codebase would be improved with this feature.
This can be closed as this was sneaked into system as a substr overload with ea0e45e (#20527).