mojo icon indicating copy to clipboard operation
mojo copied to clipboard

[BUG]: Creating a String from a DynamicVector drops the last character in the vector

Open sstadick opened this issue 1 year ago • 7 comments

Bug description

Converting a DynamicVector to a String using the String constructor results in a string that is missing the last element of the DynamicVecotor used to create it.

Steps to reproduce

The following code reproduces the issue:

fn test_stringify() raises:
    var example = DynamicVector[Int8]()
    example.append(ord("e"))
    example.append(ord("x"))

    var container = DynamicVector[Int8]()
    for i in range(len(example)):
        container.append(example[i])
    let stringifed = String(container)
    assert_equal("ex", stringifed)
    # Unhandled exception caught during execution: AssertionError: ex is not equal to e

System information

❯ mojo -v
mojo 0.7.0 (af002202)
❯ modular -v
modular 0.4.1 (2d8afe15)

On an M1 Mac.

sstadick avatar Feb 14 '24 22:02 sstadick

I'm not sure if this is a doc bug or an API design issue--maybe a bit of both.

The first constructor for String() takes a DynamicVector[Int8] argument which it expects to be a null-terminated array of characters. While that is the internal format that String uses, jit yields unexpected results in this case.

The constructors that take pointers also assume that the data is already null-terminated and that the provided len includes the terminator.

There's a private _unsafe_from_bytes() method that does exactly what @sstadick is looking to do here. (And a public asbytes() method that does the reverse).

TL;DR: It seems kind of backwards to me that the default paths in/out of String assume that the buffer coming in is null-terminated. But if that's the case we should just make sure the API docs for any method that takes a DynamicVector or Pointer mentions whether the data in question is null-terminated.

arthurevans avatar Feb 14 '24 23:02 arthurevans

So workaround is--use _unsafe_from_bytes() or use container.append(0) to null-terminate your string first.

arthurevans avatar Feb 14 '24 23:02 arthurevans

Wow, this looks like we should require a keyword argument here, so String(nul_terminated=container), and it should check/abort if not nul terminated.

lattner avatar Feb 15 '24 01:02 lattner

Thanks @arthurevans my question is answered, the behavior does at a minimum seem unexpected, feel free to close this if needed.

RE asbytes(), I don't see that in the String docs? https://docs.modular.com/mojo/stdlib/builtin/string.html Is there another place for docs that I should be looking, or is that maybe in a yet-to-be-public version?

sstadick avatar Feb 15 '24 16:02 sstadick

@arthurevans What I get from your response is that the internal representation of String requires null termination. Is that accurate? Why go with that design? Maybe we can add those information to the doc?

soraros avatar Feb 15 '24 18:02 soraros

@sstadick private methods (with a leading underscore) don't appear in the docs but do appear in code completion hints. In this case, I don't see it there, either, so it must be new. Sorry for the confusion. At any event, adding the null yourself is probably the better path for now. I've opened a PR to clarify the API docs as a short-term fix until the API can be updated.

@soraros yes, the internal buffer requires null termination. I believe this is not really for Mojo's sake, since Mojo tracks the length of the string. But it simplifies communicating with lower-level facilities that take C-style strings. (For example, if you open a file on Linux, at some point the Mojo std library will call into libc's open(), which takes the pathname as C-style string.)

arthurevans avatar Feb 16 '24 21:02 arthurevans

@arthurevans Thanks for the reply. That explanation is both foreseeable and unfortunate. However, why didn't Mojo go with a separate CString type like Rust and Swift? Can we at least have the rationale documented somewhere?

soraros avatar Feb 16 '24 22:02 soraros