julia icon indicating copy to clipboard operation
julia copied to clipboard

Add takestring! and make `String(::Memory)` not truncate

Open jakobnissen opened this issue 9 months ago • 23 comments

This function creates a string from a byte vector, truncating the vector and reusing its memory where possible.

This also removes the truncating behaviour of String(::Memory) - see #54369

Still needs to be done

  • [x] Make tests pass
  • [x] Comb through Base to find places where take_string!(::Memory) should be used instead of String(::Memory) for efficiency [edit: couldn't find any]

Closes #54369

jakobnissen avatar May 06 '24 06:05 jakobnissen

Also, check for uses of StringMemory for example: https://github.com/JuliaLang/julia/blob/f3561ae5af05713480aa9a8166501449305a0339/stdlib/FileWatching/src/pidfile.jl#L283-L288 #53962

nhz2 avatar May 06 '24 12:05 nhz2

The name makes me think that take_string!(iobuffer) should be a replacement/synonym for String(take!(iobuffer))?

stevengj avatar May 06 '24 22:05 stevengj

I would be in favor of that. Every time I have used an IOBuffer I have had to go to the docs to figure out how to get the data out of it correctly.

oscardssmith avatar May 07 '24 00:05 oscardssmith

So what's the specification of String(_): does it truncate or not? Or does it depend on the specific argument type? An "it depends" is not a very nice property for generic programming.

jariji avatar May 07 '24 18:05 jariji

It does not truncate - except when the argument is a Vector{UInt8}. Yes, that's bad and in my opinion it shouldn't have been like that. But it's documented, and therefore can't change. I made this PR to make sure more exceptions aren't added.

jakobnissen avatar May 07 '24 19:05 jakobnissen

Do we need an underscore here? takestring! is pretty readable

stevengj avatar May 09 '24 14:05 stevengj

Triage had a long talk about this, https://github.com/JuliaLang/julia/issues/54369, https://github.com/JuliaLang/julia/issues/54156 and https://github.com/JuliaLang/julia/pull/54273. The conclusions we came to are

  1. String(Memory) should copy
  2. String(Array) should truncate the array but not the Memory
  3. triage was right last time wrt view and reshape of Memory
  4. using the an Array aliased with another Array that is turned into a String is UB.
  5. take_string! is a much better API for IOBuffer et al.

oscardssmith avatar May 09 '24 15:05 oscardssmith

The style guide says not to use underscores "when readable" without, which I think applies here.

stevengj avatar May 09 '24 19:05 stevengj

  1. using the an Array aliased with another Array that is turned into a String is UB.

Can this be refined to "mutating an Array after its alias has been turned into a String is UB"?

edit: For the record , I don't think it's right for us to put this contract on an API that's not marked unsafe. I expect that this 'contract' is broken much more often than the UB is actually triggered though, which is probably what's saving us in practice.

topolarity avatar May 10 '24 16:05 topolarity

Yes. observing it is not great but not UB. the only UB is if you mutate it.

oscardssmith avatar May 10 '24 17:05 oscardssmith

takestring! can mutate the bytes in the underlying memory of a Vector input, it is not just making a String view.

nhz2 avatar May 10 '24 17:05 nhz2

After triage, this PR now contains the following changes:

  • [x] Add takestring!. This function can be used with IOBuffer and Vector{UInt8}, returning a String. It resets the input to its initial state. This means it empties vectors, and empties IOBuffer if and only if the buffer is writable. This behaviour of not empying the input if the input is not writable is slightly weird, but matches take!
  • [x] Add the internal methods Base.unsafe_takestring! and Base.unsafe_takestring, which are similar to takestring!, except they leave the argument (IOBuffer or Memory) in an unusable corrupt state. They are useful for the common pattern when code creates a buffer, takes a string from the buffer and then returns the string without touching the buffer again.
  • [x] String(::Memory) now no longer truncates the memory
  • [x] String(::Vector{UInt8}) now no longer truncates the underlying memory of the string. It still empties the vector, and assigns a new (empty) memory to the vector just like before.
  • [x] The many different calls to String(take!(::IOBuffer)) and String(_unsafe_take!(::IOBuffer)) has been replaced with takestring! and the two internal methods Base.unsafe_takestring! and Base.unsafe_takestring). Not all calls to String(take!(...)) has been replaced, as there are many, many of such calls in the test suite and no need to change them.
  • [x] Replace most uses of String(take!(::IOBuffer)) in Base and the stdlibs
  • [x] Add tests

jakobnissen avatar May 10 '24 18:05 jakobnissen

This should be good to go as soon as tests pass. See this comment for a complete list of changes in this PR: https://github.com/JuliaLang/julia/pull/54372#issuecomment-2105059874

jakobnissen avatar May 13 '24 11:05 jakobnissen

The current behavior is already approved by triage (https://github.com/JuliaLang/julia/issues/24388#issuecomment-341546898) so it feels a bit odd that there is a new triage later that just goes and overrides this without any reference (at least from what I can see) to the original decision.

This also lacks review from people heavily invested in the original design like @JeffBezanson.

KristofferC avatar May 13 '24 14:05 KristofferC

@KristofferC It is useful context and was not discussed in the previous triage call. That said, I think none of what triage said in this call contradicts the previous call. Almost all of the decision points for triage here were about the behavior when Memory is involved which didn't exist in 2017. The reason re-discussion of such a similar issue was needed is that as of the previous conversation, the solution was to 0 out the length field which was relatively bullet-proof since Julia at the time didn't let you access the underlying fixed length thing upon which Vector is built. As of Julia 1.11, Array is a "normal" julia object, and Memory is the fixed length backing, so the conversation got pushed a layer further down, at which point the previous solution no longer works without some modification.

The one part where the current triage call somewhat diverged from the previous is the decision that a new function that does the combination of removing data from an IOBuffer and turning that into a String is a good idea. That doesn't seem to have been on the radar as a possible option in the original triage decision, and I do not know why.

oscardssmith avatar May 13 '24 15:05 oscardssmith

Is the problem here then that we don't have any way to do a safe runtime "move" out of a Memory?

I'm aware of three options for a copy/performance tradeoff like this:

  • guarantee a copy
  • leave behind an empty object (C++-style "move")
  • leave behind an invalid object (e.g. one which may be potentially freed or whose mutation violates immutability, etc. - triggers UB)

I think our policy on the ownership theft of String was that it was allowed because we implemented a runtime "move" in the style of (2). The user might see unexpected mutation/moves, but usage of the left-over object doesn't directly trigger UB.

I'm assuming that's not an option for Memory - because of the assumption that it's always allocated and fixed-size?

topolarity avatar May 13 '24 20:05 topolarity

String(Memory) has to copy according to the docs of String since Memory is an AbstractVector{UInt8} and isn't a Vector{UInt8}.

nhz2 avatar May 13 '24 21:05 nhz2

On @KristofferC 's recommendation, I have now read through the old decisions in #24388, #25241, #25846 and #26093. I still believe the changes in this PR and #54452 are for the better.

In #24388, the two main points are: 1. It should not be possible to mutate strings except through unsafe_. 2. Always requiring a copy is a performance disaster

The other PRs are less relevant - they pertain to codeunits and unsafe_wrap, except #26093 which reiterates that strings should not be mutated, and also says:

One place where something like String! or string! would be valuable though is to replace String(take!(iobuffer)). Perhaps take!(iobuffer, String). That would allow us to avoid an extra allocation since we know the caller doesn't want the vector, just the string underneath, so the buffer can reuse the vector object (just not its storage).

Which essentially suggests encapsulating this whole sketchy thing with creating strings from mutable data in a single function, as done here.

So, I just don't see how the old triage decisions conflict with the PR here, and given that triage has approved this already, I think we should move forward with it.

To motivate it a little more, I think the status quo is not acceptable because it leads to memory bugs too easily:

Here is a way that works on released versions of Julia unless this PR or #54452 is applied to mutate strings

b = IOBuffer()
write(b, "abc")
v1 = take!(b)
v2 = view(v1, :)
s = String(v1) # "abc"
fill!(v2, 0)
s # now all zeros

Here is a bug on master where a vector now is able to read from out of bounds (fixed by this PR):

v1 = Memory{UInt8}(codeunits("abc"))
v2 = view(v1, 1:3)
s = String(v1)
println(isempty(v2.ref.mem)) # true
v2[1] # accesses v2.ref.mem OOB

More broadly speaking, I'm not suggesting that we compromise on our APIs in order to categorically prevent unsafe behaviour. We're not Rust, and need to be practical about what is unsafe. However:

  1. This is not some crazy edge case, it's pretty easily accessible memory bugs that can happen by accident way too easily through no obvious fault of the user

  2. This is not an API compromise. In my opinion, takestring! provided here, and the change in #54452 makes Julia's API more consistent, and less surprising, because it allows users to not have to rely on the weird magic behaviour of truncating String, while not providing people with vectors that are sometimes string-backed and sometimes not.

jakobnissen avatar May 14 '24 11:05 jakobnissen

@jakobnissen Thank you for your tireless efforts here. I do like takestring! and I agree with the approach recommended by triage. See comments about ensuring copies are avoided where possible.

From the implementation here the difference between takestring! and unsafe_takestring! is not clear to me. Can an unsafe version share data in more cases? And my understanding is that the unsafe version would be used internally when we know the IOBuffer has no other references? However in the changes here it is not consistently used that way. Maybe we don't need both functions for IOBuffer?

JeffBezanson avatar May 17 '24 20:05 JeffBezanson

Thanks for the feedback. I've now changed both takestring! and unsafe_takestring! to avoid making copies, where possible. I didn't do that previously, because I mistakenly thought that IOBuffers' memory wasn't backed by strings so it would copy anyway. That was weird, because I did know that it was supposed to be zero-copy, but I guess after revising it a few times I lost the forest for the trees.

The unsafe variant is a tiny bit faster because a) it assumes the buffer is writable and seekable, and b) it doesn't spend time resetting the buffer to a usable state. In benchmarks, it's only about 10% faster, but still, worth having IMO.

Edit: I think the 32-bit Windows test failure is spurious. Tests do seem to pass.

Edit2: With these changes, takestring! is now a little faster than String(take!(buf)): The following code

function foo(data)
    buf = IOBuffer()
    write(buf, data)
    String(take!(buf)) # or takestring!(buf) or unsafe_takestring!(buf)
end

@benchmark foo("abc")

Takes the following time:

  • String(take!(buf)) on 1.11-beta1: 74 ns
  • String(take!(buf)) on this PR: 64 ns
  • takestring!(buf): 54 ns
  • unsafe_takestring!(buf): 48 ns

jakobnissen avatar May 18 '24 12:05 jakobnissen

Rebased. Test failures look unrelated, but it's suspicious that they now failed twice on Windows.

jakobnissen avatar May 27 '24 12:05 jakobnissen

Bump. What needs to be done to get this merged?

jakobnissen avatar Jun 23 '24 12:06 jakobnissen

FYI FWIW there are unresolved conflicts, as indicated by Github.

nsajko avatar Jun 23 '24 12:06 nsajko

Bump

jakobnissen avatar Jul 22 '24 11:07 jakobnissen

I'm not completely sold on the name unsafe_takestring(::Memory). It seems quite different from unsafe_takestring!(::IOBuffer), since it doesn't "take" anything from the Memory.

Maybe unsafe_string(::Memory)? (This would export it, since unsafe_string is an exported symbol, but if we wanted to avoid this for now we could call it _unsafe_string(::Memory).)

stevengj avatar Jul 24 '24 13:07 stevengj

IMO it's fine. It's similar to unsafe_takestring!, which doesn't take anything from the IOBuffer either. Or, viewed from another angle, it does take the memory away in the sense that the memory cannot be safely modified after use, because the ownership has been "taken" by someone else. That being said, I don't mind if it's changed.

jakobnissen avatar Jul 24 '24 13:07 jakobnissen

After a call to unsafe_takestring(::Memory) it is also unsafe to read from the memory because Julia can free the memory. For example, here is an example that leads to a seg fault.

function foo(n)
    m = Base.StringMemory(n)
    m .= 0x43
    Base.unsafe_takestring(m)
    m
end

m = foo(1_000_000)
GC.gc()
GC.gc()
GC.gc()
GC.gc()
m[1]

nhz2 avatar Jul 24 '24 14:07 nhz2

Bump - all comments have been addressed. Failures seem unrelated.

jakobnissen avatar Sep 09 '24 09:09 jakobnissen

If it matters, this SGTM. Even if we keep around the mutating behavior of String(Vector{UInt8}) for now until v2, it still seems good for readability to add these more precise function names now.

vtjnash avatar Oct 25 '24 15:10 vtjnash