julia
julia copied to clipboard
Add takestring! and make `String(::Memory)` not truncate
This function creates a string from a byte vector, truncating the vector and reusing its memory where possible.
This also removes the truncating behaviour of String(::Memory)
- see #54369
Still needs to be done
- [x] Make tests pass
- [x] Comb through Base to find places where
take_string!(::Memory)
should be used instead ofString(::Memory)
for efficiency [edit: couldn't find any]
Closes #54369
Also, check for uses of StringMemory
for example:
https://github.com/JuliaLang/julia/blob/f3561ae5af05713480aa9a8166501449305a0339/stdlib/FileWatching/src/pidfile.jl#L283-L288
#53962
The name makes me think that take_string!(iobuffer)
should be a replacement/synonym for String(take!(iobuffer))
?
I would be in favor of that. Every time I have used an IOBuffer I have had to go to the docs to figure out how to get the data out of it correctly.
So what's the specification of String(_)
: does it truncate or not? Or does it depend on the specific argument type? An "it depends" is not a very nice property for generic programming.
It does not truncate - except when the argument is a Vector{UInt8}
. Yes, that's bad and in my opinion it shouldn't have been like that. But it's documented, and therefore can't change. I made this PR to make sure more exceptions aren't added.
Do we need an underscore here? takestring!
is pretty readable
Triage had a long talk about this, https://github.com/JuliaLang/julia/issues/54369, https://github.com/JuliaLang/julia/issues/54156 and https://github.com/JuliaLang/julia/pull/54273. The conclusions we came to are
-
String(Memory)
should copy -
String(Array)
should truncate the array but not the Memory - triage was right last time wrt
view
andreshape
ofMemory
- using the an
Array
aliased with anotherArray
that is turned into aString
is UB. -
take_string!
is a much better API for IOBuffer et al.
The style guide says not to use underscores "when readable" without, which I think applies here.
- using the an Array aliased with another Array that is turned into a String is UB.
Can this be refined to "mutating an Array after its alias has been turned into a String is UB"?
edit: For the record , I don't think it's right for us to put this contract on an API that's not marked unsafe
. I expect that this 'contract' is broken much more often than the UB is actually triggered though, which is probably what's saving us in practice.
Yes. observing it is not great but not UB. the only UB is if you mutate it.
takestring!
can mutate the bytes in the underlying memory of a Vector
input, it is not just making a String
view.
After triage, this PR now contains the following changes:
- [x] Add
takestring!
. This function can be used withIOBuffer
andVector{UInt8}
, returning aString
. It resets the input to its initial state. This means it empties vectors, and emptiesIOBuffer
if and only if the buffer is writable. This behaviour of not empying the input if the input is not writable is slightly weird, but matchestake!
- [x] Add the internal methods
Base.unsafe_takestring!
andBase.unsafe_takestring
, which are similar totakestring!
, except they leave the argument (IOBuffer
orMemory
) in an unusable corrupt state. They are useful for the common pattern when code creates a buffer, takes a string from the buffer and then returns the string without touching the buffer again. - [x]
String(::Memory)
now no longer truncates the memory - [x]
String(::Vector{UInt8})
now no longer truncates the underlying memory of the string. It still empties the vector, and assigns a new (empty) memory to the vector just like before. - [x] The many different calls to
String(take!(::IOBuffer))
andString(_unsafe_take!(::IOBuffer))
has been replaced withtakestring!
and the two internal methodsBase.unsafe_takestring!
andBase.unsafe_takestring
). Not all calls toString(take!(...))
has been replaced, as there are many, many of such calls in the test suite and no need to change them. - [x] Replace most uses of
String(take!(::IOBuffer)) in Base and the stdlibs
- [x] Add tests
This should be good to go as soon as tests pass. See this comment for a complete list of changes in this PR: https://github.com/JuliaLang/julia/pull/54372#issuecomment-2105059874
The current behavior is already approved by triage (https://github.com/JuliaLang/julia/issues/24388#issuecomment-341546898) so it feels a bit odd that there is a new triage later that just goes and overrides this without any reference (at least from what I can see) to the original decision.
This also lacks review from people heavily invested in the original design like @JeffBezanson.
@KristofferC It is useful context and was not discussed in the previous triage call. That said, I think none of what triage said in this call contradicts the previous call. Almost all of the decision points for triage here were about the behavior when Memory
is involved which didn't exist in 2017. The reason re-discussion of such a similar issue was needed is that as of the previous conversation, the solution was to 0 out the length field which was relatively bullet-proof since Julia at the time didn't let you access the underlying fixed length thing upon which Vector
is built. As of Julia 1.11, Array
is a "normal" julia object, and Memory
is the fixed length backing, so the conversation got pushed a layer further down, at which point the previous solution no longer works without some modification.
The one part where the current triage call somewhat diverged from the previous is the decision that a new function that does the combination of removing data from an IOBuffer
and turning that into a String
is a good idea. That doesn't seem to have been on the radar as a possible option in the original triage decision, and I do not know why.
Is the problem here then that we don't have any way to do a safe runtime "move" out of a Memory?
I'm aware of three options for a copy/performance tradeoff like this:
- guarantee a copy
- leave behind an empty object (C++-style "move")
- leave behind an invalid object (e.g. one which may be potentially freed or whose mutation violates immutability, etc. - triggers UB)
I think our policy on the ownership theft of String
was that it was allowed because we implemented a runtime "move" in the style of (2). The user might see unexpected mutation/moves, but usage of the left-over object doesn't directly trigger UB.
I'm assuming that's not an option for Memory
- because of the assumption that it's always allocated and fixed-size?
String(Memory)
has to copy according to the docs of String
since Memory
is an AbstractVector{UInt8}
and isn't a Vector{UInt8}
.
On @KristofferC 's recommendation, I have now read through the old decisions in #24388, #25241, #25846 and #26093. I still believe the changes in this PR and #54452 are for the better.
In #24388, the two main points are:
1. It should not be possible to mutate strings except through unsafe_
.
2. Always requiring a copy is a performance disaster
The other PRs are less relevant - they pertain to codeunits
and unsafe_wrap
, except #26093 which reiterates that strings should not be mutated, and also says:
One place where something like String! or string! would be valuable though is to replace String(take!(iobuffer)). Perhaps take!(iobuffer, String). That would allow us to avoid an extra allocation since we know the caller doesn't want the vector, just the string underneath, so the buffer can reuse the vector object (just not its storage).
Which essentially suggests encapsulating this whole sketchy thing with creating strings from mutable data in a single function, as done here.
So, I just don't see how the old triage decisions conflict with the PR here, and given that triage has approved this already, I think we should move forward with it.
To motivate it a little more, I think the status quo is not acceptable because it leads to memory bugs too easily:
Here is a way that works on released versions of Julia unless this PR or #54452 is applied to mutate strings
b = IOBuffer()
write(b, "abc")
v1 = take!(b)
v2 = view(v1, :)
s = String(v1) # "abc"
fill!(v2, 0)
s # now all zeros
Here is a bug on master where a vector now is able to read from out of bounds (fixed by this PR):
v1 = Memory{UInt8}(codeunits("abc"))
v2 = view(v1, 1:3)
s = String(v1)
println(isempty(v2.ref.mem)) # true
v2[1] # accesses v2.ref.mem OOB
More broadly speaking, I'm not suggesting that we compromise on our APIs in order to categorically prevent unsafe behaviour. We're not Rust, and need to be practical about what is unsafe. However:
-
This is not some crazy edge case, it's pretty easily accessible memory bugs that can happen by accident way too easily through no obvious fault of the user
-
This is not an API compromise. In my opinion,
takestring!
provided here, and the change in #54452 makes Julia's API more consistent, and less surprising, because it allows users to not have to rely on the weird magic behaviour of truncatingString
, while not providing people with vectors that are sometimes string-backed and sometimes not.
@jakobnissen Thank you for your tireless efforts here. I do like takestring!
and I agree with the approach recommended by triage. See comments about ensuring copies are avoided where possible.
From the implementation here the difference between takestring!
and unsafe_takestring!
is not clear to me. Can an unsafe version share data in more cases? And my understanding is that the unsafe version would be used internally when we know the IOBuffer has no other references? However in the changes here it is not consistently used that way. Maybe we don't need both functions for IOBuffer?
Thanks for the feedback. I've now changed both takestring!
and unsafe_takestring!
to avoid making copies, where possible. I didn't do that previously, because I mistakenly thought that IOBuffers
' memory wasn't backed by strings so it would copy anyway. That was weird, because I did know that it was supposed to be zero-copy, but I guess after revising it a few times I lost the forest for the trees.
The unsafe
variant is a tiny bit faster because a) it assumes the buffer is writable and seekable, and b) it doesn't spend time resetting the buffer to a usable state. In benchmarks, it's only about 10% faster, but still, worth having IMO.
Edit: I think the 32-bit Windows test failure is spurious. Tests do seem to pass.
Edit2: With these changes, takestring!
is now a little faster than String(take!(buf))
:
The following code
function foo(data)
buf = IOBuffer()
write(buf, data)
String(take!(buf)) # or takestring!(buf) or unsafe_takestring!(buf)
end
@benchmark foo("abc")
Takes the following time:
-
String(take!(buf))
on 1.11-beta1: 74 ns -
String(take!(buf))
on this PR: 64 ns -
takestring!(buf)
: 54 ns -
unsafe_takestring!(buf)
: 48 ns
Rebased. Test failures look unrelated, but it's suspicious that they now failed twice on Windows.
Bump. What needs to be done to get this merged?
FYI FWIW there are unresolved conflicts, as indicated by Github.
Bump
I'm not completely sold on the name unsafe_takestring(::Memory)
. It seems quite different from unsafe_takestring!(::IOBuffer)
, since it doesn't "take" anything from the Memory
.
Maybe unsafe_string(::Memory)
? (This would export it, since unsafe_string
is an exported symbol, but if we wanted to avoid this for now we could call it _unsafe_string(::Memory)
.)
IMO it's fine. It's similar to unsafe_takestring!
, which doesn't take anything from the IOBuffer either. Or, viewed from another angle, it does take the memory away in the sense that the memory cannot be safely modified after use, because the ownership has been "taken" by someone else.
That being said, I don't mind if it's changed.
After a call to unsafe_takestring(::Memory)
it is also unsafe to read from the memory because Julia can free the memory. For example, here is an example that leads to a seg fault.
function foo(n)
m = Base.StringMemory(n)
m .= 0x43
Base.unsafe_takestring(m)
m
end
m = foo(1_000_000)
GC.gc()
GC.gc()
GC.gc()
GC.gc()
m[1]
Bump - all comments have been addressed. Failures seem unrelated.
If it matters, this SGTM. Even if we keep around the mutating behavior of String(Vector{UInt8}) for now until v2, it still seems good for readability to add these more precise function names now.