julia icon indicating copy to clipboard operation
julia copied to clipboard

New `copystring` function

Open nhz2 opened this issue 9 months ago • 13 comments

Since calling String on a view of an AbstractVector{UInt8} is potential UB https://github.com/JuliaLang/julia/pull/54372#issuecomment-2102885169 I think it would be helpful to have a function.

copystring(v::AbstractVector{UInt8}) = take_string!(copyto!(StringMemory(length(v)), v)))

I could use this to fix ZipArchives.jl by replacing current uses of the String constructor on vectors of bytes with copystring.

Would this be better in a separate CopyStrings.jl package?

nhz2 avatar May 09 '24 17:05 nhz2

Here is an example where using the String constructor on a Vector{UInt8} can result in mutating a String

io = IOBuffer()
write(io, [0x61,0x62,0x63])
seekstart(io)
a = read(io, 3)
b = reshape(a, 1, 3)
c = reshape(b, 3)
s = String(a)
@show s
c[1] = 0x62
@show s

This changes s from "abc" to "bbc" This happens on Julia 1.6 as well as nightly.

nhz2 avatar May 10 '24 15:05 nhz2

copystring sounds like a string is being copied which is confusing.

KristofferC avatar May 10 '24 15:05 KristofferC

Maybe bytes2string is a better name?

nhz2 avatar May 11 '24 18:05 nhz2

Python calls this function bytes.decode.

bytes2string is clear, which is the most important factor imo.

2 names in Base: bytes2hex, deg2rad, hex2bytes, hex2bytes!, rad2deg. to names in Base: htol, hton, ltoh, ntoh.

It should maybe take a Type{AbstractString} argument so it can support Strs.jl etc.

jariji avatar May 11 '24 18:05 jariji

There is already a transcode(::Type{String}, src) function. https://github.com/JuliaLang/julia/blob/d01d2561e69822667a8d809064109cd46fba22c1/base/strings/cstring.jl#L179

Maybe this can be changed to avoid mutating the input.

nhz2 avatar May 11 '24 19:05 nhz2

One solution could be to make take! and read on IOBuffer not use Base.StringVector, but allocate a normal vector. This will make using IOBuffer for creating strings less efficient, but that is addressed by the new takestring! function in https://github.com/JuliaLang/julia/pull/54372.

I'm willing to pitch a PR if that's an acceptable solution.

jakobnissen avatar May 13 '24 12:05 jakobnissen

I think that is a good solution.

oscardssmith avatar May 13 '24 12:05 oscardssmith

This will make using IOBuffer for creating strings less efficient,

That seems completely unacceptable? Isn't the whole reason not to make this slow pretty much the reason for the current behavior of String?

KristofferC avatar May 13 '24 13:05 KristofferC

@KristofferC the point is that with this change takestring!(iobuffer) will be as efficient as String(take!(iobuffer)) is now, while String(take!(iobuffer)) will have a copy. The reason for this change is that a single takestring! fully encapsulates the sketchy part where an immutable String is created from mutable data without a copy, while our current idiom for this exposes the mutable object to the user which creates the possibility of the user accidentally holding on to a reference of the mutable data.

With this change in combination with https://github.com/JuliaLang/julia/pull/54372, existing users will see a regression that is fixed by replacing instances of String(take!(iobuffer)) with takestring!(iobuffer) (we will likely make a Compat PR to make takestring! available in older julia versions).

oscardssmith avatar May 13 '24 13:05 oscardssmith

he point is that with this change takestring!(iobuffer) will be as efficient as String(take!(iobuffer)) is now, while String(take!(iobuffer)) will have a copy.

The whole reason we didn't do this X years ago was that adding this type of perf regression to code run everywhere was not acceptable. I don't see why it would be now.

KristofferC avatar May 13 '24 13:05 KristofferC

This is the comment I was thinking about: https://github.com/JuliaLang/julia/issues/24388#issuecomment-341206183

KristofferC avatar May 13 '24 13:05 KristofferC

Those who don't know history are destined to repeat it.

Reading through https://github.com/JuliaLang/julia/issues/24388, https://github.com/JuliaLang/julia/pull/25241, https://github.com/JuliaLang/julia/pull/25846, https://github.com/JuliaLang/julia/pull/26093 is probably a good idea for people interested in this to know a bit about the discussing and history that has lead to the current behavior.

KristofferC avatar May 13 '24 13:05 KristofferC

String(copy(v)) still makes two copies as noted in https://github.com/JuliaLang/julia/pull/26093#issuecomment-366809258 In ZipArchives.jl I would like to avoid the extra copy. This is currently possible with unsafe_string, so an extra function in Base is not needed, but IMO it would be nice to avoid having to deal with pointers.

nhz2 avatar May 13 '24 15:05 nhz2

Fixed by #54927

nhz2 avatar Jul 31 '24 22:07 nhz2