julia
julia copied to clipboard
Switch supertype of CodeUnits from DenseVector to AbstractVector
This is a breaking change, but it was a bug that it subtyped DenseVector
previsouly, as there is no guarantee that CodeUnits are stored densely in memory.
Test failures currently seem unrelated
See #53996
As mentioned in #53996, since this is a breaking bugfix, this needs a PkgEval + Nanosoldier run in case someone relied on its supertype. Depending on the results of these runs, it may be triage worthy.
@nanosoldier runbenchmarks(ALL, vs=":master")
@nanosoldier runtests()
Can we do both in the same comment?
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Nanosoldier report looks fine - some suspicious numbers for BitSet
but I can't see how that could be related.
Those are a couple benchmarks going from 10ns on master to 20ns and the rest of the benchmarks on BitSet
seem fine (https://tealquaternion.camdvr.org/compare.html?end=54002&showRawData=true&nonRelevant=true&name=BitSet), so probably just noise.
The package evaluation job you requested has completed - possible new issues were detected. The full report is available.
There are some legitimate package errors in there: PyCall, StringViews, StrBase, Base58. Small enough that I could probably make PRs to all the packages to fix them, but then again, this might also mean this change causes issues in packages not in PkgEval.
This PR is too breaking.
However, we have a codeunits
function. We should, instead, require that strings with non-contiguous memory implement the codeunits
function and return something that is not a DenseVector
(or if it is, make sure it is a valid DenseVector
). This would be a small breaking documentation change.
(In 2.0?, we should probably just? remove DenseVector and make it a trait/interface)
There are some legitimate package errors in there:
It is not enough to look at if a package tests starts to fail based on a change to call it a "legitimate package error". Many times, packages directly test the buggy behavior in the test itself which causes the error while the package code itself is completely fine. For example:
StrBase is literally a direct call to strides
on a CodeUnits
: https://github.com/JuliaString/StrBase.jl/blob/90ac50265de0f356f414b9d493140fc724b7ae9d/test/basic.jl#L882 so that isn't breaking anything, it's just a nonsense test that tests the very thing that is being changed here.
Or StringViews.jl is directly testing the change that is done here: https://github.com/JuliaStrings/StringViews.jl/blob/f94b09e5f5056ab17a3f50490e500aee4476479d/test/runtests.jl#L35
PyCall is also just a test that isn't testing any real functionality in the package https://github.com/JuliaPy/PyCall.jl/blob/2f600fbebee50ab0672e153455e3c0fda1694fba/test/runtests.jl#L563
I guess the people calling this "too breaking" just looked at the list of PkgEval results and decided it based upon that but as someone who has to do this a lot of releases, that is insufficient to make any decision at all.
I would say that the current behavior is buggy, it should be changed, and the packages just need to tweak their tests a bit (not change any code in their package) to pass with this.
Some of these failures are significant. The failures you link to, which seem less significant, are just the first to fail in their test suite.
- For PyCall,
pybytes
is a user-facing function that no longer work on codeunits. Hence, this will break scripts that callpybytes
to pass it into Python. This is shown as an example on the README.md of PyCall, so I'd call this breakage. - For
StringView
s, a large part of the functionality of the whole package is defined as methods with::DenseStringView
, which relies on codeunits being a subtype of dense vector. I.e. if I remove the test you link to here, then it just fails with another test (concretely, regex matching then breaks, because it's only defined forDenseStringViewAndSub
)
StrBase though, does pass its tests when removing that one test for strides
.
I agree with Kristoffer (I can't make that triage time :( ). This is a bug, not a behaviour change we are doing willy nilly, if someone expects codeunits do be a dense vector and it's not that's UB, i.e passing it to a C function which then might segfault.
is there some reason triage's suggested fix wouldn't also fix the bug?
IMO triages change is more breaking and more subtle because it seems it would basically that that codeunits doesn't take an AbstractString as it's argument but something like a AbstractDenseString
That seems less likely to actually break existing code -- is there even any extant nontrivially used AbstractString
type that does not store its contents densely?
I mean, the point of this PR was precisely to do this experiment. It is one possible fix, but not necessarily the fix.
is there even any extant nontrivially used
AbstractString
type that does not store its contents densely
Yes. The issue in #53996 is InlineStrings.jl — those are backed by non-mutable structs that don't necessarily land on the heap with a memory address. They do have support for grabbing a pointer through a Ref{<:InlineString}
, but it's not a c-compatible one: it's backwards (stride -1), the pointer
should really be offset by N for each InlineN, and it's not necessarily null terminated. It's at that c boundary that's at the crux of where many of these failures are coming from, actually.
It's very much worth noting that CodeUnits{T, <:InlineString}
is currently broken (and broken loudly) when trying to use codeunits as a dense vector:
julia> pointer(codeunits(inline"hello"))
ERROR: MethodError: no method matching unsafe_convert(::Type{Ptr{UInt8}}, ::String7)
so the triage suggested change would be to define codeunits(s::InlineString) = codeunits(string(s))
or something like that.
Looking through current uses of the type specifically, it seems like CodeUnits
has two purposes — or, rather, is commonly used in two contexts. It's either used to make a string act like an Julia vector (e.g., CSV does this) or it's used as a Cstring
and passed to C (or other languages like PyCall). The latter case doesn't just require it to be a DenseVector
, it needs to be stride 1, null-terminated, with UInt8
code units. And CodeUnits
doesn't promise any of that, but folks are using it like it does with varying levels of additional checks that (often partially) assert (some of) those preconditions.
It really seems like packages are using CodeUnits
as a way to launder strings into c-compatible strings on their way to a ccall, when they might as well just pass the string itself to ccall
directly. Or they may even assume that it's necessary. I get it, it really looks like I can pass codeunits("hello")
and even codeunits(inline"hello")
to C. The vibes are great, and it is a nice way to distinguish between APIs like separating a string file name from string data, which does seem to get tangled up in the process.
So that's at the root of the question, really. Is CodeUnits
a C-compatible vector like char*? Or is it a Julia vector? I don't think it's worth keeping it a DenseVector
if we don't additionally promise its pointer
can act like a Cstring
.