julia icon indicating copy to clipboard operation
julia copied to clipboard

Switch supertype of CodeUnits from DenseVector to AbstractVector

Open jakobnissen opened this issue 3 months ago • 20 comments

This is a breaking change, but it was a bug that it subtyped DenseVector previsouly, as there is no guarantee that CodeUnits are stored densely in memory.

Test failures currently seem unrelated

See #53996

jakobnissen avatar Apr 09 '24 07:04 jakobnissen

As mentioned in #53996, since this is a breaking bugfix, this needs a PkgEval + Nanosoldier run in case someone relied on its supertype. Depending on the results of these runs, it may be triage worthy.

jakobnissen avatar Apr 09 '24 07:04 jakobnissen

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier runtests()

Can we do both in the same comment?

mbauman avatar Apr 09 '24 13:04 mbauman

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

nanosoldier avatar Apr 09 '24 21:04 nanosoldier

Nanosoldier report looks fine - some suspicious numbers for BitSet but I can't see how that could be related.

jakobnissen avatar Apr 10 '24 07:04 jakobnissen

Those are a couple benchmarks going from 10ns on master to 20ns and the rest of the benchmarks on BitSet seem fine (https://tealquaternion.camdvr.org/compare.html?end=54002&showRawData=true&nonRelevant=true&name=BitSet), so probably just noise.

Zentrik avatar Apr 10 '24 08:04 Zentrik

The package evaluation job you requested has completed - possible new issues were detected. The full report is available.

nanosoldier avatar Apr 10 '24 12:04 nanosoldier

There are some legitimate package errors in there: PyCall, StringViews, StrBase, Base58. Small enough that I could probably make PRs to all the packages to fix them, but then again, this might also mean this change causes issues in packages not in PkgEval.

jakobnissen avatar Apr 10 '24 12:04 jakobnissen

This PR is too breaking.

However, we have a codeunits function. We should, instead, require that strings with non-contiguous memory implement the codeunits function and return something that is not a DenseVector (or if it is, make sure it is a valid DenseVector). This would be a small breaking documentation change.

(In 2.0?, we should probably just? remove DenseVector and make it a trait/interface)

LilithHafner avatar Apr 25 '24 03:04 LilithHafner

There are some legitimate package errors in there:

It is not enough to look at if a package tests starts to fail based on a change to call it a "legitimate package error". Many times, packages directly test the buggy behavior in the test itself which causes the error while the package code itself is completely fine. For example:

StrBase is literally a direct call to strides on a CodeUnits: https://github.com/JuliaString/StrBase.jl/blob/90ac50265de0f356f414b9d493140fc724b7ae9d/test/basic.jl#L882 so that isn't breaking anything, it's just a nonsense test that tests the very thing that is being changed here.

Or StringViews.jl is directly testing the change that is done here: https://github.com/JuliaStrings/StringViews.jl/blob/f94b09e5f5056ab17a3f50490e500aee4476479d/test/runtests.jl#L35

PyCall is also just a test that isn't testing any real functionality in the package https://github.com/JuliaPy/PyCall.jl/blob/2f600fbebee50ab0672e153455e3c0fda1694fba/test/runtests.jl#L563

I guess the people calling this "too breaking" just looked at the list of PkgEval results and decided it based upon that but as someone who has to do this a lot of releases, that is insufficient to make any decision at all.

I would say that the current behavior is buggy, it should be changed, and the packages just need to tweak their tests a bit (not change any code in their package) to pass with this.

KristofferC avatar Apr 25 '24 10:04 KristofferC

Some of these failures are significant. The failures you link to, which seem less significant, are just the first to fail in their test suite.

  • For PyCall, pybytes is a user-facing function that no longer work on codeunits. Hence, this will break scripts that call pybytes to pass it into Python. This is shown as an example on the README.md of PyCall, so I'd call this breakage.
  • For StringViews, a large part of the functionality of the whole package is defined as methods with ::DenseStringView, which relies on codeunits being a subtype of dense vector. I.e. if I remove the test you link to here, then it just fails with another test (concretely, regex matching then breaks, because it's only defined for DenseStringViewAndSub)

StrBase though, does pass its tests when removing that one test for strides.

jakobnissen avatar Apr 25 '24 11:04 jakobnissen

I agree with Kristoffer (I can't make that triage time :( ). This is a bug, not a behaviour change we are doing willy nilly, if someone expects codeunits do be a dense vector and it's not that's UB, i.e passing it to a C function which then might segfault.

gbaraldi avatar Apr 25 '24 11:04 gbaraldi

is there some reason triage's suggested fix wouldn't also fix the bug?

oscardssmith avatar Apr 25 '24 13:04 oscardssmith

IMO triages change is more breaking and more subtle because it seems it would basically that that codeunits doesn't take an AbstractString as it's argument but something like a AbstractDenseString

gbaraldi avatar Apr 25 '24 13:04 gbaraldi

That seems less likely to actually break existing code -- is there even any extant nontrivially used AbstractString type that does not store its contents densely?

brenhinkeller avatar Apr 25 '24 14:04 brenhinkeller

I mean, the point of this PR was precisely to do this experiment. It is one possible fix, but not necessarily the fix.

is there even any extant nontrivially used AbstractString type that does not store its contents densely

Yes. The issue in #53996 is InlineStrings.jl — those are backed by non-mutable structs that don't necessarily land on the heap with a memory address. They do have support for grabbing a pointer through a Ref{<:InlineString}, but it's not a c-compatible one: it's backwards (stride -1), the pointer should really be offset by N for each InlineN, and it's not necessarily null terminated. It's at that c boundary that's at the crux of where many of these failures are coming from, actually.

It's very much worth noting that CodeUnits{T, <:InlineString} is currently broken (and broken loudly) when trying to use codeunits as a dense vector:

julia> pointer(codeunits(inline"hello"))
ERROR: MethodError: no method matching unsafe_convert(::Type{Ptr{UInt8}}, ::String7)

mbauman avatar Apr 25 '24 15:04 mbauman

so the triage suggested change would be to define codeunits(s::InlineString) = codeunits(string(s)) or something like that.

oscardssmith avatar Apr 25 '24 17:04 oscardssmith

Looking through current uses of the type specifically, it seems like CodeUnits has two purposes — or, rather, is commonly used in two contexts. It's either used to make a string act like an Julia vector (e.g., CSV does this) or it's used as a Cstring and passed to C (or other languages like PyCall). The latter case doesn't just require it to be a DenseVector, it needs to be stride 1, null-terminated, with UInt8 code units. And CodeUnits doesn't promise any of that, but folks are using it like it does with varying levels of additional checks that (often partially) assert (some of) those preconditions.

It really seems like packages are using CodeUnits as a way to launder strings into c-compatible strings on their way to a ccall, when they might as well just pass the string itself to ccall directly. Or they may even assume that it's necessary. I get it, it really looks like I can pass codeunits("hello") and even codeunits(inline"hello") to C. The vibes are great, and it is a nice way to distinguish between APIs like separating a string file name from string data, which does seem to get tangled up in the process.

So that's at the root of the question, really. Is CodeUnits a C-compatible vector like char*? Or is it a Julia vector? I don't think it's worth keeping it a DenseVector if we don't additionally promise its pointer can act like a Cstring.

mbauman avatar Apr 25 '24 17:04 mbauman