rust New panics from sort detecting Ord/PartialOrd violations

trafficstars

1.81 brings the new sort implementations, added in https://github.com/rust-lang/rust/pull/124032/. The new implementations will panic if they detect that Ord doesn't deliver a total order (with new # Panics sections added to documentation:

May panic if the implementation of Ord for T does not implement a total order.

These panics represent "bugs" in user code, but it seems plausible that the new source of panics in user programs may be rather unexpected. At minimum, it seems likely that we want some mention of this in the compatibility notes section of 1.81's release notes, but I'm not sure how to best phrase it; "total order violations" is probably considerably too opaque for most of our audience.

I didn't see any discussion of these panics in the PR landing this work, though it may have happened elsewhere, so opening this to (a) make T-libs-api aware, if they weren't already and (b) ask for guidance on the right way to communicate this and any longer-form advice we can recommend/provide on what to do when such a panic is encountered. (Note that this affected at least one sort in our own code, encountered during the bootstrap bump: https://github.com/rust-lang/rust/pull/128083#issuecomment-2245145747).

I'd also love some clarity on the PR description's phrasing of "detecting strict weak ordering violations with a high chance" -- what does high chance here mean?

cc @Voultapher @orlp

Aug 25 '24 14:08 Mark-Simulacrum

These panics represent "bugs" in user code, but it seems plausible that the new source of panics in user programs may be rather unexpected.

I mean, you can remove the quotes, they can only occur if the user implemented Ord in a way that violates the requirements specified in the Ord docs.

I'd also love some clarity on the PR description's phrasing of "detecting strict weak ordering violations with a high chance" -- what does high chance here mean?

I believe @Voultapher wrote that, and that he meant 'with high probability'. That said, I don't think we've ever properly evaluated the probability of the panic occurring if there's a bug in the Ord implementation, and obviously not all bugs have the same chance. So I think it should probably be reworded to "there's a good chance" to highlight the informality of that claim.

In essence, if your Ord implementation is incorrect, there's a good chance that the new sorting algorithm will end up in a state where we can't progress the sort due to an invariant being broken. We had two choices in this scenario: return knowingly unsorted data, or throw a panic. We chose the latter.

I do remember asking people for advice which of the two we should choose, but not where or if the entire T-libs-api team was involved. At least @thomcc and @workingjubilee are aware, from the top of my head.

any longer-form advice we can recommend/provide on what to do when such a panic is encountered

The advice is to fix the Ord implementation so that it provides a proper total order. I'm not sure what the best method is to teach people what that means, or what the most common mistakes are. At the very least you can mention that the operation needs to be transitive, so if you return a < b and b < c you better make sure a < c.

Aug 25 '24 15:08 orlp

There is an open PR https://github.com/rust-lang/rust/pull/129003 that aims to improve the Ord documentation, including examples of common anti-patterns found in real code that gets Ord wrong.

Aug 25 '24 16:08 Voultapher

"total order violations" is probably considerably too opaque for most of our audience.

"failure to comply with the requirements of Ord". "failure to correctly implement Ord". "incorrectly implementing Ord".

The first time that the relnotes mention these requirements, they should also mention something like "and PartialOrd (which must be consistent with Ord)".

Aug 25 '24 17:08 workingjubilee

This has been included in the release notes and libs is satisfied with the wording, closing.

Sep 03 '24 16:09 Amanieu

FWIW I'm a bit surprised by this new behavior and worried that it will cause quite a few crashes in production where it was previously just returning buggy output.

Did we consider shipping with a debug-only assertion for a couple of versions before going into full panic mode?

Sep 06 '24 06:09 Turbo87

@Turbo87 I am not sure if you are aware of the following, so apologies if I am belaboring you with things you already know:

We cannot enable normal cfg!(debug_assertions) in core/std due to its precompiled nature. At least, not ones that are visible to people compiling code by linking std in, instead of, well, compiling std itself. The form of debug_assert! in std that works in a way that would make it useful for what you describe is modeled by assert_unsafe_precondition!. It is... not as nice as normal panic!s due to a few details, like how it simply aborts, and because not everything relevant has #[track_caller] to show you what caused the issue because of binary size / compile time concerns. It is easy for a situation to become complex enough to make it hard to hunt down the crashes caused by assert_unsafe_precondition!, much harder than any ordinary panic message. I once spent a week just trying to tease apart a crash introduced by such because it happened under a few different layers.

I do not mean to bag on any of Saethlin's work by saying this! We've wanted this power for some time, and we're rapidly getting comfortable with the idea it is an option, but it is still not a peer option with "simply put an assert! in the code" in terms of diagnostics. I believe we can rapidly make this much better if we have a mind to do so (we may have to decide to accept other tradeoffs, like binary size, compile time, etc.). I am just saying that assert_unsafe_precondition! in its current form is mostly justified by the fact that the alternative is compiling code with undefined behavior, which is capable of doing far more than just "buggy output".

So, my personal opinion is if I were maintaining a codebase that was gotcha'd by this, I'd rather just get the current situation. Maybe it's a punch in the mouth, but at least I could clearly see it coming.

Sep 06 '24 07:09 workingjubilee

I see, thanks for the explanation! :)

Sep 06 '24 07:09 Turbo87

I am maintaining a codebase that kinda intentionally abused this. The crate in question is insta which just needed some mostly stable order for snapshots. I was assuming that the incorrect implementation from Ord is not an issue because sort_by also only documented that at worst what happens as a violation from total order is bad output, but not crashes.

I am obviously going to change this, but from my experience there is a lot of code that relies on this behavior (for not ideal reasons). From my experience a lot of this comes from PartialOrd being able to do something that you cannot do yourself today (getting the discriminant of an enum that also implements Ord. See https://github.com/rust-lang/rust/pull/106418)

I do think that it's acceptable to make this change but I believe that at least the change to sort_by is a break in backwards compatibility. The function did not panic before, was not documented as panic and did not imply that the behavior could change in this way.

Old documentation:

The comparator function must define a total ordering for the elements in the slice. If the ordering is not total, the order of the elements is unspecified.

New documentation:

Panicks: May panic if compare does not implement a total order.

From where I'm standing this is a backwards incompatible change that I did not expect.

Sep 06 '24 09:09 mitsuhiko

@mitsuhiko If we're being pedantic, you were violating the contract as well in the old wording. It doesn't say "you may use a non-total ordering, in such a scenario the order is unspecified", it says "the comparator function must define a total ordering for the elements in the slice".

How I see it is that you were always using library UB, and the order of the elements being unspecified was just a clarification on what this could entail, not a guarantee that the library UB will always solely consist of this effect.

Sep 06 '24 09:09 orlp

Arguably the relevant docs are those for the trait Ord because that's what's being implemented incorrectly:

Violating these requirements is a logic error. The behavior resulting from a logic error is not specified, but users of the trait must ensure that such logic errors do not result in undefined behavior. This means that unsafe code must not rely on the correctness of these methods.

So it's not completely unreasonable to say users could have known that they are not getting some guaranteed behavior when violating Ord.

Sep 06 '24 09:09 Voultapher

@Voultapher The quoted docs was from sort(_unstable)_by, not from Ord.

Sep 06 '24 09:09 orlp

@orlp I know, they were added in 7e57f0a6a8fd4e5df7890613918f4a2c3b5a1fb7, my point is that users had a chance to read the Ord docs when implementing Ord and realizing they were relying on unspecified behavior.

Sep 06 '24 09:09 Voultapher

For what it's worth, if we're being extra pedantic, select_unstable_by never mentioned requiring a total order (it didn't really state what the comparison function should do at all), and it can panic now as well, I believe.

Sep 06 '24 09:09 orlp

I don't care much about the Ord part here. I think that is fair and square a failure of the library implementing it incorrectly. The trait was reasonable here.

My point is on sort_by. While it did say "must define a total ordering" it also documented what happens if that does not happen, and I know that I wrote a lot of code in the past that was okay with that bad ordering but would not have been with panicking.

If you look for sort_by paired with unwrap_or(Ordering::*) you will encounter countless of pieces of code on GitHub that used that. The originator of this obviously primarily being caused by f32/f64 in the mix.

Sep 06 '24 09:09 mitsuhiko

@mitsuhiko For the record, I understand your frustration, and I was just being pedantic. It's not like we actively sought out to insert a panic for bad ordering, but we ended up in a code path in our implementation where it would not be sound to continue the sort if we detected an invariant was not kept after a loop, which could only occur if the order was not total.

At this point we had two options, have a code path in the sort that with full knowledge silently returns badly ordered data to the user, or panic. I initially chose the latter for glidesort, and we kept it for the sort implementations here. We did ask for advice whether we should keep it as a panic, and it was suggested we did. The libs team also signed off on it.

If you look for sort_by paired with unwrap_or(Ordering::*) you will encounter countless of pieces of code on GitHub that used that.

And they are all incorrect. I'm sure some uses of this pattern were done with full knowledge that (e.g. in the context of floats) if there exists a single NaN then the entire ordering is unspecified, but I doubt it's even a majority. I expect that most users don't even realize it's wrong, or if they do, that they believe only the position of NaNs would be corrupted rather than the entire order.

Sep 06 '24 09:09 orlp

@mitsuhiko I am kind of curious about this:

I am maintaining a codebase that kinda intentionally abused this. The crate in question is insta which just needed some mostly stable order for snapshots.

What was the intentional total order violation, and what is the "mostly stable order" you expected?

Sep 06 '24 09:09 orlp

If you look for sort_by paired with unwrap_or(Ordering::*) you will encounter countless of pieces of code on GitHub that used that. The originator of this obviously primarily being caused by f32/f64 in the mix.

Going at it a little more empirically, using grep.app to get some idea of the scale of the problem, the baseline search .sort_by plus sort_unstable_by gives us ~11k matches. Refining it down to sort_(unstable_)by\(.*\n*.*unwrap_or gives us 49 matches.

For example https://github.com/pykeio/ort/blob/60f6ecae9064a75a474678998af42ea87067b1b1/examples/training/examples/train-clm-simple.rs#L129

They want to sort f32, which hopefully with the new docs as part of https://github.com/rust-lang/rust/pull/128273 should point them to total_cmp in the future. If their code didn't run into any NaNs they get the same behavior as before, if they did their results were bogus and it's not clear whether the authors didn't care or didn't notice any issues arising from it.

In effect we have ~0.4% of the _by calls being affected of which the majority is problematic floating point usage.

Sep 06 '24 09:09 Voultapher

@mitsuhiko could you please elaborate why you think the PartialOrd and Ord docs don't apply in your case, looking at the code that you had to change https://github.com/mitsuhiko/insta/pull/586/files#diff-5b1e28e66bc6cd3e0264e8b5c8cb2e91141d05e549f3e1404f220847c0dd849dL26 suggestsOrd was implemented incorrectly in the past in the insta crate, so how come we are talking about sort_by?

Sep 06 '24 09:09 Voultapher

@orlp

How I see it is that you were always using library UB, and the order of the elements being unspecified was just a clarification on what this could entail, not a guarantee that the library UB will always solely consist of this effect.

(Note that everything in the public API here in question is safe functions and safe traits. There is no way to invoke UB using only safe functions and safe traits, or it is a bug in the library. If by "library UB" you mean something other than "misuse of a library that can lead to (language) UB but is not necessarily immediate (language) UB" (e.g. non-UTF-8 data in a str) then that could perhaps be clarified.)

The prior documentation of sort_by does not read to me as saying "this is what could happen", it reads to me as saying "this is what will happen". That said, yeah this is a minor issue and overall probably a better API before, so if the change was approved with knowledge that it might be breaking then 🤷.

Sep 06 '24 09:09 zachs18

@zachs18 By "library UB" I mean violating the requirements of the standard library, in which case in principle any sound behavior is permitted (or any behavior at all for unsafe functions, if not specified otherwise). E.g. violating Borrows requirements for std::collections::HashMap could result in wrong items being returned, or internal data structure panics to be generated.

Sep 06 '24 09:09 orlp

@mitsuhiko could you please elaborate why you think the PartialOrd and Ord docs don't apply in your case

As I said before, the implementation on Ord was objectively wrong and I do not dispute this. I am talking about the changes to sort_by and friends. insta still has another issue related to this I did not fix either which I still will need to (https://github.com/mitsuhiko/insta/issues/587).

I'm happy if I'm the only idiot that did this wrong :)

What was the intentional total order violation, and what is the "mostly stable order" you expected?

In my case anything other than a panic. You would have noticed that the snapshots were probably somewhat unstable. There are a few things that make snapshots unstable and I deal with them as they appear.

My general expectation would have been arbitrary order since this is also what other languages do. I'm generally not aware of a sufficiently high level language (please ignore C++ here :)) that errors if a comparator does not have total order but maybe I missed it. Usually you just get nonsense. Given how rare NaNs show up I think in practice you will most likely run into those panics very infrequently and end up surprised.

Sep 06 '24 10:09 mitsuhiko

Genuine question, why sort at all when arbitrary order is fine?

Sep 06 '24 11:09 Voultapher

Genuine question, why sort at all when arbitrary order is fine?

I don't want to pile up on this too much but in an ideal world the way sorting would work is that you have a fallible operation you can respond to when it fails. Both random order and panics are bad because they happen rarely so you will have written code that is buggy for a long time until it starts doing the wrong thing once by either crashing or doing the wrong thing. A fallible operation makes it explicit what went wrong.

Cases where I think crashing is worse than just not sorting would be computer games. Quite often in simulations you end up getting a NaN for a frame and hobbling onwards might work out better. Floats are pesky and odd stuff will happen when I try to do something with a sorted list but it does not end up sorted, but often it will repair itself next frame.

It's not so much that it can be in arbitrary order in all times, but the failure more of it just being not sorted is preferable to outright crashing immediately.

To be clear: in general I'm not against this change, I just was surprised that it was not seen as backwards incompatible as I would assume quite a few people took it for granted that at worse you get a bad sorting order, and not a panic.

Insta is a special case I don't think is worth thinking too much about. It's a bug that needs fixing and in part is fixed, it just happens to be the crate I remembered as being affected so that's where I started.

Sep 06 '24 12:09 mitsuhiko

For what it's worth, if it turns out the 'panic on invalid Ord' behavior is unwanted due to feedback of e.g. game developers as you mention, we can always turn it off again. The implementation only guarantees that it may panic on invalid Ord, and 'never' is a subset of may.

I'd like to think we did our due diligence but perhaps we should've done more. Perhaps there could be a better formal process in the future for adding panics to existing function calls, although I guess it doesn't come up a lot as it's a very rare scenario in which it could even be considered.

Sep 06 '24 12:09 orlp

I see, thanks for explaining. In my experience, coming from C++ it's a breath of fresh air that Rust prides itself on being explicit and surfacing bugs to users earlier than later, and while a compile-time error for bad Ord implementations would be my preferred option, it's not clear how we would get there.

The word "crash" is a little unfortunate in this conversation, as it usually implies UB and a segfault, panics can be recovered from and don't necessarily kill a process.

Sep 06 '24 12:09 Voultapher

We add new panics all the time to the standard library. To not have the freedom to do so would prevent internal maintenance, nevermind actually reviewing issues with the API and solving them. I do not believe anything the standard library documents should be considered as a promise not to panic.

I say this because there are quite a number of functions in the standard library that panic on certain inputs that do not document that they do so because what happens is they call code that then decides to panic on that input. We don't necessarily consider these calls to be stable interface details, so we don't document that panic.

To believe that we have even documented all existing panics is incorrect.

Sep 06 '24 17:09 workingjubilee

My general expectation would have been arbitrary order since this is also what other languages do. I'm generally not aware of a sufficiently high level language (please ignore C++ here :)) that errors if a comparator does not have total order but maybe I missed it. Usually you just get nonsense. Given how rare NaNs show up I think in practice you will most likely run into those panics very infrequently and end up surprised.

Explicitly throws:

Merely ominously "requires":

go's slices.SortFunc

Sep 06 '24 17:09 workingjubilee

I think it's fine to panic, but I suspect that this change will be incredibly invasive for quite a while. On the one hand because of a lot of pre-existing code. Ruffle for instance recompiled with the latest version of rustc and seems to already cause failures in production: https://github.com/ruffle-rs/ruffle/issues/17785

@workingjubilee since Java and C# defines total ordering for floats for a while I suppose I did not run into that. However for C# in particular I'm fairly certain that I have implemented non total compare order functions before and I don't think I have every run into an exception being thrown. Again, my memory might be hazy, but I have been writing code for .NET for quite a few years on and off and I definitely do not remember getting exceptions from sort.

In general creating non total order functions is very simple so I would expect this to come up more more commonly if throwing exceptions/failing would be common. I do absolutely remember bad sort results a ton.

But just to I guess hammer down this point: I'm not suggesting this should be changed, at least I do not have a strongly held opinion. I'm primarily surprised about how swift this was changed without much of a discussion ahead of it as I do expect regressions through this change.

Sep 06 '24 18:09 mitsuhiko

Hm. I'm curious, @mitsuhiko, did you primarily use List.Sort() or one of the other Sort overloads that accept IComparer or Comparison? They document different errors (notably, the nullary one doesn't) in the case we're discussing.

Sep 06 '24 19:09 workingjubilee

Hm. I'm curious, @mitsuhiko, did you primarily use List.Sort() or one of the other Sort overloads that accept IComparer or Comparison? They document different errors (notably, the nullary one doesn't) in the case we're discussing.

I think both. I looked into this a bit more now out of curiosity and the ones with comparer can definitely fail. The sort code in roslyn basically appears to just catch when the sort algorithm would attempt to index out of bounds and then raises an "comparer is bogus" error. So I guess the situation is similar to here.

Sep 07 '24 13:09 mitsuhiko

rust rust copied to clipboard

New panics from sort detecting Ord/PartialOrd violations

rust
rust copied to clipboard