cockroach
cockroach copied to clipboard
opt: FoldAssignmentCast ignores transaction errors
Describe the problem
Please describe the issue you observed, and any steps we can take to reproduce it:
Currently, if a transaction error is hit inside FoldAssignmentCast when calling PerformAssignmentCast the errors are silently swallowed. This can happen when dealing with OIDs which will potentially involve running SQL using the internal executor. When this scenario is encountered we continue execution like normal with a potentially aborted transaction causing failures in other code paths.
To Reproduce
The original issue was reproduced on the randomized schema changer workload, when inserts with OID casts are executed. However, this issue can be easily faked by forcing a transaction to abort when an OID involved in an insert is being resolved.
Jira issue: CRDB-18380
@fqazi Do you have reproduction steps for this?
Yahor says it's difficult to reproduce, but it's possible to fake an error to see the problems this creates.
The transaction retry is hard to repro standalone if you take the ResolveOID* and inject a transaction retry error or generate a real one with a paused thread. If we need to validate the fix the random-load variant of the schema changer workload is pretty easy to run, and we can see it in 10 minutes at most.
https://github.com/cockroachdb/cockroach/blob/31ffb42b585d0b196a438cddfdae56b28ff0e674/pkg/sql/opt/norm/fold_constants_funcs.go#L397-L400
Great find! This is a bad one. I want to take some time to audit all these casts - any cast that has to read some mutable state should not be folded during normalization. @ajwerner and I also discussed how labelling these casts as IMMUTABLE is deceiving at best, and incorrect at worst. I'll have to think more about these cases. This relates to #74470.
any cast that has to read some mutable state should not be folded during normalization
I remain meh on this take. I suspect that folding regclass casts is probably important for a bunch of queries. I think that for all intents and purposes, these casts are stable and for the usage of the memo, are even immutable. Ideally we'd just propagate errors which relate to concurrency control. I know that's a bit tricky.
Alright in light of #74470, I agree that immutable is not fitting. Stable is fitting and folding these stable values may be important.
I thought we had a linter for not returning errors (returnerrcheck)? But it looks like this problem (i.e. not returning an error that could have caused the txn to abort) has been happening in a few places.
The returnerrcheck linter doesn't do anything if the function doesn't itself return an error. The optimizer doesn't really return errors, it panics them, so the linter does not trigger.
The returnerrcheck linter doesn't do anything if the function doesn't itself return an error. The optimizer doesn't really return errors, it panics them, so the linter does not trigger.
I see, thanks.
So do we know if this issue is another cause of https://github.com/cockroachdb/cockroach/issues/82034 and https://github.com/cockroachdb/cockroach/issues/80764?
So do we know if this issue is another cause of https://github.com/cockroachdb/cockroach/issues/82034 and https://github.com/cockroachdb/cockroach/issues/80764?
Yes, I believe we do know that this is a cause of #80764 (https://github.com/cockroachdb/cockroach/issues/80764#issuecomment-1206720146).
@mgartner regarding your earlier comment:
I want to take some time to audit all these casts - any cast that has to read some mutable state should not be folded during normalization. @ajwerner and I also discussed how labelling these casts as IMMUTABLE is deceiving at best, and incorrect at worst. I'll have to think more about these cases.
does that work you're describing to audit all the casts need to happen before making the change to stop swallowing the error at https://github.com/cockroachdb/cockroach/blob/31ffb42b585d0b196a438cddfdae56b28ff0e674/pkg/sql/opt/norm/fold_constants_funcs.go#L397-L400
No, I can fix that now, I think.
I'll point out that the pattern of swallowing an evaluation error is used in other cases:
https://github.com/cockroachdb/cockroach/blob/31ffb42b585d0b196a438cddfdae56b28ff0e674/pkg/sql/opt/norm/fold_constants_funcs.go#L469-L472
https://github.com/cockroachdb/cockroach/blob/31ffb42b585d0b196a438cddfdae56b28ff0e674/pkg/sql/opt/norm/fold_constants_funcs.go#L514
https://github.com/cockroachdb/cockroach/blob/31ffb42b585d0b196a438cddfdae56b28ff0e674/pkg/sql/opt/norm/fold_constants_funcs.go#L618-L621
The reason we can do this is because we only try to evaluate immutable or stable expressions. I am hesitant to change all the places where we swallow errors - folding expressions is best-effort, and if it fails for any reason, we'd prefer to continue with optimization and leave the expression as-is. So I think changing this assignment cast case is a temporary solution - ideally we can fix the volatility of the cast and then revert to swallowing errors.
An approach we've taken elsewhere is to classify the errors which can be swallowed safely. We don't have a unified way to do this presently. In the cases I'm thinking of, we literally passed a boolean with the error. That wasn't great.
Looks like this was fixed by #87614.