cockroach icon indicating copy to clipboard operation
cockroach copied to clipboard

opt: FoldAssignmentCast ignores transaction errors

Open fqazi opened this issue 3 years ago • 9 comments

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it: Currently, if a transaction error is hit inside FoldAssignmentCast when calling PerformAssignmentCast the errors are silently swallowed. This can happen when dealing with OIDs which will potentially involve running SQL using the internal executor. When this scenario is encountered we continue execution like normal with a potentially aborted transaction causing failures in other code paths.

To Reproduce

The original issue was reproduced on the randomized schema changer workload, when inserts with OID casts are executed. However, this issue can be easily faked by forcing a transaction to abort when an OID involved in an insert is being resolved.

Jira issue: CRDB-18380

fqazi avatar Aug 05 '22 18:08 fqazi

@fqazi Do you have reproduction steps for this?

mgartner avatar Aug 09 '22 18:08 mgartner

Yahor says it's difficult to reproduce, but it's possible to fake an error to see the problems this creates.

mgartner avatar Aug 09 '22 18:08 mgartner

The transaction retry is hard to repro standalone if you take the ResolveOID* and inject a transaction retry error or generate a real one with a paused thread. If we need to validate the fix the random-load variant of the schema changer workload is pretty easy to run, and we can see it in 10 minutes at most.

fqazi avatar Aug 09 '22 19:08 fqazi

https://github.com/cockroachdb/cockroach/blob/31ffb42b585d0b196a438cddfdae56b28ff0e674/pkg/sql/opt/norm/fold_constants_funcs.go#L397-L400

ajwerner avatar Aug 10 '22 20:08 ajwerner

Great find! This is a bad one. I want to take some time to audit all these casts - any cast that has to read some mutable state should not be folded during normalization. @ajwerner and I also discussed how labelling these casts as IMMUTABLE is deceiving at best, and incorrect at worst. I'll have to think more about these cases. This relates to #74470.

mgartner avatar Aug 11 '22 00:08 mgartner

any cast that has to read some mutable state should not be folded during normalization

I remain meh on this take. I suspect that folding regclass casts is probably important for a bunch of queries. I think that for all intents and purposes, these casts are stable and for the usage of the memo, are even immutable. Ideally we'd just propagate errors which relate to concurrency control. I know that's a bit tricky.

ajwerner avatar Aug 11 '22 00:08 ajwerner

Alright in light of #74470, I agree that immutable is not fitting. Stable is fitting and folding these stable values may be important.

ajwerner avatar Aug 11 '22 00:08 ajwerner

I thought we had a linter for not returning errors (returnerrcheck)? But it looks like this problem (i.e. not returning an error that could have caused the txn to abort) has been happening in a few places.

rafiss avatar Aug 11 '22 03:08 rafiss

The returnerrcheck linter doesn't do anything if the function doesn't itself return an error. The optimizer doesn't really return errors, it panics them, so the linter does not trigger.

ajwerner avatar Aug 11 '22 03:08 ajwerner

The returnerrcheck linter doesn't do anything if the function doesn't itself return an error. The optimizer doesn't really return errors, it panics them, so the linter does not trigger.

I see, thanks.


So do we know if this issue is another cause of https://github.com/cockroachdb/cockroach/issues/82034 and https://github.com/cockroachdb/cockroach/issues/80764?

rafiss avatar Aug 11 '22 16:08 rafiss

So do we know if this issue is another cause of https://github.com/cockroachdb/cockroach/issues/82034 and https://github.com/cockroachdb/cockroach/issues/80764?

Yes, I believe we do know that this is a cause of #80764 (https://github.com/cockroachdb/cockroach/issues/80764#issuecomment-1206720146).

ajwerner avatar Aug 11 '22 16:08 ajwerner

@mgartner regarding your earlier comment:

I want to take some time to audit all these casts - any cast that has to read some mutable state should not be folded during normalization. @ajwerner and I also discussed how labelling these casts as IMMUTABLE is deceiving at best, and incorrect at worst. I'll have to think more about these cases.

does that work you're describing to audit all the casts need to happen before making the change to stop swallowing the error at https://github.com/cockroachdb/cockroach/blob/31ffb42b585d0b196a438cddfdae56b28ff0e674/pkg/sql/opt/norm/fold_constants_funcs.go#L397-L400

rafiss avatar Sep 01 '22 20:09 rafiss

No, I can fix that now, I think.

mgartner avatar Sep 01 '22 22:09 mgartner

I'll point out that the pattern of swallowing an evaluation error is used in other cases:

https://github.com/cockroachdb/cockroach/blob/31ffb42b585d0b196a438cddfdae56b28ff0e674/pkg/sql/opt/norm/fold_constants_funcs.go#L469-L472

https://github.com/cockroachdb/cockroach/blob/31ffb42b585d0b196a438cddfdae56b28ff0e674/pkg/sql/opt/norm/fold_constants_funcs.go#L514

https://github.com/cockroachdb/cockroach/blob/31ffb42b585d0b196a438cddfdae56b28ff0e674/pkg/sql/opt/norm/fold_constants_funcs.go#L618-L621

The reason we can do this is because we only try to evaluate immutable or stable expressions. I am hesitant to change all the places where we swallow errors - folding expressions is best-effort, and if it fails for any reason, we'd prefer to continue with optimization and leave the expression as-is. So I think changing this assignment cast case is a temporary solution - ideally we can fix the volatility of the cast and then revert to swallowing errors.

mgartner avatar Sep 01 '22 22:09 mgartner

An approach we've taken elsewhere is to classify the errors which can be swallowed safely. We don't have a unified way to do this presently. In the cases I'm thinking of, we literally passed a boolean with the error. That wasn't great.

ajwerner avatar Sep 02 '22 01:09 ajwerner

Looks like this was fixed by #87614.

mgartner avatar Jun 08 '23 16:06 mgartner