vitess
vreplication: error out if table match fails
We had a legacy Materialize vreplication workflow running whose match rule referenced a table that no longer existed, but it happily continued to report that it was running without error. My expectation was that it would have errored out, reporting that the insert/update on the destination table had failed.
v13.0.0-14.0.0
@derekperkins can you please verify that it's not just constantly restarting and retrying? One of the main reasons why we capped the retry period when seeing the same error over and over again is that endless retrying would often mask permanent failures like this that require manual intervention to resolve (it seems like the workflow is always running because it's constantly retrying). We should see log messages if any errors occur. Thanks!
@derekperkins is this still an issue for you? I was going through existing issues and trying to clean things up.
I'm wondering if in this case the materialization was in the running phase, where we stream binlog events for the table. In that case there would be no binlog events for the table and thus nothing to stream, and the workflow would not be aware that the reason it's not getting any is that the table no longer exists on the source.
Does this line up with what you saw? I think we'd have to catch the DDL for the source table (which is generally ignored in the stream by default) and then somehow permanently error the workflow (so that it doesn't stop/start again where it's just waiting for more binlog events in the stream from the given position, of which there will be no more).
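To make that concrete, here is a rough, hypothetical sketch (not Vitess's actual code path, and the helper name is made up) of how a stream could verify its matched source tables up front and surface a permanent error instead of silently waiting for binlog events that will never arrive:

```go
package vreplcheck

import (
	"database/sql"
	"fmt"
)

// checkSourceTablesExist returns an error if any table matched by the
// workflow's filter rules no longer exists on the source. db is assumed to be
// an open connection to the source MySQL instance.
func checkSourceTablesExist(db *sql.DB, dbName string, tables []string) error {
	for _, table := range tables {
		var count int
		err := db.QueryRow(
			`SELECT COUNT(*) FROM information_schema.tables
			 WHERE table_schema = ? AND table_name = ?`,
			dbName, table,
		).Scan(&count)
		if err != nil {
			// A query failure is likely transient; let the normal retry logic handle it.
			return fmt.Errorf("checking table %s: %w", table, err)
		}
		if count == 0 {
			// The table is gone: no binlog events will ever arrive for it, so the
			// workflow should move to an Error state rather than keep reporting Running.
			return fmt.Errorf("source table %s.%s no longer exists; workflow cannot make progress", dbName, table)
		}
	}
	return nil
}
```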
I think your assessment is correct. After I noticed this, I just deleted the offending workflows, so I don't believe it's still an issue, though it's possible I haven't noticed new ones crop up. If it's not too hard to catch the DDL, that seems like a good solution, though I don't know how vital it is.
One of the main reasons why we capped the retry period when seeing the same error over and over again is that endless retrying would often mask permanent failures like this that require manual intervention to resolve (it seems like the workflow is always running because it's constantly retrying)
This actually got me today, as the state kept cycling back to RUNNING after binlogs had already been purged. That seems like a non-retriable error, at least locally, though it's probably worth trying again after a reparent. Not sure if that's worth a separate issue or not.
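For what it's worth, here is a hedged sketch of the kind of classification I mean, assuming go-sql-driver/mysql error types; the isRetriable helper is hypothetical and not Vitess's actual retry logic. MySQL reports purged/missing binlogs to a connecting replica as error 1236 (ER_MASTER_FATAL_ERROR_READING_BINLOG), which local retries won't fix, though a reparent to a host that still has the binlogs might:

```go
package retrypolicy

import (
	"errors"

	"github.com/go-sql-driver/mysql"
)

// isRetriable reports whether a replication stream error is worth retrying
// against the same source.
func isRetriable(err error) bool {
	var myErr *mysql.MySQLError
	if errors.As(err, &myErr) && myErr.Number == 1236 {
		// Error 1236: the requested binlog position has been purged on this
		// source. Retrying locally cannot succeed; a retry only makes sense
		// after a reparent to a replica that still has the required binlogs.
		return false
	}
	return true
}
```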