gpdb
Add throttle logic to slow down WAL writes in CATCHUP phase.
There was a GUC, repl_catchup_within_range, that "Sets the maximum number of xlog segments allowed to lag when the backends can start blocking despite the WAL sender being in catchup phase." It applied to the coordinator only. We can now use it on segments as well, but this mechanism is not sufficient if new WAL is generated too quickly: WAL catchup then needs more time, and the primary is at risk of data loss during that period because syncrep is off. So we additionally throttle in the large WAL write function to slow those transactions down a bit and leave more resources for WAL catchup.
xlog lag is calculated as GetFlushRecPtr() - XLogGetReplicationSlotMinimumLSN(). Note that XLogGetReplicationSlotMinimumLSN() is not updated very frequently. I started a thread about this on the pgsql-hackers mailing list, but it does not affect our patch much (it just slows down write queries further during WAL catchup).
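For readers who want the arithmetic spelled out, here is a minimal standalone sketch of the lag check and throttle decision described above. It is not the patch's code: `wal_ptr_t` stands in for `XLogRecPtr`, and `wal_lag_bytes`, `lag_within_range`, and `should_throttle_write` are hypothetical names. Dividing the byte lag by a segment size and throttling only records at or above a size threshold are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for XLogRecPtr: a 64-bit byte position in the WAL stream. */
typedef uint64_t wal_ptr_t;

/*
 * Hypothetical helper: lag in bytes between the flush position and the
 * minimum LSN required by replication slots. Clamped to 0 in case the
 * slot LSN is momentarily ahead of the flush position we sampled.
 */
static inline uint64_t
wal_lag_bytes(wal_ptr_t flush_ptr, wal_ptr_t slot_min_lsn)
{
    return flush_ptr > slot_min_lsn ? flush_ptr - slot_min_lsn : 0;
}

/*
 * Hypothetical helper: true when the lag, in whole segments, does not
 * exceed max_lag_segments (the role repl_catchup_within_range plays).
 * Assumption: lag is converted to segments by integer division.
 */
static inline bool
lag_within_range(wal_ptr_t flush_ptr, wal_ptr_t slot_min_lsn,
                 uint64_t seg_size, uint64_t max_lag_segments)
{
    assert(seg_size > 0);
    return wal_lag_bytes(flush_ptr, slot_min_lsn) / seg_size <= max_lag_segments;
}

/*
 * Hypothetical helper: slow a writer down only while the mirror is
 * catching up and the record is at least throttle_threshold bytes.
 */
static inline bool
should_throttle_write(bool mirror_catching_up, uint64_t record_len,
                      uint64_t throttle_threshold)
{
    return mirror_catching_up && record_len >= throttle_threshold;
}
```

Because the slot minimum LSN is refreshed infrequently, a computed lag like this can overstate the real lag, which matches the note above that writes are merely slowed down a little more than necessary.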
Co-authored-by: Haolin Wang [email protected]
Here are some reminders before you submit the pull request
- [ ] Add tests for the change
- [ ] Document changes
- [ ] Communicate in the mailing list if needed
- [ ] Pass make installcheck
- [ ] Review a PR in return to support the community
Started to review this PR. Will try to provide most of my comments by this week.
> Started to review this PR. Will try to provide most of my comments by this week.
Thanks!
I have updated the code to incorporate the review comments and ran a preliminary test against 6X_STABLE in the lab. The results are below:
- concurrency: 100 clients
- throttle record size: 1kB
- average insert record size: 2kB+
- number of records per client: 100000
- total data size: 20GB+
- throttle vs. no throttle:
  - catchup duration: -20% (lower than no throttle)
  - normal insert duration: +4% (higher than no throttle)
My updates cover the following:
- Separate the PR into two commits:
  - Commits start blocking only if STREAMING or CATCHUP within range. (almost no change)
  - Add throttle logic to slow down WAL writes in CATCHUP phase. (changes mainly in this one)
- Revise the throttle logic to reduce the scope of the exclusive lock WalCatchupThrottleLock.
- Add an atomic variable state_value to identify specific state-related conditions atomically, instead of using a spinlock.
- Add a generic callback function WalSndSetStateCallback to converge state-related indicators into a single state_value. Currently, it only supports the catchup state.
- Rewrite gp_is_mirror_catching_up() to check the mirror's effective catchup state, as indicated by state_value, atomically.
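The atomic state_value idea can be illustrated with a small standalone sketch. It uses C11 `<stdatomic.h>` in place of whatever atomics the patch itself uses, and every name here (`wal_sender_shared_t`, `snd_state_t`, `STATE_FLAG_CATCHUP`, `set_state_callback`, `mirror_is_catching_up`) is hypothetical, chosen only to mirror the roles of WalSndSetStateCallback and gp_is_mirror_catching_up() described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit: set while the WAL sender is in catchup. */
#define STATE_FLAG_CATCHUP (1u << 0)

/* Hypothetical sender states, loosely modeled on a WAL sender lifecycle. */
typedef enum
{
    SND_STARTUP,
    SND_CATCHUP,
    SND_STREAMING
} snd_state_t;

/* Hypothetical shared struct: one atomic word holds all state flags. */
typedef struct
{
    atomic_uint state_value;
} wal_sender_shared_t;

/*
 * Called on every state change. Only the catchup flag is maintained here,
 * so readers need a single atomic load instead of taking a spinlock.
 */
static inline void
set_state_callback(wal_sender_shared_t *snd, snd_state_t new_state)
{
    if (new_state == SND_CATCHUP)
        atomic_fetch_or(&snd->state_value, STATE_FLAG_CATCHUP);
    else
        atomic_fetch_and(&snd->state_value, ~STATE_FLAG_CATCHUP);
}

/* Lock-free reader: true while the catchup flag is set. */
static inline bool
mirror_is_catching_up(wal_sender_shared_t *snd)
{
    return (atomic_load(&snd->state_value) & STATE_FLAG_CATCHUP) != 0;
}
```

Packing the indicators into one word leaves room for more flag bits later without adding another lock, which is one plausible reason for routing all state changes through a single callback.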
Thanks for reviewing this, @divyeshddv. I agree on taking the first commit forward and leaving the second for further evaluation.
Closing this PR for now. It can be revisited if the problem is reported with the current code and a need for this change is felt.