sync_gateway
CBG-2213: Better handle CBGT feed rolling upgrades
CBG-2213
Change the behaviour of the sharded DCP feed to accommodate rolling online upgrades from pre-Helium to Helium nodes.
- Rather than registering a new feed type for the gocbcore-backed feed, reuse the existing feed type - when only streaming the default collection, there is no functional difference. This means that there is no need to delete and re-create the index upon upgrade, which simplifies behaviour.
- This also means we can simplify the connection string logic, as there is no need to replace `couchbase[s]://` with `http[s]://` for gocbcore (as we used to for cbdatasource).
- Add an integration test for the above behaviour. The upgrade itself will likely need functional testing (and/or CBG-2273).
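For context, the scheme rewrite that is no longer needed can be pictured roughly as below. This is a simplified sketch, not the code removed by this PR: `legacyHTTPScheme` is a hypothetical name, and it only swaps the scheme prefix, leaving aside any port or multi-host handling the real cbdatasource path may have done.

```go
package main

import (
	"fmt"
	"strings"
)

// legacyHTTPScheme is a hypothetical helper illustrating the kind of rewrite
// described above: couchbase:// becomes http:// and couchbases:// becomes
// https://. Any other connection string is returned unchanged.
func legacyHTTPScheme(connStr string) string {
	switch {
	case strings.HasPrefix(connStr, "couchbases://"):
		return "https://" + strings.TrimPrefix(connStr, "couchbases://")
	case strings.HasPrefix(connStr, "couchbase://"):
		return "http://" + strings.TrimPrefix(connStr, "couchbase://")
	default:
		return connStr
	}
}

func main() {
	// Example inputs are made up for illustration.
	for _, s := range []string{"couchbase://host1", "couchbases://host1", "http://host1:8091"} {
		fmt.Printf("%s -> %s\n", s, legacyHTTPScheme(s))
	}
}
```

With the gocbcore-backed feed, the `couchbase[s]://` string can be passed through as-is, so a step like this drops out of the feed setup.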
Pre-review checklist
- [x] Removed debug logging (`fmt.Print`, `log.Print`, ...)
- [x] Logging sensitive data? Make sure it's tagged (e.g. `base.UD(docID)`, `base.MD(dbName)`)
- [x] Updated relevant information in the API specifications (such as endpoint descriptions, schemas, ...) in `docs/api`
Integration Tests
- [ ] `server=7.0.3` https://jenkins.sgwdev.com/job/SyncGateway-Integration/530/
- [ ] `server=6.6.5` https://jenkins.sgwdev.com/job/SyncGateway-Integration/531/
- Bucket flush failures on both. Re-ran with only the `db` package in https://jenkins.sgwdev.com/job/SyncGateway-Integration/585/ and https://jenkins.sgwdev.com/job/SyncGateway-Integration/586/ respectively.
@markspolakovs to manually test as well
Verified manually:
- Started a SG 3.0.0
- Ran a continuous write load against the bucket, and verified that the documents were getting imported
- Started a SG built on this branch, verified that both were continuously importing (the new SG got 8 partitions assigned)
- Killed the old SG, verified that after ~30 seconds the new SG took over its partitions and continued import without starting from zero.
Adapted the unit test to check this as well, though testing the exact sequence handling is difficult. Assigning to @adamcfraser for final code review.
Windows CI issue is #5695, rebased to pick up fix.
@markspolakovs Changes look good, but it looks like the CI ee-unit-tests are failing on the new TestShardedDCPUpgrade test while waiting for stale nodes to be cleaned up. Have you already looked at that?
Test failure only occurred against Walrus - skipped it there.