sync_gateway

CBG-2213: Better handle CBGT feed rolling upgrades

Open markspolakovs opened this issue 3 years ago • 2 comments

CBG-2213

Change the behaviour of the sharded DCP feed to accommodate rolling online upgrades from pre-Helium to Helium nodes.

  • Rather than registering a new feed type for the gocbcore-backed feed, reuse the existing feed type - when only streaming the default collection, there is no functional difference. This means that there is no need to delete and re-create the index upon upgrade, which simplifies behaviour.
    • This also means we can simplify the connection string logic, as there is no need to replace couchbase[s]:// with http[s]:// for gocbcore (as we used to for cbdatasource).
  • Add an integration test for the above behaviour. The upgrade itself will likely need functional testing (and/or CBG-2273).
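For context, the scheme rewrite that is no longer needed can be sketched roughly as below. This is only an illustration: `legacyFeedURL` is a hypothetical name, not a function from the Sync Gateway codebase, and the real cbdatasource handling may have done more (ports, multiple hosts, query options) than a bare prefix swap.

```go
package main

import (
	"fmt"
	"strings"
)

// legacyFeedURL is a hypothetical helper (not taken from Sync Gateway).
// It illustrates the kind of scheme rewrite described above for the
// cbdatasource-backed feed: couchbase:// becomes http:// and
// couchbases:// becomes https://. Any other input is returned unchanged.
// Assumption: only the scheme prefix is swapped; ports and options are ignored.
func legacyFeedURL(connStr string) string {
	switch {
	case strings.HasPrefix(connStr, "couchbases://"):
		return "https://" + strings.TrimPrefix(connStr, "couchbases://")
	case strings.HasPrefix(connStr, "couchbase://"):
		return "http://" + strings.TrimPrefix(connStr, "couchbase://")
	default:
		return connStr
	}
}

func main() {
	for _, s := range []string{
		"couchbase://localhost",
		"couchbases://cb.example.com",
		"http://localhost:8091",
	} {
		fmt.Printf("%s -> %s\n", s, legacyFeedURL(s))
	}
}
```

With the gocbcore-backed feed, the connection string can be passed through as-is, so a step like this drops out of the setup path.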

Pre-review checklist

  • [x] Removed debug logging (fmt.Print, log.Print, ...)
  • [x] Logging sensitive data? Make sure it's tagged (e.g. base.UD(docID), base.MD(dbName))
  • [x] Updated relevant information in the API specifications (such as endpoint descriptions, schemas, ...) in docs/api

Integration Tests

  • [ ] server=7.0.3 https://jenkins.sgwdev.com/job/SyncGateway-Integration/530/
  • [ ] server=6.6.5 https://jenkins.sgwdev.com/job/SyncGateway-Integration/531/
  • Bucket flush failures on both. Re-ran with only the db package in https://jenkins.sgwdev.com/job/SyncGateway-Integration/585/ and https://jenkins.sgwdev.com/job/SyncGateway-Integration/586/ respectively.

markspolakovs avatar Aug 03 '22 16:08 markspolakovs

@markspolakovs to manually test as well

markspolakovs avatar Aug 09 '22 16:08 markspolakovs

Verified manually:

  1. Started a SG 3.0.0
  2. Ran a continuous write load against the bucket, and verified that the documents were getting imported
  3. Started a SG built on this branch, verified that both were continuously importing (the new SG got 8 partitions assigned)
  4. Killed the old SG, verified that after ~30 seconds the new SG took over its partitions and continued import without starting from zero.

Adapted the unit test to check this as well, though testing the exact sequence handling is difficult. Assigning to @adamcfraser for final code review.

markspolakovs avatar Aug 10 '22 16:08 markspolakovs

Windows CI issue is #5695, rebased to pick up fix.

markspolakovs avatar Aug 12 '22 16:08 markspolakovs

@markspolakovs Changes look good, but it looks like the CI ee-unit-tests are failing on the new TestShardedDCPUpgrade test while waiting for stale nodes to be cleaned up. Have you already looked at that?

adamcfraser avatar Aug 16 '22 20:08 adamcfraser

Test failure only occurred against Walrus - skipped it there.

markspolakovs avatar Aug 16 '22 20:08 markspolakovs