RGW: Migrate topics to data path v2

AliMasarweh opened this pull request 1 year ago • 44 comments

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • [ ] References tracker ticket
    • [ ] Very recent bug; references commit where it was introduced
    • [x] New feature (ticket optional)
    • [ ] Doc update (no ticket needed)
    • [ ] Code cleanup (no ticket needed)
  • Component impact
    • [ ] Affects Dashboard, opened tracker ticket
    • [ ] Affects Orchestrator, opened tracker ticket
    • [x] No impact that needs to be tracked
  • Documentation (select at least one)
    • [ ] Updates relevant documentation
    • [x] No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

AliMasarweh avatar Jan 17 '24 12:01 AliMasarweh

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Jan 24 '24 16:01 github-actions[bot]

please see the ongoing design discussion here: https://github.com/ceph/ceph/pull/54868#issuecomment-1910358687

yuvalif avatar Jan 28 '24 09:01 yuvalif

early feedback: rados is the only backend that has topic metadata, so i think we should make this migration logic rados-specific

that means we don't need to add new interfaces to rgw_sal.h, rgw_sal_filter.h, etc., and the migration code itself would live under driver/rados/ instead of in class RGWPubSub

cbodley avatar Feb 06 '24 16:02 cbodley

p.s. there are additional commits in https://github.com/ceph/ceph/pull/55152, we should base the migration work on that

cbodley avatar Feb 06 '24 17:02 cbodley

following our discussion yesterday, i started looking into two aspects of this - cancellation and locking

cancellation

if the migration logic is still running when RadosStore::finalize() is called to shut down, we need a way to cancel it. asio has a per-op cancellation feature that would be perfect for this, but it turns out that the spawn/yield_context coroutines don't support this

i think our best option, then, is to spawn the migration coroutine on a separate io_context with its own thread. that way, RadosStore::finalize() can call io_context.stop() and thread.join(). i can add an example commit to hook that up
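
a minimal sketch of that shape, assuming boost::asio and the spawn()/yield_context style rgw already uses (MigrationThread and the migrate() stub here are hypothetical names, not actual rgw code):

```cpp
#include <thread>
#include <boost/asio/io_context.hpp>
#include <boost/asio/spawn.hpp>

// run the migration coroutine on a dedicated io_context with its own
// thread, so shutdown can cancel it by stopping the io_context instead
// of waiting for the coroutine to finish
class MigrationThread {
  boost::asio::io_context ctx;
  std::thread thr;

  static void migrate(boost::asio::yield_context yield) {
    // the long-running migration coroutine would go here
  }

 public:
  void start() {
    boost::asio::spawn(ctx, [] (boost::asio::yield_context yield) {
      migrate(yield);
    });
    thr = std::thread([this] { ctx.run(); });
  }

  // RadosStore::finalize() would call this on shutdown
  void stop() {
    ctx.stop(); // abandon any outstanding async work
    if (thr.joinable()) {
      thr.join();
    }
  }
};
```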

locking

the idea was to take a cls_lock on the v1 topics metadata object to cover the migration, but i didn't fully think through the dimensioning by tenant

we could potentially use a lock on pubsub.{tenant} to cover that topic's migration, along with any buckets under pubsub.{tenant}.bucket.*. but there could be buckets in other tenants that don't have a corresponding pubsub.{tenant} topics object at all, and we need to migrate those too

maybe it's better just to do the migration with no locking, and let the radosgws race. this is fine as long as we a) use exclusive creates to write the v2 topic metadata, and b) use RGWObjVersionTracker correctly when writing RGW_ATTR_BUCKET_NOTIFICATION to buckets
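
as a rough illustration of the exclusive-create half of that, in plain librados (the oid and data layout are made up, not the actual metadata schema):

```cpp
#include <cerrno>
#include <string>
#include <rados/librados.hpp>

// when two radosgws race to write the same v2 topic object, an exclusive
// create guarantees exactly one wins; the loser gets -EEXIST and can
// treat the topic as already migrated
int write_v2_topic(librados::IoCtx& ioctx, const std::string& oid,
                   const librados::bufferlist& topic_data)
{
  librados::ObjectWriteOperation op;
  op.create(true); // exclusive: fail with -EEXIST if the object exists
  op.write_full(topic_data);
  int r = ioctx.operate(oid, &op);
  if (r == -EEXIST) {
    return 0; // another gateway got there first; not an error
  }
  return r;
}
```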

@yuvalif this same issue comes up with our definition of the 'migrating state' from https://github.com/ceph/ceph/pull/54868#issuecomment-1910358687:

"migrating" state will be derived implicitly from enabling of the topics v2 feature together with the existence of the v1 topics system object

but it's probably okay to block rgw_rest_pubsub requests on a per-tenant basis

cbodley avatar Feb 08 '24 18:02 cbodley

i think our best option, then, is to spawn the migration coroutine on a separate io_context with its own thread. that way, RadosStore::finalize() can call io_context.stop() and thread.join(). i can add an example commit to hook that up

i pushed a draft commit to https://github.com/ceph/ceph/pull/55504 that does this. i added the hooks to RGWRados instead of RadosStore so the logic would be right next to rgw::notify::init()/shutdown() and reuse the run_notification_thread flag so we can avoid starting migration in radosgw-admin commands

it adds a new driver/rados/topic_migration.cc file - let's put all of the migration logic there instead of adding more to RadosStore. there are also example functions migrate_topics() and migrate_notifications() with notes about the error handling
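
for reference, a hypothetical skeleton of that entry point - the two-pass split matches the description above, but the signatures and bodies here are guesses, not the draft commit's actual code:

```cpp
#include <boost/asio/spawn.hpp>

namespace rgwrados::topic_migration {

// pass 1: copy each v1 topic into its v2 metadata entry
int migrate_topics(boost::asio::yield_context y);

// pass 2: rewrite each bucket's RGW_ATTR_BUCKET_NOTIFICATION attr,
// then remove the corresponding v1 metadata objects
int migrate_notifications(boost::asio::yield_context y);

int migrate(boost::asio::yield_context y)
{
  if (int r = migrate_topics(y); r < 0) {
    return r; // leave v1 metadata intact so a later run can retry
  }
  return migrate_notifications(y);
}

} // namespace rgwrados::topic_migration
```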

@AliMasarweh if that looks ok to you, do you want to pull that commit into your pr branch?

cbodley avatar Feb 08 '24 19:02 cbodley

i think our best option, then, is to spawn the migration coroutine on a separate io_context with its own thread. that way, RadosStore::finalize() can call io_context.stop() and thread.join(). i can add an example commit to hook that up

i pushed a draft commit to #55504 that does this. i added the hooks to RGWRados instead of RadosStore so the logic would be right next to rgw::notify::init()/shutdown() and reuse the run_notification_thread flag so we can avoid starting migration in radosgw-admin commands

it adds a new driver/rados/topic_migration.cc file - let's put all of the migration logic there instead of adding more to RadosStore. there are also example functions migrate_topics() and migrate_notifications() with notes about the error handling

@AliMasarweh if that looks ok to you, do you want to pull that commit into your pr branch?

Yes, it looks ok. I have pulled that commit and completed the missing segments of code; I think it is ready for review

AliMasarweh avatar Feb 12 '24 11:02 AliMasarweh

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Feb 12 '24 12:02 github-actions[bot]

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Feb 12 '24 12:02 github-actions[bot]

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Feb 12 '24 16:02 github-actions[bot]

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Feb 13 '24 11:02 github-actions[bot]

PR passed manual testing

AliMasarweh avatar Feb 14 '24 13:02 AliMasarweh

what are the next steps for test coverage? it seems that we need some script that creates lots of v1 metadata across several tenants (including empty tenant), then enables v2 and verifies that we end up with valid v2 metadata and that all v1 metadata objects get deleted

cbodley avatar Feb 15 '24 15:02 cbodley

what are the next steps for test coverage? it seems that we need some script that creates lots of v1 metadata across several tenants (including empty tenant), then enables v2 and verifies that we end up with valid v2 metadata and that all v1 metadata objects get deleted

Sure. Where should I write the test? We usually write tests in test_bn.py, but I don't think that runs a multi-site cluster

AliMasarweh avatar Feb 18 '24 15:02 AliMasarweh

what are the next steps for test coverage? it seems that we need some script that creates lots of v1 metadata across several tenants (including empty tenant), then enables v2 and verifies that we end up with valid v2 metadata and that all v1 metadata objects get deleted

Sure. Where should I write the test? We usually write tests in test_bn.py, but I don't think that runs a multi-site cluster

we don't need a multisite cluster, we need a single site but with a realm defined. with this commit: https://github.com/ceph/ceph/commit/f01871015df4fa2b8340430b78551e07a83ad6dd the test_bn.py script can run against a single zone in a multisite setup (or a single-site setup with a realm). this means that you can write a python test similar to the instructions here: https://gist.github.com/yuvalif/21449e301732b719cd1ed97c3eeeabb2 to test the migration. doing that locally should be relatively straightforward; however, running it in teuthology would require a different setup than the one we currently use in the "rgw:notifications" suite. so, i would recommend marking this initially as a manual test so it won't run in teuthology, and adding it to teuthology in a separate effort.

yuvalif avatar Feb 19 '24 07:02 yuvalif

couple of notes regarding the testing:

  • to cover a possible race condition between RGWs (in the same zone), we should test with the env var RGW_PER_ZONE=2
  • please make sure that there are 2 tenants in the test

yuvalif avatar Feb 19 '24 11:02 yuvalif

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Feb 20 '24 09:02 github-actions[bot]

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Feb 22 '24 20:02 github-actions[bot]

jenkins test make check

AliMasarweh avatar Mar 04 '24 09:03 AliMasarweh

jenkins test make check

AliMasarweh avatar Mar 04 '24 12:03 AliMasarweh

jenkins test make check

AliMasarweh avatar Mar 04 '24 14:03 AliMasarweh

jenkins test make check

AliMasarweh avatar Mar 04 '24 15:03 AliMasarweh

i would like to see more detailed test coverage of the metadata migration itself. it doesn't need to start any notification endpoints or generate any notifications. it just needs to create lots of v1 metadata and verify that we end up with the right v2 metadata

i think the easiest way to test that is to expose a radosgw-admin command that calls rgwrados::topic_migration::migrate() directly. if you stop all radosgws before enabling notification_v2, we can ensure that this radosgw-admin command handles all of the migration. the benefit of using a command for testing is that you can be sure that all migration has completed once the command returns. if we rely on radosgws to run this migration code in the background, we don't know precisely when it completes - we'd have to keep polling and sleeping until the metadata is gone, making the test slower and less reliable

requested test coverage:

  • create enough metadata objects to exceed the pagination limit. that's currently hard-coded as constexpr uint32_t max = 100, but the radosgw-admin command could take a custom --max-entries for that and pass a smaller value
  • create topic and notification metadata under several different tenant names, including the empty tenant
  • after completion, verify with rados ls that all v1 metadata objects are removed
  • verify that v2 metadata can still be read via apis like ListTopics and GetBucketNotificationConfiguration (restart radosgws once the radosgw-admin command completes)
  • test migration behavior when v2 metadata already exists. consider a radosgw-admin option like --skip-deletes that skips the deletion of v1 metadata. that way we can run the command once with that flag, then again without it to verify that it doesn't cause migration to fail

cbodley avatar Mar 04 '24 17:03 cbodley

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Mar 05 '24 17:03 github-actions[bot]

i think the easiest way to test that is to expose a radosgw-admin command that calls rgwrados::topic_migration::migrate() directly.

the command will be for test purposes only?

if you stop all radosgws before enabling notification_v2, we can ensure that this radosgw-admin command handles all of the migration. the benefit of using a command for testing is that you can be sure that all migration has completed once the command returns.

the problem is that, AFAIK, we don't have infra in teuthology to start and stop RGWs. this means that this will be a local test only. we can probably add that, but it will require more development in teuthology (which we want to minimize).

if we rely on radosgws to run this migration code in the background, we don't know precisely when it completes - we'd have to keep polling and sleeping until the metadata is gone, making the test slower and less reliable

currently we are just doing "sleep", which creates problems in both directions (flaky tests and/or long tests). we can use radosgw-admin to poll for the topic list; it should fail as long as we are migrating. i don't see that as a big issue

requested test coverage:

  • create enough metadata objects to exceed the pagination limit. that's currently hard-coded as constexpr uint32_t max = 100, but the radosgw-admin command could take a custom --max-entries for that and pass a smaller value
  • create topic and notification metadata under several different tenant names, including the empty tenant
  • after completion, verify with rados ls that all v1 metadata objects are removed
  • verify that v2 metadata can still be read via apis like ListTopics and GetBucketNotificationConfiguration (restart radosgws once the radosgw-admin command completes)
  • test migration behavior when v2 metadata already exists. consider a radosgw-admin option like --skip-deletes that skips the deletion of v1 metadata. that way we can run the command once with that flag, then again without it to verify that it doesn't cause migration to fail

sounds good

yuvalif avatar Mar 06 '24 10:03 yuvalif

the problem is that, AFAIK, we don't have infra in teuthology to start and stop RGWs. this means that this will be a local test only. we can probably add that, but it will require more development in teuthology (which we want to minimize).

there are a couple of ways to do that in teuthology, but you're right that it complicates matters. maybe we can avoid that altogether by preventing the rgws from doing the migration themselves

currently, RGWRados::init_complete() only calls into rgwrados::topic_migration::migrate() when run_notification_thread is true, so that radosgw-admin commands don't start the background migration process. rgw_admin.cc always passes false, and radosgw (via rgw::AppMain::init_storage()) always passes true

if we add a config option rgw_run_notification_thread similar to rgw_run_sync_thread to control that, we could turn that off for the teuthology job that runs this migration test. then we could be sure that our new radosgw-admin command is the only one running the migration code
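
a rough sketch of the read side of such an option, assuming a new boolean named rgw_run_notification_thread modeled on rgw_run_sync_thread (the option does not exist yet):

```cpp
#include "common/ceph_context.h" // ceph-internal header

// gate the notification thread - and with it the background topic
// migration - on the hypothetical rgw_run_notification_thread option;
// RGWRados::init_complete() would combine this with the existing
// run_notification_thread flag
bool should_run_notification_thread(CephContext* cct)
{
  return cct->_conf.get_val<bool>("rgw_run_notification_thread");
}
```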

obviously we'd need to separate those migration test cases from the rgw/notifications jobs which do depend on that notification thread

cbodley avatar Mar 06 '24 17:03 cbodley

  • after completion, verify with rados ls that all v1 metadata objects are removed

using rados ls might be tricky if you don't know exactly what pool name to expect. as an alternative, the proposed radosgw-admin command could return a json array of all objects visited. after running the migration once, we could run the command again to verify that it returns an empty array

cbodley avatar Mar 06 '24 17:03 cbodley

@cbodley, @AliMasarweh:

we will add the following test cases as part of test_bn.py:

larger scale migration

  • create 2 additional users that belong to 2 named tenants
  • disable v2
  • create 100 topics+buckets+notifications for each of the above users and for the default user (with default tenant)
  • enable v2
  • poll the topic list command, which should fail until migration is complete
  • once the command is successful, verify that we can get the 300 topics and notifications

migration with v2 configuration

  • create 2 additional users that belong to 2 named tenants
  • create topic+bucket+notification for each of the above users and for the default user (with default tenant)
  • disable v2
  • create topic+bucket+notification for each of the above users and for the default user (with default tenant)
  • enable v2
  • wait for migration to complete
  • verify that we can get the 6 topics and notifications

currently, the above new tests, as well as the ones already created, will run locally. in order to run them in teuthology, we would first need https://github.com/ceph/ceph/pull/56005. once it is merged, we should revert the change from here: https://github.com/ceph/ceph/pull/55214#pullrequestreview-1914611297 and also populate the bntests.conf.SAMPLE file used in teuthology with the correct values. see: https://github.com/ceph/ceph/blob/main/qa/tasks/notification_tests.py

yuvalif avatar Mar 07 '24 09:03 yuvalif

thanks @yuvalif

  • poll the topic list command, which should fail until migration is complete

the "topic list" failure mode is per-tenant. it may succeed even though other tenants haven't finished topic migration. it may also succeed if the tenant's 'topics' object migrated, but we haven't finished migrating all of the tenant's bucket notifications

cbodley avatar Mar 07 '24 17:03 cbodley

thanks @yuvalif

  • poll the topic list command, which should fail until migration is complete

the "topic list" failure mode is per-tenant. it may succeed even though other tenants haven't finished topic migration. it may also succeed if the tenant's 'topics' object migrated, but we haven't finished migrating all of the tenant's bucket notifications

We can call topic list for each of the tenants. I think that the call should fail as long as we are migrating

yuvalif avatar Mar 07 '24 18:03 yuvalif