RGW: Migrate topics to data path v2

AliMasarweh opened this pull request 1 year ago • 44 comments

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • [ ] References tracker ticket
    • [ ] Very recent bug; references commit where it was introduced
    • [x] New feature (ticket optional)
    • [ ] Doc update (no ticket needed)
    • [ ] Code cleanup (no ticket needed)
  • Component impact
    • [ ] Affects Dashboard, opened tracker ticket
    • [ ] Affects Orchestrator, opened tracker ticket
    • [x] No impact that needs to be tracked
  • Documentation (select at least one)
    • [ ] Updates relevant documentation
    • [x] No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

AliMasarweh avatar Jan 17 '24 12:01 AliMasarweh

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Jan 24 '24 16:01 github-actions[bot]

please see the ongoing design discussion here: https://github.com/ceph/ceph/pull/54868#issuecomment-1910358687

yuvalif avatar Jan 28 '24 09:01 yuvalif

early feedback: rados is the only backend that has topic metadata, so i think we should make this migration logic rados-specific

that means we don't need to add new interfaces to rgw_sal.h, rgw_sal_filter.h, etc., and the migration code itself would live under driver/rados/ instead of in class RGWPubSub

cbodley avatar Feb 06 '24 16:02 cbodley

p.s. there are additional commits in https://github.com/ceph/ceph/pull/55152, we should base the migration work on that

cbodley avatar Feb 06 '24 17:02 cbodley

following our discussion yesterday, i started looking into two aspects of this - cancellation and locking

cancellation

if the migration logic is still running when RadosStore::finalize() is called to shut down, we need a way to cancel it. asio has a per-op cancellation feature that would be perfect for this, but it turns out that the spawn/yield_context coroutines don't support this

i think our best option, then, is to spawn the migration coroutine on a separate io_context with its own thread. that way, RadosStore::finalize() can call io_context.stop() and thread.join(). i can add an example commit to hook that up
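
a minimal sketch of that shape, assuming boost::asio and the spawn()/yield_context style rgw already uses (MigrationThread and the migrate() stub here are hypothetical names, not actual rgw code):

```cpp
#include <thread>
#include <boost/asio/io_context.hpp>
#include <boost/asio/spawn.hpp>

// run the migration coroutine on a dedicated io_context with its own
// thread, so shutdown can cancel it by stopping the io_context instead
// of waiting for the coroutine to finish
class MigrationThread {
  boost::asio::io_context ctx;
  std::thread thr;

  static void migrate(boost::asio::yield_context yield) {
    // the long-running migration coroutine would go here
  }

 public:
  void start() {
    boost::asio::spawn(ctx, [] (boost::asio::yield_context yield) {
      migrate(yield);
    });
    thr = std::thread([this] { ctx.run(); });
  }

  // RadosStore::finalize() would call this on shutdown
  void stop() {
    ctx.stop(); // abandon any outstanding async work
    if (thr.joinable()) {
      thr.join();
    }
  }
};
```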

locking

the idea was to take a cls_lock on the v1 topics metadata object to cover the migration, but i didn't fully think through the dimensioning by tenant

we could potentially use a lock on pubsub.{tenant} to cover that topic's migration, along with any buckets under pubsub.{tenant}.bucket.*. but there could be buckets in other tenants that don't have a corresponding pubsub.{tenant} topics object at all, and we need to migrate those too

maybe it's better just to do the migration with no locking, and let the radosgws race. this is fine as long as we a) use exclusive creates to write the v2 topic metadata, and b) use RGWObjVersionTracker correctly when writing RGW_ATTR_BUCKET_NOTIFICATION to buckets
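
as a rough illustration of the exclusive-create half of that, in plain librados (the oid and data layout are made up, not the actual metadata schema):

```cpp
#include <cerrno>
#include <string>
#include <rados/librados.hpp>

// when two radosgws race to write the same v2 topic object, an exclusive
// create guarantees exactly one wins; the loser gets -EEXIST and can
// treat the topic as already migrated
int write_v2_topic(librados::IoCtx& ioctx, const std::string& oid,
                   const librados::bufferlist& topic_data)
{
  librados::ObjectWriteOperation op;
  op.create(true); // exclusive: fail with -EEXIST if the object exists
  op.write_full(topic_data);
  int r = ioctx.operate(oid, &op);
  if (r == -EEXIST) {
    return 0; // another gateway got there first; not an error
  }
  return r;
}
```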

@yuvalif this same issue comes up with our definition of the 'migrating state' from https://github.com/ceph/ceph/pull/54868#issuecomment-1910358687:

"migrating" state will be derived implicitly from enabling of the topics v2 feature together with the existence of the v1 topics system object

but it's probably okay to block rgw_rest_pubsub requests on a per-tenant basis

cbodley avatar Feb 08 '24 18:02 cbodley

i think our best option, then, is to spawn the migration coroutine on a separate io_context with its own thread. that way, RadosStore::finalize() can call io_context.stop() and thread.join(). i can add an example commit to hook that up

i pushed a draft commit to https://github.com/ceph/ceph/pull/55504 that does this. i added the hooks to RGWRados instead of RadosStore so the logic would be right next to rgw::notify::init()/shutdown() and reuse the run_notification_thread flag so we can avoid starting migration in radosgw-admin commands

it adds a new driver/rados/topic_migration.cc file - let's put all of the migration logic there instead of adding more to RadosStore. there are also example functions migrate_topics() and migrate_notifications() with notes about the error handling
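
for reference, a hypothetical skeleton of that entry point - the two-pass split matches the description above, but the signatures and bodies here are guesses, not the draft commit's actual code:

```cpp
#include <boost/asio/spawn.hpp>

namespace rgwrados::topic_migration {

// pass 1: copy each v1 topic into its v2 metadata entry
int migrate_topics(boost::asio::yield_context y);

// pass 2: rewrite each bucket's RGW_ATTR_BUCKET_NOTIFICATION attr,
// then remove the corresponding v1 metadata objects
int migrate_notifications(boost::asio::yield_context y);

int migrate(boost::asio::yield_context y)
{
  if (int r = migrate_topics(y); r < 0) {
    return r; // leave v1 metadata intact so a later run can retry
  }
  return migrate_notifications(y);
}

} // namespace rgwrados::topic_migration
```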

@AliMasarweh if that looks ok to you, do you want to pull that commit into your pr branch?

cbodley avatar Feb 08 '24 19:02 cbodley

i think our best option, then, is to spawn the migration coroutine on a separate io_context with its own thread. that way, RadosStore::finalize() can call io_context.stop() and thread.join(). i can add an example commit to hook that up

i pushed a draft commit to #55504 that does this. i added the hooks to RGWRados instead of RadosStore so the logic would be right next to rgw::notify::init()/shutdown() and reuse the run_notification_thread flag so we can avoid starting migration in radosgw-admin commands

it adds a new driver/rados/topic_migration.cc file - let's put all of the migration logic there instead of adding more to RadosStore. there are also example functions migrate_topics() and migrate_notifications() with notes about the error handling

@AliMasarweh if that looks ok to you, do you want to pull that commit into your pr branch?

Yes, it looks ok. I have pulled that commit and completed the missing segments of code; I think it is ready for review

AliMasarweh avatar Feb 12 '24 11:02 AliMasarweh

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Feb 12 '24 12:02 github-actions[bot]

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Feb 12 '24 12:02 github-actions[bot]

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Feb 12 '24 16:02 github-actions[bot]

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Feb 13 '24 11:02 github-actions[bot]

PR passed manual testing

AliMasarweh avatar Feb 14 '24 13:02 AliMasarweh

what are the next steps for test coverage? it seems that we need some script that creates lots of v1 metadata across several tenants (including empty tenant), then enables v2 and verifies that we end up with valid v2 metadata and that all v1 metadata objects get deleted

cbodley avatar Feb 15 '24 15:02 cbodley

what are the next steps for test coverage? it seems that we need some script that creates lots of v1 metadata across several tenants (including empty tenant), then enables v2 and verifies that we end up with valid v2 metadata and that all v1 metadata objects get deleted

Sure. Where should I write the test? We usually write tests in test_bn.py, but I don't think that runs a multi-site cluster

AliMasarweh avatar Feb 18 '24 15:02 AliMasarweh

what are the next steps for test coverage? it seems that we need some script that creates lots of v1 metadata across several tenants (including empty tenant), then enables v2 and verifies that we end up with valid v2 metadata and that all v1 metadata objects get deleted

Sure. Where should I write the test? We usually write tests in test_bn.py, but I don't think that runs a multi-site cluster

we don't need a multisite cluster, we need a single site but with a realm defined. with this commit: https://github.com/ceph/ceph/commit/f01871015df4fa2b8340430b78551e07a83ad6dd the test_bn.py script can run against a single zone in a multisite setup (or a single-site setup with a realm). this means that you can write a python test similar to the instructions here: https://gist.github.com/yuvalif/21449e301732b719cd1ed97c3eeeabb2 to test the migration. doing that locally should be relatively straightforward; however, running it in teuthology would require a different setup than the one we currently use in the "rgw:notifications" suite. so, i would recommend marking this initially as a manual test so it won't run in teuthology, and adding it to teuthology in a separate effort.

yuvalif avatar Feb 19 '24 07:02 yuvalif

couple of notes regarding the testing:

  • to cover a possible race condition between RGWs (in the same zone), we should test with the env var RGW_PER_ZONE=2
  • please make sure that there are 2 tenants in the test

yuvalif avatar Feb 19 '24 11:02 yuvalif

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Feb 20 '24 09:02 github-actions[bot]

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Feb 22 '24 20:02 github-actions[bot]

jenkins test make check

AliMasarweh avatar Mar 04 '24 09:03 AliMasarweh

jenkins test make check

AliMasarweh avatar Mar 04 '24 12:03 AliMasarweh

jenkins test make check

AliMasarweh avatar Mar 04 '24 14:03 AliMasarweh

jenkins test make check

AliMasarweh avatar Mar 04 '24 15:03 AliMasarweh

i would like to see more detailed test coverage of the metadata migration itself. it doesn't need to start any notification endpoints or generate any notifications. it just needs to create lots of v1 metadata and verify that we end up with the right v2 metadata

i think the easiest way to test that is to expose a radosgw-admin command that calls rgwrados::topic_migration::migrate() directly. if you stop all radosgws before enabling notification_v2, we can ensure that this radosgw-admin command handles all of the migration. the benefit of using a command for testing is that you can be sure that all migration has completed once the command returns. if we rely on radosgws to run this migration code in the background, we don't know precisely when it completes - we'd have to keep polling and sleeping until the metadata is gone, making the test slower and less reliable

requested test coverage:

  • create enough metadata objects to exceed the pagination limit. that's currently hard-coded as constexpr uint32_t max = 100, but the radosgw-admin command could take a custom --max-entries for that and pass a smaller value
  • create topic and notification metadata under several different tenant names, including the empty tenant
  • after completion, verify with rados ls that all v1 metadata objects are removed
  • verify that v2 metadata can still be read via apis like ListTopics and GetBucketNotificationConfiguration (restart radosgws once the radosgw-admin command completes)
  • test migration behavior when v2 metadata already exists. consider a radosgw-admin option like --skip-deletes that skips the deletion of v1 metadata. that way we can run the command once with that flag, then again without it to verify that it doesn't cause migration to fail

cbodley avatar Mar 04 '24 17:03 cbodley

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

github-actions[bot] avatar Mar 05 '24 17:03 github-actions[bot]

i think the easiest way to test that is to expose a radosgw-admin command that calls rgwrados::topic_migration::migrate() directly.

the command will be for test purposes only?

if you stop all radosgws before enabling notification_v2, we can ensure that this radosgw-admin command handles all of the migration. the benefit of using a command for testing is that you can be sure that all migration has completed once the command returns.

the problem is that, AFAIK, we don't have infra in teuthology to start and stop RGWs. this means that this will be a local test only. we can probably add that, but it will require more development in teuthology (which we want to minimize).

if we rely on radosgws to run this migration code in the background, we don't know precisely when it completes - we'd have to keep polling and sleeping until the metadata is gone, making the test slower and less reliable

currently we are just doing "sleep", which creates problems in both directions (flaky tests and/or long tests). we can use radosgw-admin to poll for the topic list; it should fail as long as we are migrating. i don't see that as a big issue

requested test coverage:

  • create enough metadata objects to exceed the pagination limit. that's currently hard-coded as constexpr uint32_t max = 100, but the radosgw-admin command could take a custom --max-entries for that and pass a smaller value
  • create topic and notification metadata under several different tenant names, including the empty tenant
  • after completion, verify with rados ls that all v1 metadata objects are removed
  • verify that v2 metadata can still be read via apis like ListTopics and GetBucketNotificationConfiguration (restart radosgws once the radosgw-admin command completes)
  • test migration behavior when v2 metadata already exists. consider a radosgw-admin option like --skip-deletes that skips the deletion of v1 metadata. that way we can run the command once with that flag, then again without it to verify that it doesn't cause migration to fail

sounds good

yuvalif avatar Mar 06 '24 10:03 yuvalif

the problem is that, AFAIK, we don't have infra in teuthology to start and stop RGWs. this means that this will be a local test only. we can probably add that, but it will require more development in teuthology (which we want to minimize).

there are a couple of ways to do that in teuthology, but you're right that it complicates matters. maybe we can avoid that altogether by preventing the rgws from doing the migration themselves

currently, RGWRados::init_complete() only calls into rgwrados::topic_migration::migrate() when run_notification_thread is true, so that radosgw-admin commands don't start the background migration process. rgw_admin.cc always passes false, and radosgw (via rgw::AppMain::init_storage()) always passes true

if we add a config option rgw_run_notification_thread similar to rgw_run_sync_thread to control that, we could turn that off for the teuthology job that runs this migration test. then we could be sure that our new radosgw-admin command is the only one running the migration code
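
a rough sketch of the read side of such an option, assuming a new boolean named rgw_run_notification_thread modeled on rgw_run_sync_thread (the option does not exist yet):

```cpp
#include "common/ceph_context.h" // ceph-internal header

// gate the notification thread - and with it the background topic
// migration - on the hypothetical rgw_run_notification_thread option;
// RGWRados::init_complete() would combine this with the existing
// run_notification_thread flag
bool should_run_notification_thread(CephContext* cct)
{
  return cct->_conf.get_val<bool>("rgw_run_notification_thread");
}
```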

obviously we'd need to separate those migration test cases from the rgw/notifications jobs which do depend on that notification thread

cbodley avatar Mar 06 '24 17:03 cbodley

  • after completion, verify with rados ls that all v1 metadata objects are removed

using rados ls might be tricky if you don't know exactly what pool name to expect. as an alternative, the proposed radosgw-admin command could return a json array of all objects visited. after running the migration once, we could run the command again to verify that it returns an empty array

cbodley avatar Mar 06 '24 17:03 cbodley

@cbodley, @AliMasarweh:

we will add the following test cases as part of test_bn.py:

larger scale migration

  • create 2 additional users that belong to 2 named tenants
  • disable v2
  • create 100 topics+buckets+notifications for each of the above users and for the default user (with default tenant)
  • enable v2
  • poll the topic list command, which should fail until migration is complete
  • once the command is successful, verify that we can get the 300 topics and notifications

migration with v2 configuration

  • create 2 additional users that belong to 2 named tenants
  • create topic+bucket+notification for each of the above users and for the default user (with default tenant)
  • disable v2
  • create topic+bucket+notification for each of the above users and for the default user (with default tenant)
  • enable v2
  • wait for migration to complete
  • verify that we can get the 6 topics and notifications

currently, the above new tests, as well as the ones already created, will run locally. in order to run them in teuthology, we would first need https://github.com/ceph/ceph/pull/56005. once it is merged, we should revert the change from here: https://github.com/ceph/ceph/pull/55214#pullrequestreview-1914611297 and also populate the bntests.conf.SAMPLE file used in teuthology with the correct values. see: https://github.com/ceph/ceph/blob/main/qa/tasks/notification_tests.py

yuvalif avatar Mar 07 '24 09:03 yuvalif

thanks @yuvalif

  • poll the topic list command, which should fail until migration is complete

the "topic list" failure mode is per-tenant. it may succeed even though other tenants haven't finished topic migration. it may also succeed if the tenant's 'topics' object migrated, but we haven't finished migrating all of the tenant's bucket notifications

cbodley avatar Mar 07 '24 17:03 cbodley

thanks @yuvalif

  • poll the topic list command, which should fail until migration is complete

the "topic list" failure mode is per-tenant. it may succeed even though other tenants haven't finished topic migration. it may also succeed if the tenant's 'topics' object migrated, but we haven't finished migrating all of the tenant's bucket notifications

We can call topic list for each of the tenants. I think that the call should fail as long as we are migrating

yuvalif avatar Mar 07 '24 18:03 yuvalif