Epic: move outbound logical replication out of Beta

Open stepashka opened this issue 1 year ago • 18 comments

DoD

Logical replication is no longer in beta on Neon, and wal_level = logical can be enabled by default on all projects on the Neon platform.

### Tasks & bugs to fix
- [ ] https://github.com/neondatabase/neon/issues/6370
- [ ] https://github.com/neondatabase/neon/issues/6371
- [ ] https://github.com/neondatabase/neon/pull/6221
- [ ] https://github.com/neondatabase/neon/issues/6182
- [ ] https://github.com/neondatabase/neon/issues/6229
- [ ] https://github.com/neondatabase/cloud/pull/9015
- [ ] https://github.com/neondatabase/neon/issues/6257
- [ ] https://github.com/neondatabase/neon/issues/7593
- [x] Check that AUX v2 is default for all new tenants on pageserver
- [ ] https://github.com/neondatabase/neon/issues/8349
- [ ] https://github.com/neondatabase/cloud/issues/15226
- [ ] https://github.com/neondatabase/neon/issues/6626
- [x] Figure out observability of lagging publisher (slot retaining a lot of WAL)
- [ ] https://github.com/neondatabase/neon/issues/5885
- [ ] https://github.com/neondatabase/neon/issues/8931
- [ ] https://github.com/neondatabase/cloud/issues/17261
- [ ] https://github.com/neondatabase/neon/issues/8619
- [ ] Logical slots are copied to the replica and prevent WAL truncation, see https://github.com/neondatabase/neon/pull/9425#discussion_r1804820659

Follow-ups (out of scope):

  • https://github.com/neondatabase/neon/issues/6258

Other related tasks and Epics

  • https://github.com/neondatabase/cloud/issues/8892

https://neondb.slack.com/archives/C04DGM6SMTM/p1703091242312799

stepashka avatar Dec 21 '23 11:12 stepashka

@arssher, will your fixes help with the first two items?

vadim2404 avatar Jan 02 '24 17:01 vadim2404

slot may disappear on restart (hard to reproduce, only occurred once)

I still don't know what that was. Stas tested manually and observed this once. There is a known path by which a slot might be lost (the endpoint is killed before the logical message is committed to safekeepers), but it is highly unlikely, and Stas's case wasn't like that. This needs more testing and a reproduction.

in one case replication wasn't able to read WAL (hard to reproduce)

A more accurate description would be: 'if a slot is lagging, replication might fail on compute start until the whole tail is downloaded'. We merged a cap on the maximum allowed lag, but to really fix it we need to bring on-demand WAL download from safekeepers to logical walsenders. We recently merged the core patch, https://github.com/neondatabase/neon/pull/5948, but using it in logical walsenders is a separate step. It shouldn't be hard, but I haven't started on it.
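
To make "lagging" concrete, here is a minimal sketch of measuring how much WAL a slot retains and comparing it against a cap. It is not the code merged in Neon: the function names and the `max_lag_bytes` parameter are hypothetical, and the only fact relied on is the standard PostgreSQL LSN text format (two hexadecimal numbers separated by `/`, the high and low 32 bits of a byte position). In SQL, the same difference can be read with `pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)` over `pg_replication_slots`.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def slot_lag_bytes(current_lsn: str, restart_lsn: str) -> int:
    """WAL bytes between the slot's restart_lsn and the current LSN (never negative)."""
    return max(0, parse_lsn(current_lsn) - parse_lsn(restart_lsn))


def exceeds_cap(current_lsn: str, restart_lsn: str, max_lag_bytes: int) -> bool:
    """True if the slot retains more WAL than the (hypothetical) cap allows."""
    return slot_lag_bytes(current_lsn, restart_lsn) > max_lag_bytes
```

For example, `slot_lag_bytes("0/2000000", "0/1000000")` gives 16 MiB of retained WAL, which a monitoring job could compare against whatever cap is configured.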

arssher avatar Jan 02 '24 18:01 arssher

@arssher will you work on on-demand WAL download in walsenders? Is it a part of this epic [will it block announcing GA for logical replication]?

@kelvich

slot may disappear on restart (hard to reproduce, only occurred once)

I think nobody has been able to reproduce it so far. Reasonable question: is it part of the epic's scope?

vadim2404 avatar Jan 12 '24 09:01 vadim2404

Renamed to "outbound" logical replication. When this Epic was started, it was "only" logical replication, but now it's two different types of replication.

andreasscherbaum avatar Feb 27 '24 16:02 andreasscherbaum

Discussion with Stas: improve pageserver performance first

andreasscherbaum avatar Mar 04 '24 09:03 andreasscherbaum

This week:

  • [x] Sasha: Finish the aux v2 rollout (last batch tomorrow morning)

ololobus avatar Jul 09 '24 16:07 ololobus

This week:

  • [x] Waiting for the new compute image rollout

ololobus avatar Jul 16 '24 15:07 ololobus