courtlistener icon indicating copy to clipboard operation
courtlistener copied to clipboard

Upgrade replication customers to postgres 16 (Deadline Feb. 29)

Open mlissner opened this issue 1 year ago • 35 comments

Seven of our replication customers are on posgres 11, which AWS plans to deprecate. I have communicated with all of them and they are all on board with a switch to postgres 16 at the proxy level.

Here's the email I sent:

Hello, I'm writing because your organization subscribes to Free Law Project's replicated database service, and the database version that your organization is using will soon no longer be supported at our host, AWS.

We need to make a decision about what to do:

  1. Recommended: Swap in a new replica.

Unfortunately, it is not possible to upgrade a Postgresql replication subscriber in place. Therefore, the best thing Free Law Project can do is create a fresh replica for you using the latest version of Postgresql and to then swap this new database with your current one.

This will be seamless for you and you will immediately be upgraded.

  1. Possible to buy time: Do nothing, keep your old DB, and pay more forever.

If you have made customizations to your replica, option 2 is to use and pay for AWS's long term support service at an additional monthly cost of approximately $300 for years 1 and 2, and $600 for year three. This will allow you to avoid upgrading for up to three years and means there is no need to take any action until then.

Please let me know what would work best for your organization.

I highly recommend most organizations choose option 1. The latest version of Postgresql includes numerous performance and feature upgrades. Furthermore, frankly, if we don't have anybody choosing option 2, our maintenance will be simpler and more reliable for all. If you are considering option 2, let's discuss — there is likely time to achieve option 1 successfully.

Timeline Your version of Postgresql will be deprecated by AWS on February 29th. If we do not have a decision by that time, you will begin paying the long term support fees.

I hope to have a plan of action for this from all customers within one week from today, and aim to begin upgrading Option 1 customers shortly after that.

Thank you, as always, for being our supporters, customers, and advocates. We're fortunate to be able to work with you all. I look forward to any questions or concerns you may have.

The simple way to do this will be to:

  1. Create seven new replicas in postgresql 16
  2. Swap to the new replicas at the proxy level
  3. Terminate the old replicas.

But, there's a clever way too that fixes https://github.com/freelawproject/courtlistener/pull/3546 and then https://github.com/freelawproject/courtlistener/pull/3558 along the way:

  1. Create the new RDS instances.
  2. Set up their schemas
  3. Do the two PRs across all replicas
  4. Subscribe the new replicas
  5. Once sync'ed, swap to the new replicas at the proxy level
  6. Wait a week, say
  7. Terminate the old replicas

The advantage of this approach is that step 3 won't have to delete a huge index nor create a new huge index. Instead, it will simply do that work on an empty database. Bonus.

mlissner avatar Jan 16 '24 22:01 mlissner

Current plan:

  • Upgrade our two v16 customers by migrating their systems.
  • Migrate prod and cl-replica
  • Make new instances in v16 for our seven v11 customers.
  • Swap those in.

mlissner avatar Feb 14 '24 01:02 mlissner

Prod, cl-replica, and our two pg16 servers have the index fixes.

Next step is to make the new instances for our clients.

mlissner avatar Feb 14 '24 06:02 mlissner

Hey @mlissner

Roughly this are the steps I was able to come up with. Let me know your thoughts.

  1. Launch the new instance for ferret. Will try the Pulumi workflow to create it, it will probably won't work on the first run. New instance will have the identifier ferret-16 while syncing. Password should be automatically set and stored in Secrets Manager.

  2. Create the schema: I am assuming we are taking the schema from ferret directly. If not we could take the tables from the publication_tables like this maybe:

TABLES=$( psql --host $DB_REPLICA_HOST -p 5432 -d courtlistener -U django  --tuples-only -c \
        "SELECT CONCAT(schemaname, '.', tablename) FROM pg_publication_tables WHERE pubname='opendata_ferret';" \
        | xargs -I{} echo -n " -t {}"
        )

Then pull the schema using:

 pg_dump --host $X_HOST \
   --username django \
   --create \
   --schema-only \
   --no-privileges \
   --no-publications \
   --no-subscriptions courtlistener \
   $TABLES \
   --password | grep -v -i 'trigger'

Finally import it into the new ferret replica:

psql --host ferret-16.xxxxxx.us-west-2.rds.amazonaws.com \
       --port=5432 --user django --dbname postgres < ferret_schema.2024-04-05.sql
  1. Create the publication on replica.

In this case, based on the publications I found on the replica, the new publication could be called ferret_publication since the current one is opendata_ferret.

Will create it using the following stating the same tables as the current publication has:

CREATE PUBLICATION ferret_publication FOR TABLE ONLY ...
  1. Role.

    Could it be possible to re-use the current role that ferret uses ?

    If I not, I am not sure what would a proper new name would be.

    Also not sure where those passwords are stored.

  2. Create subscription on ferrer-16.

CREATE SUBSCRIPTION ferret_subscription CONNECTION 'host=cl-replica.xxx.amazonaws.com port=5432 user=ferret password=XXX dbname=courtlistener' PUBLICATION ferret_publication;

This will get the replication started.

When it's synced up and we are ready to do the switch, we can configure haproxy to use both backends to minimize new connections from failing during dns update (ferret.xxx and ferret-16.xxx), update the database identifiers and then remove the ferret-16 backend.

Or we can just go ahead and do the switch by simply modifying the identifiers, open connections should remain connected.

I will be documenting this process here: Upgrading/Migrating a replica

blancoramiro avatar Apr 05 '24 17:04 blancoramiro

Yeah, that sounds basically right. I assume this is all based on our usual replication set up?

A few bits of feedback:

I am assuming we are taking the schema from ferret directly.

I'd actually use an export of the schema from prod, as described on the wiki. It's the ground truth.

Could it be possible to re-use the current role that ferret uses ?

Yes. I have these passwords in my password manager. I imagine there's a better way.

we can just go ahead and do the switch by simply modifying the identifiers, open connections should remain connected.

This seems easier and fine to me. I'd just suggest we do it during a maintenance window.

mlissner avatar Apr 05 '24 17:04 mlissner

Sounds good!

I'll go ahead and start the new instance. Will basically use the steps on the Setting up Replication for a new Customer page.

I will need the ferret role's password, we can use the secret manager. We can a secret called cl_replicas and store the passwords there. Each secret costs 0.40/month.

Alternative we can search for a password/secrets sharing tool. Like Dashlane, LastPass or similar. Not a must and we will be using the secret manager either way.

I checked the permissions of all users and none have any secretmanager: permission set.

blancoramiro avatar Apr 05 '24 18:04 blancoramiro

Great. Let me know when you need that password and I'll send it along.

mlissner avatar Apr 05 '24 20:04 mlissner

Just created the instance and imported the schema from the prod db. I am about to create the publication ferret_publication, just want to confirm that these are the correct tables to add to the publication:

CREATE PUBLICATION ferret_publication FOR TABLE ONLY
    --Audio
    audio_audio, audio_audio_panel, 
    --Disclosures
    disclosures_agreement, disclosures_debt, disclosures_financialdisclosure, disclosures_gift, disclosures_investment, disclosures_noninvestmentincome, disclosures_position, disclosures_reimbursement, disclosures_spouseincome,
    --People
    people_db_abarating, people_db_attorney, people_db_attorneyorganization, people_db_attorneyorganizationassociation, people_db_criminalcomplaint, people_db_criminalcount, people_db_education, people_db_party, people_db_partytype, people_db_person, people_db_person_race, people_db_politicalaffiliation, people_db_position, people_db_race, people_db_role, people_db_school, people_db_source,
    --FJC
    recap_fjcintegrateddatabase,
    --Search
    search_bankruptcyinformation, search_citation, search_claim, search_claim_tags, search_claimhistory, search_court, search_court_appeals_to, search_courthouse, search_docket, search_docket_panel, search_docket_tags, search_docketentry, search_docketentry_tags, search_opinion, search_opinion_joined_by, search_opinioncluster, search_opinioncluster_non_participating_judges, search_opinioncluster_panel, search_opinionscited, search_opinionscitedbyrecapdocument, search_originatingcourtinformation, search_parenthetical, search_parentheticalgroup, search_recapdocument, search_recapdocument_tags,  search_tag
    ;

And right after that I'll create the subscription on the new instance and I will need the ferret role's password.

ty!

blancoramiro avatar Apr 05 '24 21:04 blancoramiro

That looks about right. Here's what I have from my notes (the wiki can be a bit outdated):

CREATE PUBLICATION cinnamon_publication FOR TABLE ONLY
    --Audio
    audio_audio, audio_audio_panel, 
    --Disclosures
    disclosures_agreement, disclosures_debt, disclosures_financialdisclosure, disclosures_gift, disclosures_investment, disclosures_noninvestmentincome, disclosures_position, disclosures_reimbursement, disclosures_spouseincome,
    --People
    people_db_abarating, people_db_attorney, people_db_attorneyorganization, people_db_attorneyorganizationassociation, people_db_criminalcomplaint, people_db_criminalcount, people_db_education, people_db_party, people_db_partytype, people_db_person, people_db_person_race, people_db_politicalaffiliation, people_db_position, people_db_race, people_db_role, people_db_school, people_db_source,
    --FJC
    recap_fjcintegrateddatabase,
    --Search
    search_bankruptcyinformation, search_citation, search_claim, search_claim_tags, search_claimhistory, search_court, search_court_appeals_to, search_courthouse, search_docket, search_docket_panel, search_docket_tags, search_docketentry, search_docketentry_tags, search_opinion, search_opinion_joined_by, search_opinioncluster, search_opinioncluster_non_participating_judges, search_opinioncluster_panel, search_opinionscited, search_opinionscitedbyrecapdocument, search_originatingcourtinformation, search_parenthetical, search_parentheticalgroup, search_recapdocument, search_recapdocument_tags,  search_tag
    ;

mlissner avatar Apr 05 '24 21:04 mlissner

Did you see that we also need to make alarms in CloudWatch for new replicas?

mlissner avatar Apr 05 '24 21:04 mlissner

Just created the alarm:

Image

And added it to the composite:

Image

blancoramiro avatar Apr 05 '24 21:04 blancoramiro

What's pending for now is setting up the publication and the subscription using the ferret role I think.

blancoramiro avatar Apr 05 '24 21:04 blancoramiro

Password sent via Whatsapp.

mlissner avatar Apr 05 '24 21:04 mlissner

Just kicked the replication off had to add the public ip of the new instance to the rds-launch-wizard sg.

I haven't seen any errors on ferret-16's logs so far, the new slots are already there: lag.

Image

Do you use any tool to check the integrity of the replicas ? With databases this big probably not but I might just ask.

blancoramiro avatar Apr 05 '24 22:04 blancoramiro

I usually check the logs on both sides, monitor the replication lag, and check the network throughput. If you see networking going way up, you know the data is flowing. If there are no errors in the logs, that's good. If the replication lag has a few new slots that makes sense!

mlissner avatar Apr 05 '24 23:04 mlissner

Got it! Well will keep an eye on the logs and metrics. So far so good.

Image

blancoramiro avatar Apr 05 '24 23:04 blancoramiro

Saw some errors in the log from the new replica:

2024-04-07 18:37:53 UTC::@:[8830]:ERROR: could not start initial contents copy for table "public.search_opinionscitedbyrecapdocument": ERROR: permission denied for table search_opinionscitedbyrecapdocument

Granted SELECT access to those tables to ferret role:

public.search_court_appeals_to public.search_opinionscitedbyrecapdocument public.search_parenthetical public.search_parentheticalgroup

blancoramiro avatar Apr 07 '24 18:04 blancoramiro

Perfect.

mlissner avatar Apr 08 '24 19:04 mlissner

Hey @mlissner . I am starting the new replica for wsj2, will call it wsj2-16.

Not sure if you are OK using this suffix for the new replicas. Again I will use the same template as tumeric.

Will use these values: Publication name: wsj_publication Will use role wsj (Will need to ask you for the password and will make sure the role has access to all the tables in the subscription)

Let me know if this makes sense to you. Thank you!

blancoramiro avatar Apr 10 '24 21:04 blancoramiro

That all sounds right to me. I'd go with wsj3 for the DB name though. We know the postgresql version. What we don't know is which iteration of the DB we're working with when we glance at it.

mlissner avatar Apr 10 '24 21:04 mlissner

Thank you @mlissner . Just started the replication for wsj3. I will keep an eye on the replication and let you know. I created the alarm and added it to the composite.

blancoramiro avatar Apr 11 '24 01:04 blancoramiro

Hello @mlissner just wanted to let you know that wsj3 seems to be all synced up.

I also needed to ask you what would you like to do with the old ferret replica ? We made the swap on Monday.

Thank you!

blancoramiro avatar Apr 12 '24 15:04 blancoramiro

Great about wsj3. You can swap that at the proxy and once you've done that I'll let them know it's done and ask if we can tear down wsj2.

For ferret, I just emailed a confirmation with the client.

mlissner avatar Apr 12 '24 15:04 mlissner

Just made the swap. I made sure I was able to log using the same django credentails into the new replica, applied the changes and reloaded HAproxy.

Now wasabi.courtlistener.com connects to wsj3.

Will keep an eye on the sessions and monitoring.

blancoramiro avatar Apr 12 '24 19:04 blancoramiro

Just an update:

  • Created keystone2 RDS instance.
  • Created storage alarm and added it to the composite one.
  • Replication is in place, no errors so far.
  • Deleted ferret replica.

blancoramiro avatar Apr 13 '24 01:04 blancoramiro

Hello @mlissner.

keylime.courtlistener.com is now pointing to keystone2 replica. I was able to connect from the outside using the same django credentials. Will keep an eye on logs and monitoring. Will now start working on wsj2 to shut it down. And will set up the replication for onelegal.

blancoramiro avatar Apr 15 '24 22:04 blancoramiro

Cool. Customer contacted for confirmation of deletion.

mlissner avatar Apr 15 '24 22:04 mlissner

Update from today:

wsj2 was deleted. (Also the publication and it's slots, and also the entry for the composite alarlm and the storage alarm) Just to be sure I doubled accessing wasabi.courtlistener.com using the same credentials and also made sure that replication was working properly.

  • We are still waiting for confirmation to delete keystone
  • onelegal2 is still catching up
  • Pending instance is compass which I expect to start working on today(Wednesday)

blancoramiro avatar Apr 17 '24 02:04 blancoramiro

I'll give keystone another day or two and then bug the client again.

mlissner avatar Apr 17 '24 04:04 mlissner

Hey @mlissner

onelegal is all synced up now. We can do the swap whenever you think it is appropiate.

Just created the PR for the new compass instance. I can merge whenever we feel it's ok to fire that instance.

Thank you!!

blancoramiro avatar Apr 17 '24 22:04 blancoramiro

You can go ahead with the swap for onelegal.

mlissner avatar Apr 17 '24 23:04 mlissner