
WIP - Improved HA stability

timw opened this pull request 3 years ago · 5 comments

What does this PR do?

This is an in-progress set of changes we're working on to increase the stability of OrientDB in HA/distributed deployments. We're currently testing these fixes in pre-production and then moving them to production, but are opening them up now for discussion to see if there are potentially better ways to fix these issues, and allow broader testing to see if they introduce other issues (as we don't exhaustively test all features/APIs).

Motivation

We have encountered multiple outages in production due to various stability issues related to distributed operation. These issues are numerous and somewhat interacting, and required the development of a dedicated stress test suite to replicate in development.

A full list of the unique issues encountered during testing is too long to enumerate, covering many (many) different exceptions, but a general overview of the core issues follows:

  • there is no way to remotely (via API) verify the HA status of the servers/databases in the community edition. This is remedied by a small update to the HA STATUS -servers query to effectively do what the enterprise agent APIs can achieve.
  • database opening is not coordinated with the distributed plugin status. If there is client transaction load on the database during node startup, some internal structures (notably the shared context) can be initialised in the window between server startup and the distributed plugin completing its startup, resulting in inconsistent initialisation and subsequent transaction failures.
  • database opening does not effectively check that the distributed server/database status is online, resulting in access to inconsistent/uninitialised state and subsequent exceptions during transactions.
  • remote full database install sometimes fails with corrupted zip streams. We suspect this is related to a stream copying bug, which is patched and seems to have fixed that issue.
  • construction of the DistributedDatabaseImpl was not thread-safe, and allowed uninitialised object structures to be accessed, causing NPEs at runtime.
  • database creation suffers from similar issues to database opening when load exists on the server, so checks have been added to prevent database creation before the HA status is stable.
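The DistributedDatabaseImpl construction race described above is essentially an unsafe-publication bug. A minimal sketch of the safe pattern, using hypothetical class names rather than OrientDB's actual code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the real registry: computeIfAbsent both runs
// the factory at most once per key and only publishes the instance after
// its constructor has completed, so no thread can observe a partially
// initialised object.
class DistributedDatabaseRegistry {
    private final ConcurrentMap<String, DatabaseHandle> databases = new ConcurrentHashMap<>();

    DatabaseHandle getOrCreate(String name) {
        return databases.computeIfAbsent(name, DatabaseHandle::new);
    }
}

// Hypothetical value class standing in for DistributedDatabaseImpl.
class DatabaseHandle {
    final String name;

    DatabaseHandle(String name) {
        // ...multi-step initialisation would happen here, before publication...
        this.name = name;
    }
}
```

Concurrent callers then always see either no instance or a fully constructed one, never an intermediate state.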

Related issues

There's a patch in this PR that artificially widens the distributed plugin startup time - we found that this allows easier reproduction of the production issues we observe. The cause of this is that the distributed plugin makes TCP connections to remote nodes during startup, which in our production case is cross AZ in AWS and thus has higher latencies than a local data-centre, which increases the window in which some of the startup issues occur.

During testing/resolving these issues, multiple enhancements were made to logging/tracing:

  • a facility to name and trace async tasks has been added, allowing failed background tasks to be identified precisely in logs to make debugging error conditions much easier.
  • the log format used when an exception is logged is corrected - currently it omits the leading newline and the formatting, which causes error messages to be smashed into the preceding log message.
  • various individual cases of incorrect logging of errors were fixed.
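The task naming/tracing facility can be sketched roughly as follows (hypothetical class and method names, not the PR's actual implementation): capture the submission site eagerly, then attach it to any failure raised on the worker thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal sketch of a tracing executor: each task carries a name and the
// stack trace of its submission site, so a failure logged from the worker
// thread can be tied back to the code that scheduled it.
class TracingExecutor {
    private final ExecutorService delegate = Executors.newSingleThreadExecutor();

    Future<?> submit(String taskName, Runnable task) {
        // Capture the caller's stack at submission time, not at failure time.
        Exception submissionSite = new Exception("submitted as '" + taskName + "'");
        return delegate.submit(() -> {
            try {
                task.run();
            } catch (RuntimeException | Error t) {
                t.addSuppressed(submissionSite); // preserve the scheduling context
                System.err.println("Async task '" + taskName + "' failed: " + t);
                throw t;
            }
        });
    }

    void shutdown() { delegate.shutdown(); }
}
```

Without something like the suppressed submission-site exception, a stack trace from a pooled worker thread usually shows only executor internals, which is what makes anonymous background failures so hard to attribute.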

The stress test tool we have developed is now available at https://github.com/indexity-io/orientdb-stress. It's open source licensed, but we've kept it distinct from OrientDB as it needs to run across multiple versions/patches and that doesn't work well in-tree. It currently requires some of the patches in this branch to run successfully.

Additional Notes

There is an additional class of issues that this branch does not currently fix, which is related to storage closures during database installation while in-flight transactions have already opened the database. This causes transaction errors due to closed in-memory and on-disk structures, and often leads to cascading database installs, failed updates and (in rarer situations) lost updates. We have some fixes designed for this issue, but are debating whether it's worth developing them further as they are not observed with the enterprise agent deployed (the full database sync in the enterprise edition does not close storage for backup/remote install, and so does not encounter these problems).

I've ported these changes to 3.2 and tested them to the point that I'm fairly confident they can be reproduced in that branch and solve the same issues. 3.2 already had some changes that 3.1 did not that try to address some of these issues, but they fail under stress testing without the fixes in this branch. I've paused that work for now until the 3.1 changes can be discussed and made stable. There are additional issues in 3.2 that will need to be addressed (creating databases currently fails in a distributed cluster soon after startup) that cause problems for the stress test tool, and 3.2 also suffers from the issues with database storage closure under load (in particular, I've observed lost updates on some occasions).

Checklist

[x] I have run the build using the mvn clean package command
[x] My unit tests cover both failure and success scenarios

timw avatar Aug 14 '22 23:08 timw

@timw if OrientDB supported database aliases, full database restoration would be made much easier:

  1. Restore the desired backup (using the enterprise plugin) into a temporary database, with a temporary alias
  2. Freeze the current database
  3. Swap the database alias between the restored and the current database
  4. Release the database
  5. Drop the previous database as soon as the backup restoration is confirmed

This means that a database could be accessed through its alias. Under the enterprise plugin, a command such as ALTER DATABASE set ALIAS=<'alias name'>; would define a database alias. Otherwise the alias would default to the name of the database, which would guarantee backward compatibility.

Swapping the database alias between databases would be the most effective way to restore a database, as the temporary database could be tested before being accepted or revoked.
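Sketched as a console session, the proposed flow might look like this (the ALIAS attribute and the alias-swap step are hypothetical, proposed syntax; FREEZE DATABASE and RELEASE DATABASE are existing commands):

```sql
-- 1. Restore the backup into a temporary database (enterprise plugin),
--    then give it a temporary alias (proposed syntax, not implemented today)
ALTER DATABASE set ALIAS='mydb_restored';
-- 2. Freeze the current database
FREEZE DATABASE;
-- 3. Swap the aliases so 'mydb' now points at the restored copy (hypothetical)
ALTER DATABASE set ALIAS='mydb';
-- 4. Release the database so clients resume against the restored copy
RELEASE DATABASE;
-- 5. Drop the previous database once the restore is confirmed
DROP DATABASE;
```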

In distributed mode, the restored database should be propagated across the nodes.

What are your thoughts about this? Currently, with the Enterprise edition plugin, we need to restart the application server, which means that users would lose information and the application would be unavailable for a period of time.

Another thing: we have been struggling to run the database under a load-balanced (round-robin) configuration. Have you experienced the same issue?

Kind regards,

Joao

joao-carimo avatar Aug 16 '22 16:08 joao-carimo

Hi @timw,

Your observations and fixes are spot on, and this work looks amazing. There are additional refactors we are slowly working on on our side to make everything more stable, such as unifying the ODistributedDatabaseImpl lifecycle with the OSharedContext lifecycle, which should help ensure that opening/closing/installing a database works in a more reliable way!

Regarding this pull request specifically, it has a lot of changes inside, and it will probably be better to apply some of them one by one to reduce risk and complexity, so if you can, could you please split this PR into multiple PRs?

I will list some commits from this PR that could be merged straight away as independent PRs:

  1: https://github.com/orientechnologies/orientdb/pull/9854/commits/79e6b4df4b709a2d060919a5d38f8dd4011ae374
  2: https://github.com/orientechnologies/orientdb/pull/9854/commits/6d1c5c2433ca5f708425c864a9c0de56e57fe3c7
     https://github.com/orientechnologies/orientdb/pull/9854/commits/a09415f6d7944351a189e2fa70615b6bbfff1799
  3: https://github.com/orientechnologies/orientdb/pull/9854/commits/4116f4230cb511c151af0e0524c061fa89173d44
  4: https://github.com/orientechnologies/orientdb/pull/9854/commits/2abb66a165e7a90efc2f842e0163e3e1d33d33ac
  5: https://github.com/orientechnologies/orientdb/pull/9854/commits/19c3ca7e0960d589461308b138b73c4deacc1e43
  6: https://github.com/orientechnologies/orientdb/pull/9854/commits/3860d384bba23992f1845d105ef74f8a3329492f

If you could open a PR for each of these commits, I could merge them straight away. For the rest of the commits that require additional checking and testing, we can proceed later with review and merges; after merging some of the listed commits, this PR will also become smaller.

Regards

tglman avatar Aug 18 '22 14:08 tglman

Happy to do some separate PRs for the independent bits (part of the reason for this WIP was to have that discussion). Part of the rationale for them was the overall context of improving the HA stability, so it's good to see some of them in-sequence.

timw avatar Aug 19 '22 03:08 timw

@tglman - I've created the separate PRs now. If/when those get merged, I'll rebase this PR.

timw avatar Aug 23 '22 23:08 timw

I've rebased on the head of 3.1.x, with the separately merged commits removed.

timw avatar Aug 24 '22 21:08 timw

Hi,

I saw that you updated this PR with new commits, so I went through it again. Here is a list of commits that I would accept straight away as independent PRs:

#1 https://github.com/orientechnologies/orientdb/pull/9854/commits/1694e6fcbafd726b1815db39eada928349358ed5

#2 view fixes https://github.com/orientechnologies/orientdb/pull/9854/commits/3c533200732e879e2a17200675934e383f34ff7c https://github.com/orientechnologies/orientdb/pull/9854/commits/ede8fbf52719f497cb8199296a7351372967b5fb

#3 https://github.com/orientechnologies/orientdb/pull/9854/commits/3d6d5e5853f2f6649eb2fdddfa28accbd2b34d87

#4 https://github.com/orientechnologies/orientdb/pull/9854/commits/99a0bfb0bed25109ebf031ce70a3cd161bdbbb8c

#5 https://github.com/orientechnologies/orientdb/pull/9854/commits/58991019478c1db9427dbc68db8cbdc2cda51bf3

#6 https://github.com/orientechnologies/orientdb/pull/9854/commits/4ec87b1c3fe9162ee9678d58af569ab824b76fd5

Could you send these independently?

Regards

tglman avatar Nov 14 '22 12:11 tglman

@tglman - sorry, I didn't notice your last update, and got distracted with other work for a while so only just checked it today. I've created the requested PRs (I note 3c533200732e879e2a17200675934e383f34ff7c has already been cherry picked).

It would also be good to get some guidance on what to do with the remaining work in this PR, once those other minor items are removed.

We've been running this in production for some months now, and have had no issues or outages, so are pretty confident on stability. 3.1 is a dormant branch though, so I'd like to start porting the fixes to 3.2, but there are some changes made in 3.2 around the database opening code paths that will require some rework.

timw avatar Mar 27 '23 09:03 timw

Hi,

Thank you for creating the specific PRs. I merged some of them and also ported the changes to 3.2.x (all the previously merged changes have already been ported to 3.2.x as well).

I see that 3 main sets of changes are left in this PR:

  • Introduction of new executors classes
  • Refactors around views scheduling + execution
  • Changes to the initialisation and flow of the plugins, specifically the distributed one

For the executors, what is the scope of this specialised executor? Looking at the code, it seems that it adds additional logging to correlate an error with its source caller, am I getting that right? This is cool, but it is not free, so I'm in favour of it provided the tracing can be turned off. I'm also happy to have this in 3.2.x, even though in that version all the executors are in the OrientDB context and should be shut down with it, and all the Orient global executors have been removed; I guess it is not a problem to use this tracing executor there as well.

For the view changes, specifically in 3.1.x, I think we could merge them, but in 3.2.x there has been a big refactor of that logic that should already resolve the problem these changes are fixing, so I think there is no need to port the view changes to 3.2.x.

For the plugin loading change, I see it tries to make sure that detection of the distributed plugin happens earlier via the "distributedPluginEnabled" flag. I understand why this is done; I have had more than a few problems with database creation during initialisation and with detection of the distributed environment at startup. This could be OK in 3.1.x, but here too I think we managed to solve the problem with a more structural approach in 3.2.x, so I do not know if this change is worth porting.

Thank you for this work anyway, it is impressively good!

tglman avatar Apr 04 '23 11:04 tglman

I'll try to find some time to fix up the tracing executors soon - as you note we can avoid the callable construction when tracing is disabled (which is already detected in the prepareTrace).
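The fast path being discussed might look roughly like this (a sketch with hypothetical names; the actual prepareTrace check in the PR is not shown):

```java
import java.util.concurrent.Callable;

// Sketch of the disable fast path: when tracing is off, skip building the
// wrapper (and the stack-trace capture it implies) entirely, so the hot
// path allocates nothing extra.
class TraceWrapping {
    static volatile boolean tracingEnabled = false;

    static <V> Callable<V> wrap(String name, Callable<V> task) {
        if (!tracingEnabled) {
            return task; // no allocation, no stack capture
        }
        Exception submissionSite = new Exception("submitted as '" + name + "'");
        return () -> {
            try {
                return task.call();
            } catch (Exception e) {
                e.addSuppressed(submissionSite); // attach the scheduling context
                throw e;
            }
        };
    }
}
```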

For the view and lifecycle changes, I'll need to re-test against the current 3.2 head to be sure (I see a lot of changes in the database/session/distributed area that I'll have to understand as well), but at the time of creating this PR my load test app could reliably break 3.2 in a lot of the same ways that 3.1 broke.

Given we're pretty stable on our 3.1 fork for now, the best way forward might be to re-do the stress testing on 3.2 and then port the still relevant fixes over to 3.2 and look at that in a separate PR.

timw avatar May 10 '23 09:05 timw

Hi,

I manually ported the executor tracing to 3.2.x and added a global configuration to enable it for debugging purposes; the rest of this PR has already been solved in 3.2.x, so I'm going to close this.

Regards

tglman avatar Jan 22 '24 12:01 tglman

Thanks for that. We will be upgrading to 3.2 mid-2024 I expect, so I’ll repeat the stress testing process in 3.2 then. I think the architectural changes make porting the remaining patches tricky, so it's probably best to start from first principles again and see if there are outstanding/new issues.

timw avatar Jan 22 '24 22:01 timw