featurehub icon indicating copy to clipboard operation
featurehub copied to clipboard

Constraint violation when creating admin group on initial setup (with OAuth2)

Open mouchar opened this issue 2 years ago • 23 comments

Describe the bug Initial setup fails on constraint violation (represented as HTTP error 500 on UI) while trying to create org_admin group. Since the initial setup is not completed, the group is not created, the first user is not added to it and when user logs in, it has absolutely no permissions and no possibility to fix that.

Which area does this issue belong to?

  • [ ] FeatureHub Admin Web app
  • [ ] SDK
  • [ ] SDK examples
  • [ ] Documentation
  • [x] Other (management-repository)

To Reproduce Steps to reproduce the behavior:

  1. deploy FeatureHub with the following settings (only relevant settings are shown for brevity):
  • oauth2.providers.google.* (Google oauth2 credentials)
  • oauth2.providers: oauth2-google (other oauth2 providers are possible also affected)
  1. Navigate to MR UI
  2. Fill in some Organization and Portfolio
  3. Form is submitted and page /mr-api/initialize returns 500 Internal Server Error
  • auth.disable-login: "true" (only oauth2 login is allowed, even for the first user)

Expected behavior Featurehub can be successfully initialized.

Screenshots

  • With auth.disable-login: true:
io.ebean.DataIntegrityException: Error[ERROR: null value in column "fk_person_who_created" of relation "fh_group" violates not-null constraint   Detail: Failing row contains (71bdb0ee-244a-41ee-b5c9-023987542c16, null, null, null, t, d45fe326-a6be-42f1-8364-a3626182c5f0, org_admin, 2022-06-29 14:41:54.396+00, 2022-06-29 14:41:54.396+00, 1).]
	io.ebean.config.dbplatform.SqlCodeTranslator.translate(SqlCodeTranslator.java:74) ~[ebean-api-12.15.0.jar:?]
	io.ebean.config.dbplatform.DatabasePlatform.translate(DatabasePlatform.java:245) ~[ebean-api-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlBeanPersister.execute(DmlBeanPersister.java:77) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlBeanPersister.insert(DmlBeanPersister.java:46) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeInsert(PersistRequestBean.java:1221) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeNow(PersistRequestBean.java:743) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeNoBatch(PersistRequestBean.java:791) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeOrQueue(PersistRequestBean.java:782) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.DefaultPersister.insert(DefaultPersister.java:470) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.DefaultPersister.insert(DefaultPersister.java:419) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.DefaultPersister.save(DefaultPersister.java:403) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.DefaultServer.save(DefaultServer.java:1635) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.DefaultServer.save(DefaultServer.java:1627) ~[ebean-core-12.15.0.jar:?]
	io.featurehub.db.services.GroupSqlApi.saveGroup(GroupSqlApi.java:504) ~[mr-db-sql-1.1-SNAPSHOT.jar:?]
	io.featurehub.db.services.GroupSqlApi.createOrgAdminGroup(GroupSqlApi.java:199) ~[mr-db-sql-1.1-SNAPSHOT.jar:?]
	io.featurehub.mr.resources.oauth2.OAuth2MRAdapter.createUser(OAuth2MRAdapter.java:127) ~[mr-1.1-SNAPSHOT.jar:?]
	io.featurehub.mr.resources.oauth2.OAuth2MRAdapter.successfulCompletion(OAuth2MRAdapter.java:82) ~[mr-1.1-SNAPSHOT.jar:?]
	io.featurehub.web.security.oauth.OauthResource.token(OauthResource.kt:62) ~[security-oauth-1.1-SNAPSHOT.jar:?]
	jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
	jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
	java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
	org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:176) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:475) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:397) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:255) [jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors.process(Errors.java:292) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors.process(Errors.java:274) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors.process(Errors.java:244) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:234) [jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684) [jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:356) [jersey-container-grizzly2-http-3.0.3.jar:?]
	io.featurehub.jersey.DelegatingHandler.service(DelegatingHandler.kt:76) [common-web-1.1-SNAPSHOT.jar:?]
	org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [grizzly-http-server-3.0.1.jar:3.0.1]
	org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:535) [grizzly-framework-3.0.1.jar:3.0.1]
	org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:515) [grizzly-framework-3.0.1.jar:3.0.1]
	java.lang.Thread.run(Unknown Source) [?:?]
	Caused by: org.postgresql.util.PSQLException: ERROR: null value in column "fk_person_who_created" of relation "fh_group" violates not-null constraint
  Detail: Failing row contains (71bdb0ee-244a-41ee-b5c9-023987542c16, null, null, null, t, d45fe326-a6be-42f1-8364-a3626182c5f0, org_admin, 2022-06-29 14:41:54.396+00, 2022-06-29 14:41:54.396+00, 1).
	org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2365) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:355) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:490) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgStatement.execute(PgStatement.java:408) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:166) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:134) ~[postgresql-42.3.3.jar:42.3.3]
	io.ebean.datasource.pool.ExtendedPreparedStatement.executeUpdate(ExtendedPreparedStatement.java:130) ~[ebean-datasource-7.3.jar:?]
	io.ebeaninternal.server.type.DataBind.executeUpdate(DataBind.java:100) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.InsertHandler.execute(InsertHandler.java:106) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlHandler.executeNoBatch(DmlHandler.java:87) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlBeanPersister.execute(DmlBeanPersister.java:69) ~[ebean-core-12.15.0.jar:?]
  • With auth.disable-login: false:
io.ebean.DataIntegrityException: Error[ERROR: null value in column "fk_person_who_created" of relation "fh_group" violates not-null constraint   Detail: Failing row contains (9dfc43a6-329b-4f23-916e-02983867b03d, null, null, null, t, 75bd4fe5-457b-4998-ba6e-053c71ae7e49, org_admin, 2022-06-29 15:53:43.035+00, 2022-06-29 15:53:43.035+00, 1).]
	io.ebean.config.dbplatform.SqlCodeTranslator.translate(SqlCodeTranslator.java:74) ~[ebean-api-12.15.0.jar:?]
	io.ebean.config.dbplatform.DatabasePlatform.translate(DatabasePlatform.java:245) ~[ebean-api-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlBeanPersister.execute(DmlBeanPersister.java:77) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlBeanPersister.insert(DmlBeanPersister.java:46) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeInsert(PersistRequestBean.java:1221) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeNow(PersistRequestBean.java:743) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeNoBatch(PersistRequestBean.java:791) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.PersistRequestBean.executeOrQueue(PersistRequestBean.java:782) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.DefaultPersister.insert(DefaultPersister.java:470) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.DefaultPersister.insert(DefaultPersister.java:419) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.DefaultPersister.save(DefaultPersister.java:403) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.DefaultServer.save(DefaultServer.java:1635) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.core.DefaultServer.save(DefaultServer.java:1627) ~[ebean-core-12.15.0.jar:?]
	io.featurehub.db.services.GroupSqlApi.saveGroup(GroupSqlApi.java:504) ~[mr-db-sql-1.1-SNAPSHOT.jar:?]
	io.featurehub.db.services.GroupSqlApi.createOrgAdminGroup(GroupSqlApi.java:199) ~[mr-db-sql-1.1-SNAPSHOT.jar:?]
	io.featurehub.mr.resources.SetupResource.setupSiteAdmin(SetupResource.java:176) ~[mr-1.1-SNAPSHOT.jar:?]
	io.featurehub.mr.api.SetupServiceDelegator.setupSiteAdmin(SetupServiceDelegator.java:28) ~[mr-api-java-server-1.1-SNAPSHOT.jar:?]
	jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
	jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
	java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
	org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:124) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:167) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:79) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:475) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:397) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) ~[jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:255) [jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors.process(Errors.java:292) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors.process(Errors.java:274) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.internal.Errors.process(Errors.java:244) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [jersey-common-3.0.3.jar:?]
	org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:234) [jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:684) [jersey-server-3.0.3.jar:?]
	org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:356) [jersey-container-grizzly2-http-3.0.3.jar:?]
	io.featurehub.jersey.DelegatingHandler.service(DelegatingHandler.kt:76) [common-web-1.1-SNAPSHOT.jar:?]
	org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [grizzly-http-server-3.0.1.jar:3.0.1]
	org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:535) [grizzly-framework-3.0.1.jar:3.0.1]
	org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:515) [grizzly-framework-3.0.1.jar:3.0.1]
	java.lang.Thread.run(Unknown Source) [?:?]
	Caused by: org.postgresql.util.PSQLException: ERROR: null value in column "fk_person_who_created" of relation "fh_group" violates not-null constraint
  Detail: Failing row contains (9dfc43a6-329b-4f23-916e-02983867b03d, null, null, null, t, 75bd4fe5-457b-4998-ba6e-053c71ae7e49, org_admin, 2022-06-29 15:53:43.035+00, 2022-06-29 15:53:43.035+00, 1).
	org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2365) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:355) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:490) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgStatement.execute(PgStatement.java:408) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:166) ~[postgresql-42.3.3.jar:42.3.3]
	org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:134) ~[postgresql-42.3.3.jar:42.3.3]
	io.ebean.datasource.pool.ExtendedPreparedStatement.executeUpdate(ExtendedPreparedStatement.java:130) ~[ebean-datasource-7.3.jar:?]
	io.ebeaninternal.server.type.DataBind.executeUpdate(DataBind.java:100) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.InsertHandler.execute(InsertHandler.java:106) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlHandler.executeNoBatch(DmlHandler.java:87) ~[ebean-core-12.15.0.jar:?]
	io.ebeaninternal.server.persist.dml.DmlBeanPersister.execute(DmlBeanPersister.java:69) ~[ebean-core-12.15.0.jar:?]

Versions

  • FeatureHub version [1.5.7]
  • SDK version [N/A]
  • OS: [k8s 1.21 on Amazon Linux, if that matters]
  • Browser [any]

You can get the version of the FeatureHub container by running docker ps command. Please include the OS and what version of the OS and Docker you're running.

Additional context The value of auth.disable-login doesn't play much role when Oauth2 is configured. When set to false, the setup form asks for username and password, but the user is created without a password anyway and therefore can not be used for local login (because of NPE in io.featurehub.db.password.PasswordSalter.validatePassword(PasswordSalter.java:68)).

Add any other context about the problem here.

mouchar avatar Jun 29 '22 16:06 mouchar

HI there, just to get a handle on what you are doing I'll need to duplicate your setup, we do use OAuth2 with Google ourselves so setting up shouldn't be an issue.

The downside of failing to setup the first time is that you must empty the database schema and start again, you cannot fail and then recover. I think there is a bit in the docs about this, but the way it works is that if auth-disable is true, it will create the organisation/portfolio/etc and then redirect to your OAuth2 provider and that process must complete successfully, otherwise you are left with an unrecoverable system. We haven't really come up with a good solution to this as no-one seems to have raised it before, but I'm not sure if thats the situation you found yourself in first up... I will use the k8s config we have in featurehub-installs and set it up for oauth2 with google and disable login and see how we get on.

We are adding SAML2 with this release so its doubly important that if we have some issue we fix it as 1.5.8 is almost out the door :-)

rvowles avatar Jun 29 '22 21:06 rvowles

I'll try to prepare a minimal reproducer. Instead of embedded postgreql 11 subchart, we are using bitnami/postgresql-ha (with pg 14.2) installed separately into the same namespace. It works fine except for the initial setup. Let's see whether it works with embedded chart or not.

mouchar avatar Jun 30 '22 07:06 mouchar

OK, so the problem is a race condition happening when I use bitnami/postgresql-ha as a backend, instead of bitnami/postgresql chart (without replication and load balancing support).

The postgresql-ha works so that it contains so-called pgpool that sends DML and DDL operations to primary db and distributes DQL operations (SELECTs) evenly among all nodes (both primary and replicas). There's asynchronous replication from the primary node to all replicas.

It means that when the first user is inserted to fh_person, chances are that the subsequent SELECT from fh_person will target some replica before the previous change was propagated and the newly created user is not found. Therefore, the fk_person_who_created is set to NULL, and constraint violation occurs.

As it is actually unfortunate inefficiency of Ebean ORM, that it needs to query database for the entity it just created one millisecond ago, I'm closing this issue.

Until solved in the Ebeam ORM, we can't reliably install Featurehub on an asynchronously replicated database :(

mouchar avatar Jun 30 '22 09:06 mouchar

As it is actually unfortunate inefficiency of Ebean ORM, that it needs to query database for the entity it just created one millisecond ago, I'm closing this issue.

Hi, I'm the creator of Ebean ORM. What are you talking about? Ebean doesn't need to query the database after an insert? Are you referring to GetGeneratedKeys ?

Can you be more specific as to what this scenario is?

rbygrave avatar Jun 30 '22 09:06 rbygrave

hey @rbygrave - you got some watch on the word ebean? 😂

I've got a setup for the HA setup and its mentioned in the docs that you can use HA, but you can't tell ebean to use the read replica because read-only transactions will go to the read replica, and that delay can cause us issues in the UI. As long as you don't configure the HA in ebean itself, MR will operate correctly - but thats on how we use ebean, its not an ebean issue.

I want to keep this open because I am encountering an issue where the front end isn't honouring the redirect it is getting from the backend when running locally and I'm not sure why. I need to figure it out before we release 1.5.8.

rvowles avatar Jun 30 '22 09:06 rvowles

I'm watching feature-hub of course :)

https://github.com/featurehub-io/featurehub/blob/main/backend/mr/src/main/java/io/featurehub/mr/resources/SetupResource.java#L169-L176

My first glance says the Person is created at line 170 ... but the Person isn't returned from that call? To me that looks suspicious as to why person is created there but that method isn't returning the created person [edit: or returning the created persons id value]. (not that I've looked at the code, it's just a first glance).

rbygrave avatar Jun 30 '22 10:06 rbygrave

Ok, so I finally figured out that the pvc on the basic postgres install is set to retain policy - so my setup with local user id was all setup and once i deleted that, the oauth2 stuff worked as expected with the non-HA postgres database.

The OAuth2 stuff never creates a person via setup if it is being used, only the organisation and portfolio exist, the OAuth2 process once it completes recognizes the first person should be the admin and gives them the permissions. So something else is happening that would be great to investigate!

@mouchar do you have a config you could share for kube for the HA variant? You can see I was using this one https://github.com/featurehub-io/featurehub-install/blob/master/docker-compose-options/mr-edge-rest-postgres/docker-compose.yaml for Docker Compose but i'd rather not do that if Postgres has a nice k8s one.

I'd like to fold it into 1.5.8 for you can provide something for me?

rvowles avatar Jun 30 '22 10:06 rvowles

fk_person_who_created (is null on insert)

Always nice to reproduce but it does sound like a race condition with replication lag (so we need replication lag to reproduce). It sounds like there is an insert followed by a select and I'm assuming that is used to populate fk_person_who_created ... and I hence I believe the fix for this is to not execute the query after the insert but instead using a reference bean. That is, if I was pointed to the code that does the insert and code that does the select I could probably suggest the fix and it would possibly be fairly obvious once we see it (which would be to use a reference bean rather than execute the select from person after insert).

Apologies if this isn't helpful, I am just theorizing based only on looking at the stack trace and understanding there is replication lag. If I was pointed to the insert and select code in question though I could confirm.

rbygrave avatar Jun 30 '22 20:06 rbygrave

Hi guys, first of all, thank you for all the effort you're investing in debugging this evasive bug.

I prepared a simple reproducer in this gist. I tried it many times and this error doesn't happen all the time. If I decreased the number of k8s worker nodes, I managed to initialize FeatureHub successfully in ~80% of cases.

I also logged SQL statements, here is the log of the problematic part. I just removed a lot of noise coming from repmgr and k8s probes. The log starts on the first user check:

postgresql-ha-postgresql-2 21:50:48.490 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.490 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.491 GMT [378] LOG:  execute <unnamed>: select t0.id from fh_person t0 limit 1
postgresql-ha-postgresql-2 21:50:48.491 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-0 21:50:48.491 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.492 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.492 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.493 GMT [378] LOG:  execute <unnamed>: select t0.id, t0.password_alg, t0.when_last_authenticated, t0.name, t0.email, t0.password, t0.password_requires_reset, t0.token, t0.token_expiry, t0.when_archived, t0.when_updated, t0.when_created, t0.version, t0.who_changed_id, t0.fk_person_who_created from fh_person a0 where t0.email = $1
postgresql-ha-postgresql-2 21:50:48.493 GMT [378] DETAIL:  parameters: $1 = '[email protected]'
postgresql-ha-postgresql-0 21:50:48.493 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.493 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.495 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.495 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.496 GMT [610] LOG:  execute <unnamed>: insert into fh_person (id, password_alg, when_last_authenticated, name, email, password, password_requires_reset, token, token_expiry, when_archived, when_updated, when_created, version, who_changed_id, fk_person_who_created) values ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15)
postgresql-ha-postgresql-0 21:50:48.496 GMT [610] DETAIL:  parameters: $1 = 'f25e8692-5085-45b5-b5f7-2b3a23245ae0', $2 = 'PBKDF2WithHmacSHA512', $3 = NULL, $4 = 'Robert Moucha', $5 = '[email protected]', $6 = NULL, $7 = 'f', $8 = '656a154f-4b26-4edd-8a8d-45056307bb56', $9 = '2022-07-07 21:50:48.494747+00', $10 = NULL, $11 = '2022-06-30 21:50:48.495+00', $12 = '2022-06-30 21:50:48.495+00', $13 = '1', $14 = NULL, $15 = NULL
postgresql-ha-postgresql-0 21:50:48.497 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.497 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.505 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.505 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.505 GMT [378] LOG:  execute <unnamed>: select t0.id, t0.password_alg, t0.when_last_authenticated, t0.name, t0.email, t0.password, t0.password_requires_reset, t0.token, t0.token_expiry, t0.when_archived, t0.when_updated, t0.when_created, t0.version, t0.who_changed_id, t0.fk_person_who_created from fh_person t0 where t0.email = $1
postgresql-ha-postgresql-2 21:50:48.505 GMT [378] DETAIL:  parameters: $1 = '[email protected]'
postgresql-ha-postgresql-0 21:50:48.506 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.506 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-0 21:50:48.507 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.507 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.508 GMT [378] LOG:  execute S_5: select t0.id, t0.when_archived, t0.name, t0.when_updated, t0.when_created, t0.version, t0.fk_named_cache, t0.group_id from fh_organization t0
postgresql-ha-postgresql-0 21:50:48.508 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.508 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-0 21:50:48.509 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.509 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.509 GMT [378] LOG:  execute <unnamed>: select t0.id, t0.when_archived, t0.is_admin_group, t0.group_name, t0.when_updated, t0.when_created, t0.version, t0.fk_person_who_created, t0.fk_portfolio_id, t0.fk_organization_id from fh_group t0 where t0.is_admin_group = $1 and t0.fk_portfolio_id is null and t0.fk_organization_id = $2
postgresql-ha-postgresql-2 21:50:48.509 GMT [378] DETAIL:  parameters: $1 = 't', $2 = '44297691-749c-4661-b7af-8e8469c8a958'
postgresql-ha-postgresql-0 21:50:48.510 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.510 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.511 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.511 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.511 GMT [378] LOG:  execute <unnamed>: select t0.id, t0.when_archived, t0.name, t0.when_updated, t0.when_created, t0.version, t0.fk_named_cache, t0.group_id from fh_organization t0 where t0.id = $1
postgresql-ha-postgresql-2 21:50:48.511 GMT [378] DETAIL:  parameters: $1 = '44297691-749c-4661-b7af-8e8469c8a958'
postgresql-ha-postgresql-0 21:50:48.512 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.512 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.513 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.513 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.514 GMT [378] LOG:  execute <unnamed>: select t0.id from fh_group t0 where t0.when_archived is null and t0.fk_portfolio_id is null and t0.fk_organization_id = $1 limit 1
postgresql-ha-postgresql-2 21:50:48.514 GMT [378] DETAIL:  parameters: $1 = '44297691-749c-4661-b7af-8e8469c8a958'
postgresql-ha-postgresql-0 21:50:48.514 GMT [610] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-2 21:50:48.514 GMT [378] LOG:  execute S_1: COMMIT
postgresql-ha-postgresql-0 21:50:48.516 GMT [610] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-2 21:50:48.516 GMT [378] LOG:  execute S_2: BEGIN
postgresql-ha-postgresql-0 21:50:48.516 GMT [610] LOG:  execute <unnamed>: insert into fh_group (id, when_archived, is_admin_group, group_name, when_updated, when_created, version, fk_person_who_created, fk_portfolio_id, fk_organization_id) values ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10)
postgresql-ha-postgresql-0 21:50:48.516 GMT [610] DETAIL:  parameters: $1 = '92c31a68-a9cd-4ae5-a8cd-79d440bd0c0d', $2 = NULL, $3 = 't', $4 = 'org_admin', $5 = '2022-06-30 21:50:48.515+00', $6 = '2022-06-30 21:50:48.515+00', $7 = '1', $8 = NULL, $9 = NULL, $10 = '44297691-749c-4661-b7af-8e8469c8a958'
postgresql-ha-postgresql-0 21:50:48.516 GMT [610] ERROR:  null value in column "fk_person_who_created" of relation "fh_group" violates not-null constraint
postgresql-ha-postgresql-0 21:50:48.516 GMT [610] DETAIL:  Failing row contains (92c31a68-a9cd-4ae5-a8cd-79d440bd0c0d, null, null, null, t, 44297691-749c-4661-b7af-8e8469c8a958, org_admin, 2022-06-30 21:50:48.515+00, 2022-06-30 21:50:48.515+00, 1).
postgresql-ha-postgresql-0 21:50:48.516 GMT [610] STATEMENT:  insert into fh_group (id, when_archived, is_admin_group, group_name, when_updated, when_created, version, fk_person_who_created, fk_portfolio_id, fk_organization_id) values ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10)
postgresql-ha-postgresql-2 21:50:48.520 GMT [378] LOG:  execute S_6: ROLLBACK
postgresql-ha-postgresql-0 21:50:48.520 GMT [610] LOG:  execute S_6: ROLLBACK

mouchar avatar Jun 30 '22 22:06 mouchar

The time delta between insert into fh_person (done on primary postgresql-ha-postgresql-0) and select from fh_person (done on replica postgresql-ha-postgresql-2) is 9ms in this case. The rest of log is straightforward - if the select from fh_person returns zero rows (due to replication lag as I suppose), then the insert into fh_group has parameter $8 = NULL (fk_person_who_created).

While it is impossible to reliably avoid replication lag, it should be possible to avoid the select from fh_person query, especially if the Person is known because it was created literally a few lines above in the same method.

mouchar avatar Jun 30 '22 22:06 mouchar

it should be possible to avoid the select from fh_person query, especially if the Person is known because it was created literally a few lines above in the same method.

Yes, exactly. We don't want the select .. from fh_person t0 where t0.email = $1 ... and instead we should be using a reference bean when creating the group (insert into ... fh_group) regardless of replication lag.

So I think the next step is to find those 2 places in the code where the person is inserted, and where the group is inserted. We need the id value of the person inserted to be passed to the place where the group is inserted ... and remove the query for person by email and replace that with a reference bean using the passed person id value.

rbygrave avatar Jul 01 '22 00:07 rbygrave

However, thats not one request, there is a service layer issuing those request to create the user and then reissues the generic request to add groups to the created user using service layer primitives :-) If I wrapped it all in one transaction would that make a difference @rbygrave ? In an HA situation that would probably make sense for PUT, POST and DELETE requests.

rvowles avatar Jul 01 '22 00:07 rvowles

So personApi.create(email, username,null); here ... https://github.com/featurehub-io/featurehub/blob/v1.5.7/backend/mr/src/main/java/io/featurehub/mr/resources/oauth2/OAuth2MRAdapter.java#L114 ... that returns PersonApi.PersonToken which has the person.id value in it, so we need to capture that returned PersonToken ... and use that for the Group group = groupApi.createOrgAdminGroup(organization.getId(), "org_admin", person);

... BUT ...

Just above that we have ... Person person = authenticationApi.register(username, email, null, null); https://github.com/featurehub-io/featurehub/blob/v1.5.7/backend/mr/src/main/java/io/featurehub/mr/resources/oauth2/OAuth2MRAdapter.java#L122

And inside that is a select followed by an update if the row is found: https://github.com/featurehub-io/featurehub/blob/a6093bf5e8686300bf2ec24d60bb36f96d69ed49/backend/mr-db-sql/src/main/java/io/featurehub/db/services/AuthenticationSqlApi.java#L83-L113

... now if the select and update was surrounded by a transaction then that select would go against the master datasource, but it does not so that select is going to go against the "read-only" datasource (that often points to a read replica database which can have replication lag).

That is, to me it looks like there is a second issue here. This second issue is that the logic to reset the token in AuthenticationSqlApi.java#L83-L113 isn't actually working (because the select runs against a read-replica with potential replication lag). IF, we want this logic then we should wrap a transaction around that select + update. IF we did that, it would probably find the newly inserted person and work.

So it looks like there are 2 possible fixes to this issue.

rbygrave avatar Jul 01 '22 00:07 rbygrave

I literally just posted and then saw your question :) ... Yes, putting a transaction around the whole lot would mean all the select queries in that transaction are executed against the master and not the read-only datasource / read replica. So yes.

rbygrave avatar Jul 01 '22 00:07 rbygrave

However, thats not one request,

Yeah I don't quite follow that it's more than 1 request. Is it not that there is 1 request that is both inserting the person and then ultimately creating the group? This code block below is doing both:

https://github.com/featurehub-io/featurehub/blob/v1.5.7/backend/mr/src/main/java/io/featurehub/mr/resources/oauth2/OAuth2MRAdapter.java#L113-L127

?

rbygrave avatar Jul 01 '22 00:07 rbygrave

The OAuth2Adapter is making a request to the database layer which is in one transaction, and then does the groups which is another transaction - where you see personApi, groupApi, portfolioApi - those are all not in the same transaction. HA isn't something we have spent any time on, so if we wrap the appropriate steps (in this case, groups and person) in the same transaction I feel this would solve this specific issue?

I'm still not fully convinced, I ran into this issue with read replicas when issuing a POST and a GET request - the delay was sufficiently long that the GET would fail, and as you can't ensure the replica a person is talking to, you are always going to get these issues. If there was some way of ideally assigning a person based on some method to a particular connection so their requests always went to the same place, that would make it more reliable.

rvowles avatar Jul 01 '22 01:07 rvowles

so if we wrap the appropriate steps (in this case, groups and person) in the same transaction I feel this would solve this specific issue?

Yes agreed. That should fix it.

... with the caveat that, it is doing insert person, find person, update person, insert group so that isn't "optimal" per say vs just insert person, insert group but this does sound like a rare use case so I suspect that optimizing this to be just insert person, insert group isn't a priority.

when issuing a POST and a GET request

Yeah but I don't see the 2 http requests for this case? I see one request calling the single method https://github.com/featurehub-io/featurehub/blob/v1.5.7/backend/mr/src/main/java/io/featurehub/mr/resources/oauth2/OAuth2MRAdapter.java#L107-L143 ... and inside this method its doing what results in insert person, find person, update person, insert group ? ... and because there is not a transaction wrapping those the find person is using the read-only datasource / read replica (with the lag) ?

An option might be to turn on Ebean SQL and TXN logging and along with threadId it might be more obvious if this is multiple http requests ?

rbygrave avatar Jul 01 '22 02:07 rbygrave

FWIW AWS Aurora Postgres lag is expected to be less than 200ms and I've confirmed that was actually the case. AWS Aurora MySQL replication lag when I measured it in reality could be seconds !!! and this made it Aurora MySQL read replicas much much less useful/practical.

rbygrave avatar Jul 01 '22 02:07 rbygrave

2 HTTP requests happen when - for example someone updates a set of features which is one state on one page, and we issue a subsequent GET to update the summary page again, the different between the commit for the updates happening and the GET to get the summary data can happen in < 50ms, particularly when running locally.

rvowles avatar Jul 01 '22 03:07 rvowles

Yeah but that isn't this specific issue right? In this case there is only the 1 http request right?

rbygrave avatar Jul 01 '22 03:07 rbygrave

No, its not this specific issue :-)

rvowles avatar Jul 01 '22 03:07 rvowles

And there's one more practical side effect of having all this wrapped in TXN. When things go south for any reason during the initial user/group setup, everything is nicely rolled back and you don't end up with inconsistent DB with an existing user and without groups.

Then, people can simply retry the action and hope for a better outcome :)

mouchar avatar Jul 01 '22 14:07 mouchar

Yeah, so I dropped a PR that fixed this specific issue and ran it up and immediately hit other issues. One significant one is that we need to make sure feature updates are committed and in the database before we publish them. This leads to obvious problems as that async publish happens immediately after save and as its in a different transaction, it can go to a different database. That means I have to make sure the record is there by polling for it before i try and publish it, which is really pretty nasty. Suggestions are welcome!

rvowles avatar Jul 01 '22 21:07 rvowles