graylog2-server
graylog2-server copied to clipboard
Add unique index for users.username
What?
Add a unique index for the users.username
field.
Why?
At least two instances have occurred when customers were upgrading to Graylog 4.3.2
where duplicate users (with the same username) were found and resulted in failed migrations.
For example for the graylog-sidecar
user:
2022-06-03T10:22:18.961-05:00 ERROR [UserServiceImpl] There was more than one matching user for username graylog-sidecar. This should never happen.
2022-06-03T10:22:18.961-05:00 WARN [ServerBootstrap] Exception while running migrations
java.lang.RuntimeException: There was more than one matching user for username graylog-sidecar. This should never happen.
at org.graylog2.users.UserServiceImpl.load(UserServiceImpl.java:175) ~[graylog.jar:?]
at org.graylog2.migrations.MigrationHelpers.ensureUser(MigrationHelpers.java:98) ~[graylog.jar:?]
at org.graylog.plugins.sidecar.migrations.V20180323150000_AddSidecarUser.upgrade(V20180323150000_AddSidecarUser.java:60) ~[graylog.jar:?]
at org.graylog2.bootstrap.ServerBootstrap.lambda$runMigrations$0(ServerBootstrap.java:263) ~[graylog.jar:?]
at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:422) ~[graylog.jar:?]
at com.google.common.collect.RegularImmutableSortedSet.forEach(RegularImmutableSortedSet.java:8
This also happened for another user during a 4.2.9
-4.3.2
upgrade for the Forwarder system user, which is added in this migration:
https://github.com/Graylog2/graylog-plugin-enterprise/blob/5ea3908413513f3a344d936a0d86780572b7776e/enterprise/src/main/java/org/graylog/plugins/forwarder/migrations/V20200918000000_AddForwarderUser.java#L40-L47
Your Environment
- Graylog Version: 4.3.2
- Elasticsearch Version:
- MongoDB Version: 4.2.20
- Operating System:
- Browser version:
@bernd @kroepke @mpfz0r @thll Do you have any idea how the duplicate users could be coming about? The migrations use the following method to ensure the user exists:
https://github.com/Graylog2/graylog2-server/blob/642342e2fd2b5e27db76b5ebd09db624ab95c1b9/graylog2-server/src/main/java/org/graylog2/migrations/MigrationHelpers.java#L88
As far as I know, in the two instances this has happened, it was for smaller version upgrades, and these system users should have already existed.
In the absence of understanding how the duplicate users are being created, perhaps we can add a migration to add a unique index to the username field to at least prevent the duplicate user from being created. Thoughts?
Do you have any idea how the duplicate users could be coming about?
@danotorrey Do we have more logs? The stack trace above happens when the duplicate user already exists. It would be interesting to see the logs from the startup when the duplicate user got created.
But I also don't see how the code would allow the duplicate creation of the user. Perhaps it happens in some kind of race condition if there are two leader nodes in the cluster or when there are issues with the MongoDB connection.
We could improve the code to do an upsert instead of a fetch and update to avoid potential race conditions.
In the absence of understanding how the duplicate users are being created, perhaps we can add a migration to add a unique index to the username field to at least prevent the duplicate user from being created. Thoughts?
That sounds like a good idea. I am unsure if we tried that at some point in the past and failed. The problem is that the index creation fails if the collection already contains duplicate entries. That said, I can't think of an issue enabling the unique index on the username right now. The unique index is case-sensitive by default, which we need. The user service throws an error when loading a user that is duplicated. So it should be safe. Can you think of anything @mpfz0r?
@bernd I am checking if the previous logs are available from the two recent reported instances. I also initially thought that the most likely cause was a race condition. Probably two parallel executions of the MigrationHelpers.ensureUser()
method. Both get past the userService.load(userName)
on line 98, and then both create the user.
But, the only part that does not make sense to me is, both instances of the error were 4.2.9
-> 4.3.x
upgrade scenarios. Those users should have existed already (both the forwarder user and the sidecar user). How would the userService.load(userName)
not return the user in that case? It does not fit with the race condition theory.
https://github.com/Graylog2/graylog2-server/blob/354afa04f96b250d583e41f7e4af7121dfb2e00b/graylog2-server/src/main/java/org/graylog2/migrations/MigrationHelpers.java#L98-L125
I think we should keep digging a bit more to find at least a possible scenario that could cause the duplicates. If it turns out to be a race condition, then I am leaning more towards making the ensureUser
method more robust (e.g. use upsert
instead) than the unique index. Making that method more robust would more specifically target the cause of the issue, vs. just protecting against the symptom (the unique index).
But, the only part that does not make sense to me is, both instances of the error were
4.2.9
->4.3.x
upgrade scenarios. Those users should have existed already (both the forwarder user and the sidecar user). How would theuserService.load(userName)
not return the user in that case? It does not fit with the race condition theory.
@danotorrey Good point! :+1: I didn't think of that.
I think we should keep digging a bit more to find at least a possible scenario that could cause the duplicates.
Sounds good, thank you! :pray:
Hello,
just my 2 cents as we stumbled on errors (different than the ones above) while upgrading to 4.3, too.
In my opinition the duplication of the users was in prior upgrade and in 4.3 there is a change which stops migrations on errors.
In our instance for example this migration exited on error, but was added way before 4.3:
2022-06-27 21:17:01,437 WARN : org.graylog2.bootstrap.ServerBootstrap - Exception while running migrations
org.graylog.shaded.elasticsearch7.org.elasticsearch.ElasticsearchException: An error occurred:
at org.graylog.storage.elasticsearch7.ElasticsearchClient.exceptionFrom(ElasticsearchClient.java:151) ~[?:?]
at org.graylog.storage.elasticsearch7.ElasticsearchClient.execute(ElasticsearchClient.java:111) ~[?:?]
at org.graylog.storage.elasticsearch7.ElasticsearchClient.execute(ElasticsearchClient.java:104) ~[?:?]
at org.graylog.storage.elasticsearch7.views.migrations.V20200730000000_AddGl2MessageIdFieldAliasForEventsES7.addGl2MessageIdFieldAlias(V20200730000000_AddGl2MessageIdFieldAliasForEventsES7.java:53) ~[?:?]
at org.graylog.plugins.views.migrations.V20200730000000_AddGl2MessageIdFieldAliasForEvents.upgrade(V20200730000000_AddGl2MessageIdFieldAliasForEvents.java:76) ~[graylog.jar:?]
at org.graylog2.bootstrap.ServerBootstrap.lambda$runMigrations$0(ServerBootstrap.java:263) ~[graylog.jar:?]
at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:422) ~[graylog.jar:?]
at com.google.common.collect.RegularImmutableSortedSet.forEach(RegularImmutableSortedSet.java:88) ~[graylog.jar:?]
at org.graylog2.bootstrap.ServerBootstrap.runMigrations(ServerBootstrap.java:261) ~[graylog.jar:?]
at org.graylog2.bootstrap.ServerBootstrap.startCommand(ServerBootstrap.java:187) [graylog.jar:?]
at org.graylog2.bootstrap.CmdLineTool.run(CmdLineTool.java:311) [graylog.jar:?]
at org.graylog2.bootstrap.Main.main(Main.java:45) [graylog.jar:?]
Caused by: org.graylog.shaded.elasticsearch7.org.elasticsearch.ElasticsearchStatusException: Elasticsearch exception [type=mapper_parsing_exception, reason=Invalid [path] value [id] for field alias [gl2_message_id]: an alias must refer to an existing field in the mappings.]
at org.graylog.shaded.elasticsearch7.org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:187) ~[?:?]
at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:1892) ~[?:?]
at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:1869) ~[?:?]
at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1626) ~[?:?]
at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1598) ~[?:?]
at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1565) ~[?:?]
at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.IndicesClient.putMapping(IndicesClient.java:353) ~[?:?]
at org.graylog.storage.elasticsearch7.views.migrations.V20200730000000_AddGl2MessageIdFieldAliasForEventsES7.lambda$addGl2MessageIdFieldAlias$2(V20200730000000_AddGl2MessageIdFieldAliasForEventsES7.java:53) ~[?:?]
at org.graylog.storage.elasticsearch7.ElasticsearchClient.execute(ElasticsearchClient.java:109) ~[?:?]
... 10 more
Suppressed: org.graylog.shaded.elasticsearch7.org.elasticsearch.client.ResponseException: method [PUT], host [http://9200], URI [/gl-events*,gl-system-events*/_mapping?master_timeout=30s&ignore_unavailable=true&expand_wildcards=open%2Cclosed&allow_no_indices=true&ignore_throttled=false&timeout=30s], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"Invalid [path] value [id] for field alias [gl2_message_id]: an alias must refer to an existing field in the mappings."}],"type":"mapper_parsing_exception","reason":"Invalid [path] value [id] for field alias [gl2_message_id]: an alias must refer to an existing field in the mappings."},"status":400}
at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestClient.convertResponse(RestClient.java:302) ~[?:?]
at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestClient.performRequest(RestClient.java:272) ~[?:?]
at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestClient.performRequest(RestClient.java:246) ~[?:?]
at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1613) ~[?:?]
at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1598) ~[?:?]
at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1565) ~[?:?]
at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.IndicesClient.putMapping(IndicesClient.java:353) ~[?:?]
at org.graylog.storage.elasticsearch7.views.migrations.V20200730000000_AddGl2MessageIdFieldAliasForEventsES7.lambda$addGl2MessageIdFieldAlias$2(V20200730000000_AddGl2MessageIdFieldAliasForEventsES7.java:53) ~[?:?]
at org.graylog.storage.elasticsearch7.ElasticsearchClient.execute(ElasticsearchClient.java:109) ~[?:?]
at org.graylog.storage.elasticsearch7.ElasticsearchClient.execute(ElasticsearchClient.java:104) ~[?:?]
at org.graylog.storage.elasticsearch7.views.migrations.V20200730000000_AddGl2MessageIdFieldAliasForEventsES7.addGl2MessageIdFieldAlias(V20200730000000_AddGl2MessageIdFieldAliasForEventsES7.java:53) ~[?:?]
at org.graylog.plugins.views.migrations.V20200730000000_AddGl2MessageIdFieldAliasForEvents.upgrade(V20200730000000_AddGl2MessageIdFieldAliasForEvents.java:76) ~[graylog.jar:?]
at org.graylog2.bootstrap.ServerBootstrap.lambda$runMigrations$0(ServerBootstrap.java:263) ~[graylog.jar:?]
at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:422) ~[graylog.jar:?]
at com.google.common.collect.RegularImmutableSortedSet.forEach(RegularImmutableSortedSet.java:88) ~[graylog.jar:?]
at org.graylog2.bootstrap.ServerBootstrap.runMigrations(ServerBootstrap.java:261) ~[graylog.jar:?]
at org.graylog2.bootstrap.ServerBootstrap.startCommand(ServerBootstrap.java:187) [graylog.jar:?]
at org.graylog2.bootstrap.CmdLineTool.run(CmdLineTool.java:311) [graylog.jar:?]
at org.graylog2.bootstrap.Main.main(Main.java:45) [graylog.jar:?]
As seen in git history the file and so the migration is very old, but did not error out on startup prior to 4.3: https://github.com/Graylog2/graylog2-server/commits/master/graylog-storage-elasticsearch7/src/main/java/org/graylog/storage/elasticsearch7/views/migrations/V20200730000000_AddGl2MessageIdFieldAliasForEventsES7.java
The migration behaviour was changed with this PR to exit(1) on error: https://github.com/Graylog2/graylog2-server/pull/11526
Previously failed migrations had no impact on running graylog-server
EDIT: Discovered the ignore_migration_failures
config value, which let us proceed with our upgrade.
Has there been any movement on this? We're facing a similar issue while upgrading 4.2.x -> 4.3.x. Our master node is failing to start due to duplicate user error:
java.lang.RuntimeException: There was more than one matching user for username graylog-sidecar. This should never happen.
at org.graylog2.users.UserServiceImpl.load(UserServiceImpl.java:175) ~[graylog.jar:?]
at org.graylog2.migrations.MigrationHelpers.ensureUser(MigrationHelpers.java:98) ~[graylog.jar:?]
at org.graylog.plugins.sidecar.migrations.V20180323150000_AddSidecarUser.upgrade(V20180323150000_AddSidecarUser.java:60) ~[graylog.jar:?]
at org.graylog2.bootstrap.ServerBootstrap.lambda$runMigrations$0(ServerBootstrap.java:264) ~[graylog.jar:?]
at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:422) ~[graylog.jar:?]
at com.google.common.collect.RegularImmutableSortedSet.forEach(RegularImmutableSortedSet.java:88) ~[graylog.jar:?]
at org.graylog2.bootstrap.ServerBootstrap.runMigrations(ServerBootstrap.java:261) ~[graylog.jar:?]
at org.graylog2.bootstrap.ServerBootstrap.startCommand(ServerBootstrap.java:187) [graylog.jar:?]
at org.graylog2.bootstrap.CmdLineTool.run(CmdLineTool.java:312) [graylog.jar:?]
at org.graylog2.bootstrap.Main.main(Main.java:45) [graylog.jar:?]
I think we should keep digging a bit more to find at least a possible scenario that could cause the duplicates. If it turns out to be a race condition, then I am leaning more towards making the ensureUser method more robust (e.g. use upsert instead) than the unique index. Making that method more robust would more specifically target the cause of the issue, vs. just protecting against the symptom (the unique index).
+1
When digging for possible causes of the bug, we should focus on changes that happened between 4.2.x
-> 4.3.x
.
Also, while both the upsert and making the index unique makes sense, we should not forget that we have to resolve the duplicate users first. We should try and keep the user that has a token assigned, because that's likely the one that is in use. If both have tokens, we can only rename one of them.
What I have gleaned from the code:
-
UserService.save
already performs an upsert viaPersistedServiceImpl.save
. No need to change anything there. - Access tokens are mapped to users via
username
. If there are duplicate users, then that mapping is also ambiguous. I.e. it doesn't matter which of the duplicates is renamed.