terraform-provider-azuread
Error: azuread_users hangs after http 404 starting in v2.38.0
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritise this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritise the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Terraform (and AzureAD Provider) Version
Terraform v1.2.6
registry.terraform.io/hashicorp/azuread v2.38.0
Affected Resource(s)
azuread_users
Terraform Configuration Files
data "azuread_users" "users" {
ignore_missing = true
object_ids = distinct(
flatten(
[for id,g in data.azuread_group.groups: g.members]
)
)
}
Debug Output
Note: output shortened for brevity, with IDs removed
2023-05-01T14:10:19.676-0600 [INFO] provider.terraform-provider-azuread_v2.38.0_x5.exe: 2023/05/01 14:10:19 [DEBUG] ============================ Begin AzureAD Response =========================== GET https://graph.microsoft.com/beta/users/:id Request ID:
HTTP/1.1 404 Not Found Transfer-Encoding: chunked Cache-Control: no-cache Client-Request-Id: Content-Type: application/json Date: Mon, 01 May 2023 20:10:19 GMT Deprecation: Thu, 17 Feb 2022 23:59:59 GMT
155 {"error":{"code":"Request_ResourceNotFound","message":"Resource ':id' does not exist or one of its queried reference-property objects are not present.","innerError":{"date":"2023-05-01T20:10:19","request-id":"","client-request-id":""}}} 0
============================= End AzureAD Response ============================:
Panic Output
Expected Behavior
The data source should have retrieved the member matching the given object ID in Azure AD.
Actual Behavior
Terraform runs through a good chunk of different groups with nested members, eventually reaching a point where it is reading a particular resource and never finishes; it hangs for 20 minutes and then times out. While running some traces it looks like a 404 happens when trying to read a resource and Terraform doesn't know what to do about it. It doesn't appear to do any retries, and the only logs after that error are that another resource is waiting for this one to finish.
Reverting the azuread provider back to 2.37.2 fixed the issue.
Steps to Reproduce
- Make sure azuread provider is at v2.38.0
- Add user members or nested groups to an AzureAD group
- Run terraform plan with the above data source to retrieve the members
Important Factoids
References
Hi @NattyMojo, thanks for reporting this. During the apparent hang, Terraform is actually retrying the request; it's just not logged for every retry at the moment. The reason for retrying is that this is frequently our only option to work around errors related to eventual consistency in Azure AD. However, we only want to retry where it makes sense, and I'll take a look to see what we can improve here. Thanks again for reporting this case.
Thanks for taking a peek @manicminer. Glad you called out that it is retrying, because all I see in the logs after the 404 is the following:
2023-05-01T14:10:22.558-0600 [TRACE] dag/walk: vertex "module.members_it_support.data.azuread_group.level_3_groups (expand)" is waiting for "module.members_it_support.data.azuread_users.level_3_users (expand)"
2023-05-01T14:10:26.952-0600 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/hashicorp/azuread\"] (close)" is waiting for "module.members_it_support.data.azuread_group.level_3_groups (expand)"
And it's just that over and over for about 20 minutes until the read times out. Let me know if there's anything further I can do to assist.
@NattyMojo I'm fairly confident that this is related to recent changes to retry logic, however I don't believe it's what I initially thought. In v2.38.0, we explicitly disabled retries for data sources, which logically should mean that you don't see such long timeouts when trying to read nonexistent users using the azuread_users data source.
I unfortunately haven't yet been able to reproduce the specific symptom you described of a 20 minute hang. Would you be able to confirm the following?
- How many users are you trying to retrieve in one instance of the data source?
- Were some/all of these users very recently added, e.g. in the same or preceding Terraform apply operation?
- Did you specify ignore_missing = true?
- Are you seeing a Terraform error message after the hang elapses? What's the message?
Additionally, if there's any way you could provide a debug log it would help greatly. I suspect in your case the log will be lengthy, but you should be able to post it in a gist after performing a regex search/replace to remove the UUIDs should you need to. Appreciate your feedback to help get to the bottom of this. Thanks!
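For example (just a sketch of one way to do the redaction; the file names here are placeholders, not anything the provider produces), a small Go program can replace anything UUID-shaped in the log before you share it:

```go
// Sketch: redact UUID-shaped strings from a debug log before sharing it.
// The input/output file names are placeholders.
package main

import (
	"log"
	"os"
	"regexp"
)

func main() {
	// Matches the standard 8-4-4-4-12 hex UUID layout used for object IDs.
	uuidPattern := regexp.MustCompile(`[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}`)

	raw, err := os.ReadFile("terraform-debug.log")
	if err != nil {
		log.Fatal(err)
	}

	redacted := uuidPattern.ReplaceAll(raw, []byte("<redacted-uuid>"))

	if err := os.WriteFile("terraform-debug-redacted.log", redacted, 0o644); err != nil {
		log.Fatal(err)
	}
}
```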
Hey @manicminer, sorry for the delayed response. We are trying to pull ~120 user accounts in the one module that was throwing the error, but we use the same pattern across multiple applications and groups at the same time, so when running everything together it could be as many as about 500 users over multiple modules. The users are within nested groups 3 levels deep: we traverse from one group, get all of its users and groups, then follow that set of child groups and do the same (only distinct users), then take their child groups and get all the users at that level too. The users have not been recently added; these have existed for at least 6 months or longer. Yes, ignore_missing is set to true.
I did try running again today and was able to get a plan made in about 13 minutes (most of which was spent on the problematic section), but normally that step takes seconds at most. When running today I did isolate to one application (one module per application), so maybe with less load on Azure it was able to complete the plan; with more applications it could bog down further to where we hit those 20 minute timeouts. I'll give that a go when I can.
The error we were seeing at the end of the runs was:
Error: Retrieving user with object ID: "<removed id>"
with module.members.data.azuread_users.level_3_users,
data "azuread_users" "level_3_users" {
UsersClient.BaseClient.Get(): Get
"[https://graph.microsoft.com/beta/users/<removed id>"](https://graph.microsoft.com/beta/users/<removed id>%22):
context deadline exceeded
module.members.data.azuread_users.level_3_users: Still reading... [20m0s elapsed]
Below is the bulk of what's happening in our module. This has been working for us forever and just broke as of v2.38.0
data "azuread_group" "level_0" {
display_name = var.group_name
}
data "azuread_users" "level_1_users" {
ignore_missing = true
object_ids = data.azuread_group.level_0.members
}
data "azuread_group" "level_1_groups" {
for_each = setsubtract(
data.azuread_group.level_0.members,
[for u in data.azuread_users.level_1_users.users: u.object_id]
)
object_id = each.key
}
data "azuread_users" "level_2_users" {
ignore_missing = true
object_ids = distinct(
flatten(
[for id,g in data.azuread_group.level_1_groups: g.members]
)
)
}
data "azuread_group" "level_2_groups" {
for_each = setsubtract(
distinct(
flatten(
[for id,g in data.azuread_group.level_1_groups: g.members]
),
),
[for u in data.azuread_users.level_2_users.users: u.object_id]
)
object_id = each.key
}
# **This is where we break**
data "azuread_users" "level_3_users" {
ignore_missing = true
object_ids = distinct(
flatten(
[for id,g in data.azuread_group.level_2_groups: g.members]
)
)
}
data "azuread_group" "level_3_groups" {
for_each = setsubtract(
distinct(
flatten(
[for id,g in data.azuread_group.level_2_groups: g.members]
)
),
[for u in data.azuread_users.level_3_users.users: u.object_id]
)
object_id = each.key
}
@manicminer any thoughts as to a workaround here? We'd like to start leveraging resources added in 2.42 but can't move forward because of this reported issue.
FWIW: getting the same error with the latest versions
C:\temp\user-data> tf -version
Terraform v1.6.4
on windows_amd64
+ provider registry.terraform.io/hashicorp/azuread v2.46.0
I don't grok Go myself, but looking at https://github.com/hashicorp/terraform-provider-azuread/pull/1090/files#diff-f9b9aabf70d6a2fcb30b7e6254917c65b36586d0ca8f38f6158ec125bc047e3fR160, all these defers assume the client has been locked in some way or is not shared, but I don't think that's the case... isn't it then a race condition between an actual execution against the shared client and the moment some other method's deferred func that sets .DisableRetries = false executes?
It seems to be load based: if I just have a few user lookups, I never get into the retry loop that slows everything down. The retry appears to use exponential backoff. It doesn't always happen, but it's pretty consistent, say 4 out of 5 tries with groups of significant enough size. It's also different which group it gets hung up on.
This effectively goes on forever, until the provider timeout is reached.
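For context, an exponential-backoff retry loop looks roughly like the sketch below (the intervals, the cap, and the retryable-status check are illustrative assumptions, not the provider's actual values), which is consistent with a request stuck retrying 404s eventually running into the 20 minute data-source timeout:

```go
// Sketch: a generic exponential-backoff retry loop, not the provider's code.
package main

import (
	"fmt"
	"net/http"
	"time"
)

// doWithBackoff retries a request on error or 404, sleeping exponentially
// longer between attempts, until maxWait elapses.
func doWithBackoff(req func() (*http.Response, error), maxWait time.Duration) (*http.Response, error) {
	start := time.Now()
	delay := time.Second

	for {
		resp, err := req()
		retryable := err != nil || (resp != nil && resp.StatusCode == http.StatusNotFound)
		if !retryable {
			return resp, nil
		}
		if time.Since(start) > maxWait {
			return resp, fmt.Errorf("giving up after %s", maxWait)
		}
		time.Sleep(delay)
		if delay < 30*time.Second {
			delay *= 2 // exponential backoff, capped at 30s between attempts
		}
	}
}

func main() {
	// Simulate an endpoint that always 404s, as with a "user" ID that is
	// actually a group ID: the loop keeps retrying until maxWait.
	alwaysMissing := func() (*http.Response, error) {
		return &http.Response{StatusCode: http.StatusNotFound}, nil
	}
	_, err := doWithBackoff(alwaysMissing, 10*time.Second)
	fmt.Println(err)
}
```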
In other words: line 159 sets DisableRetries to true on a shared client, line 160 defers setting it back to false, and line 185 uses the client.
But between lines 159 and 185, something else has already run the equivalent of line 160 on the same shared client, and we get a 404 (because we're asking for a user with a group's ID, since we don't know up front whether each member is a user or a group). The underlying client's retries are now engaged again, hence the loop.
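To make the race concrete, here's a minimal, self-contained sketch (the sharedClient type is a stand-in, not the provider's real client) of two reads sharing one client: each disables retries and defers re-enabling them, but whichever finishes first flips the flag back while the other's request is still pending.

```go
// Sketch: demonstrates the race described above with a stand-in client type.
// Two concurrent "data source reads" share one client; each sets
// DisableRetries = true and defers setting it back to false.
package main

import (
	"fmt"
	"sync"
	"time"
)

type sharedClient struct {
	DisableRetries bool
}

// get simulates the HTTP call; it checks the flag at request time,
// which is what makes the ordering matter.
func (c *sharedClient) get(name string, d time.Duration) {
	time.Sleep(d) // pretend some work happens before the request is issued
	fmt.Printf("%s issues request with DisableRetries=%v\n", name, c.DisableRetries)
}

func read(c *sharedClient, name string, d time.Duration, wg *sync.WaitGroup) {
	defer wg.Done()
	c.DisableRetries = true
	defer func() { c.DisableRetries = false }() // races with the other read
	c.get(name, d)
}

func main() {
	c := &sharedClient{}
	var wg sync.WaitGroup
	wg.Add(2)
	go read(c, "fast reader", 10*time.Millisecond, &wg)
	go read(c, "slow reader", 200*time.Millisecond, &wg)
	wg.Wait()
	// Typical output: the fast reader's deferred reset flips the flag back to
	// false before the slow reader issues its request, so the slow reader's
	// 404 gets retried even though it asked for retries to be disabled.
}
```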
So to recreate it, you're going to have to do a lot of lookups (it can be the same group over and over), and you're going to need some 404s that trigger the client's internal retries.
Maybe a hacky fix is to only set retries back on if this call was the one that disabled them, rather than unconditionally forcing them back on?
client := meta.(*clients.Client).Users.UsersClient
if !client.BaseClient.DisableRetries {
    client.BaseClient.DisableRetries = true
    defer func() { client.BaseClient.DisableRetries = false }()
}
That would suffer from the same underlying race condition, but with a much smaller blast radius.
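Another angle (purely a sketch of a general technique, not something the provider or SDK offers today; the sharedClient type, clientMu and getUser below are hypothetical stand-ins) would be to hold a lock across the whole disable/request/re-enable window, so two reads can never interleave their flag changes:

```go
// Sketch: avoid the race by serialising the disable-retries / request /
// re-enable window behind a mutex. All names here are hypothetical stand-ins.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type sharedClient struct {
	DisableRetries bool
}

func (c *sharedClient) getUser(id string) error {
	if c.DisableRetries {
		return errors.New("404 not found (returned immediately, no retries)")
	}
	return errors.New("404 not found (after a long retry loop)")
}

var clientMu sync.Mutex

func readUserWithoutRetries(c *sharedClient, id string) error {
	clientMu.Lock()
	defer clientMu.Unlock()

	c.DisableRetries = true
	defer func() { c.DisableRetries = false }()

	return c.getUser(id) // stand-in for UsersClient.BaseClient.Get()
}

func main() {
	c := &sharedClient{}
	fmt.Println(readUserWithoutRetries(c, "some-object-id"))
}
```

The obvious trade-off is that this serialises all reads against that client; a per-request retry setting in the SDK would presumably be the cleaner long-term fix.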
This concurrency issue continues to block us from upgrading the azuread provider :(
I was using v2.47.0, which worked fine at first, but started timing out as soon as I added a second data source of the same type. The first one finishes, and then the second one just keeps on "Still reading...".
Downgrading to v2.37.0 fixes this behaviour.
Ha, I almost thought I was going crazy over here. Thanks @jangrewe for confirming... I REALLY want to upgrade the azuread provider.
This is still an issue in 2.51.0, any chance of getting this fixed? ping @manicminer 😉