
Error: azuread_users hangs after http 404 starting in v2.38.0

Open NattyMojo opened this issue 2 years ago • 11 comments

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritise this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritise the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureAD Provider) Version

Terraform v1.2.6 registry.terraform.io/hashicorp/azuread v2.38.0

Affected Resource(s)

  • azuread_users

Terraform Configuration Files

data "azuread_users" "users" {
  ignore_missing = true
  object_ids =  distinct(
    flatten(
      [for id,g in data.azuread_group.groups: g.members]
    )
  )
}

Debug Output

Note: output shortened for brevity and IDs removed

2023-05-01T14:10:19.676-0600 [INFO] provider.terraform-provider-azuread_v2.38.0_x5.exe: 2023/05/01 14:10:19 [DEBUG] ============================ Begin AzureAD Response ===========================
GET https://graph.microsoft.com/beta/users/:id
Request ID:

HTTP/1.1 404 Not Found
Transfer-Encoding: chunked
Cache-Control: no-cache
Client-Request-Id:
Content-Type: application/json
Date: Mon, 01 May 2023 20:10:19 GMT
Deprecation: Thu, 17 Feb 2022 23:59:59 GMT

155
{"error":{"code":"Request_ResourceNotFound","message":"Resource ':id' does not exist or one of its queried reference-property objects are not present.","innerError":{"date":"2023-05-01T20:10:19","request-id":"","client-request-id":""}}}
0

============================= End AzureAD Response ============================

Panic Output

Expected Behavior

The data source should return the user objects for the given member IDs from Azure AD.

Actual Behavior

Terraform works through a good number of groups with nested members, then reaches a point where it starts reading a particular data source, never finishes, and times out after 20 minutes. Tracing shows that a 404 occurs when reading that resource and Terraform doesn't appear to know what to do about it. It doesn't seem to retry, and the only log entries after the error are messages that another resource is waiting for this one to finish.

Reverting the azuread provider to v2.37.2 fixed the issue.

Steps to Reproduce

  1. Make sure the azuread provider is at v2.38.0
  2. Add user members or nested groups to an Azure AD group
  3. Run terraform plan with the above data source to retrieve the members

Important Factoids

References

NattyMojo avatar May 02 '23 15:05 NattyMojo

Hi @NattyMojo, thanks for reporting this. During the apparent hang, Terraform is actually retrying the request; it's just not logged for every retry at the moment. The reason for retrying is that this is frequently our only option for defeating errors related to eventual consistency in Azure AD. However, we only want to retry where it makes sense, and I'll take a look to see what we can improve here. Thanks again for reporting this case.

manicminer avatar May 03 '23 08:05 manicminer

Thanks for taking a peek @manicminer. Glad you called out that it is retrying, because all I see in the logs after the 404 is the following:

2023-05-01T14:10:22.558-0600 [TRACE] dag/walk: vertex "module.members_it_support.data.azuread_group.level_3_groups (expand)" is waiting for "module.members_it_support.data.azuread_users.level_3_users (expand)"
2023-05-01T14:10:26.952-0600 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/hashicorp/azuread\"] (close)" is waiting for "module.members_it_support.data.azuread_group.level_3_groups (expand)"

And it's just that over and over for about 20 minutes until the reading times out. Let me know if there's anything further I can do to assist.

NattyMojo avatar May 03 '23 20:05 NattyMojo

@NattyMojo I'm fairly confident that this is related to recent changes to retry logic; however, I don't believe it's what I initially thought. In v2.38.0, we explicitly disabled retries for data sources, which logically should mean that you don't see such long timeouts when trying to read nonexistent users using the azuread_users data source.

I unfortunately haven't yet been able to reproduce the specific symptom you described of a 20-minute hang. Would you be able to confirm the following?

  • How many users are you trying to retrieve in one instance of the data source?
  • Were some/all of these users very recently added, e.g. in the same or preceding Terraform apply operation?
  • Did you specify ignore_missing = true?
  • Are you seeing a Terraform error message after the hang elapses? What's the message?

Additionally, if there's any way you could provide a debug log it would help greatly. I suspect in your case the log will be lengthy, but you should be able to post it in a gist after performing a regex search/replace to remove the UUIDs should you need to. Appreciate your feedback to help get to the bottom of this. Thanks!
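
If it helps, here is a minimal, illustrative Go helper for redacting UUIDs from a debug log before sharing it, along the lines of the regex search/replace suggested above. It is not part of the provider or any official tooling, and the file name redact.go below is just an assumption for the usage example.

package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	// Matches the canonical 8-4-4-4-12 hex UUID format used for Azure AD object IDs.
	uuid := regexp.MustCompile(`[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}`)

	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // debug logs can have very long lines

	// Copy stdin to stdout, replacing every UUID with a placeholder.
	for scanner.Scan() {
		fmt.Println(uuid.ReplaceAllString(scanner.Text(), "<redacted>"))
	}
}

Run it as, for example: go run redact.go < terraform-debug.log > redacted.log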

manicminer avatar May 11 '23 23:05 manicminer

Hey @manicminer, sorry for the delayed response. We are trying to pull ~120 user accounts in the one module that was throwing the error. We use the same pattern across multiple applications and groups at the same time, so when running everything together it could be as many as about 500 users over multiple modules. The users sit within nested groups three levels deep: we start from one group and get all of its users and groups, then follow the child groups and do the same (keeping only distinct users), then take their child groups and get all the users at that level too. The users have not been recently added to the groups; they have existed for at least six months. Yes, ignore_missing is set to true.

I did try running again today and was able to get a plan in about 13 minutes (most of which was spent on the problematic section), though normally that step takes seconds at most. Today I isolated the run to one application (we use one module per application), so maybe with less load on Azure it was able to complete the plan; with more applications it could bog down further to where we hit those 20-minute timeouts. I'll give that a go when I can.

The error we were seeing at the end of the runs was:

Error: Retrieving user with object ID: "<removed id>"

  with module.members.data.azuread_users.level_3_users,
 data "azuread_users" "level_3_users" {

UsersClient.BaseClient.Get(): Get "https://graph.microsoft.com/beta/users/<removed id>": context deadline exceeded
module.members.data.azuread_users.level_3_users: Still reading... [20m0s elapsed]

Below is the bulk of what's happening in our module. This has been working for us for a long time and only broke as of v2.38.0.

data "azuread_group" "level_0" {
  display_name     = var.group_name
}

data "azuread_users" "level_1_users" {
  ignore_missing = true
  object_ids =  data.azuread_group.level_0.members
}

data "azuread_group" "level_1_groups" {
  for_each = setsubtract(
      data.azuread_group.level_0.members,
      [for u in data.azuread_users.level_1_users.users: u.object_id]
  )
  object_id = each.key
}

data "azuread_users" "level_2_users" {
  ignore_missing = true
  object_ids =  distinct(
    flatten(
      [for id,g in data.azuread_group.level_1_groups: g.members]
    )
  )
}

data "azuread_group" "level_2_groups" {
  for_each = setsubtract(
      distinct(
        flatten(
          [for id,g in data.azuread_group.level_1_groups: g.members]
        ),
      ),
      [for u in data.azuread_users.level_2_users.users: u.object_id]
  )
  object_id = each.key
}

# **This is where we break**
data "azuread_users" "level_3_users" {
  ignore_missing = true
  object_ids =  distinct(
    flatten(
      [for id,g in data.azuread_group.level_2_groups: g.members]
    )
  )
}

data "azuread_group" "level_3_groups" {
  for_each = setsubtract(
      distinct(
        flatten(
          [for id,g in data.azuread_group.level_2_groups: g.members]
        )
      ),
      [for u in data.azuread_users.level_3_users.users: u.object_id]
  )
  object_id = each.key
}

NattyMojo avatar May 18 '23 22:05 NattyMojo

@manicminer, any thoughts on a workaround here? We'd like to start leveraging resources added in v2.42 but can't move forward because of this reported issue.

drdamour avatar Nov 28 '23 22:11 drdamour

FWIW: getting the same error with the latest versions.

C:\temp\user-data> tf -version
Terraform v1.6.4
on windows_amd64
+ provider registry.terraform.io/hashicorp/azuread v2.46.0

drdamour avatar Nov 28 '23 22:11 drdamour

I don't grok Go myself, but looking at https://github.com/hashicorp/terraform-provider-azuread/pull/1090/files#diff-f9b9aabf70d6a2fcb30b7e6254917c65b36586d0ca8f38f6158ec125bc047e3fR160, all these defers assume the client has been locked in some way or is not shared, but I don't think that's the case. Isn't it then a race condition between an actual execution against the shared client and the moment when some other method's deferred func that sets .DisableRetries = false executes?

It seems to be load based: if I just have a few user lookups, I never get into the retry loop that slows everything down. The retry appears to use exponential backoff. It doesn't always happen, but it's pretty consistent, say 4 out of 5 tries, with groups of significant enough size. Also, it's a different group that it gets hung up on each time.


This retry loop effectively goes on forever until the provider timeout is reached.

In other words: line 159 sets DisableRetries to true on a shared client, line 160 defers setting it back to false, and line 185 uses the client.

But between lines 159 and 185, something else has already run the equivalent of line 160 on the shared client. When we then hit a 404 (because we are asking for a user with a group's ID, since we don't know up front whether the ID is a user or a group), the underlying client's retries are engaged again, and hence the loop.
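
To make the suspected race concrete, here is a minimal, hypothetical Go sketch of the toggle-and-defer pattern described above. The type and field names only mirror the discussion and are not the provider's actual code; running it with go run -race flags the data race on the shared flag.

package main

import (
	"fmt"
	"sync"
	"time"
)

// baseClient stands in for the shared SDK client whose DisableRetries flag
// is flipped by every data source read in the pattern described above.
type baseClient struct {
	DisableRetries bool
}

// readDataSource imitates one data source read: disable retries, defer
// re-enabling them, then "use" the client. Nothing serialises access to the
// shared flag, so concurrent reads can clobber each other's setting.
func readDataSource(c *baseClient, id int) {
	c.DisableRetries = true                     // the equivalent of "line 159"
	defer func() { c.DisableRetries = false }() // the equivalent of "line 160"

	time.Sleep(10 * time.Millisecond) // simulate the HTTP round trip

	// The equivalent of "line 185": by now another goroutine's deferred func
	// may already have set DisableRetries back to false, so a 404 here would
	// be retried after all.
	fmt.Printf("read %d: DisableRetries=%v\n", id, c.DisableRetries)
}

func main() {
	shared := &baseClient{}
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			readDataSource(shared, i)
		}(i)
	}
	wg.Wait()
}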

So to recreate this, you're going to have to do a lot of lookups (it can be the same group over and over), and you're going to need some 404s that trigger the client's internal retries.

Maybe a hack fix is to only disable retries (and defer re-enabling them) when they aren't already disabled, rather than unconditionally forcing the flag back on?

client := meta.(*clients.Client).Users.UsersClient
if !client.BaseClient.DisableRetries {
  client.BaseClient.DisableRetries = true
  defer func() { client.BaseClient.DisableRetries = false }()
}

That would still suffer from the same underlying race condition, but the blast radius would be a lot smaller.
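
For comparison, a hypothetical race-free alternative would be to avoid mutating the shared client at all and instead take a shallow copy per read, disabling retries only on the copy. The types below are illustrative, not the provider's real API.

package main

import "fmt"

type baseClient struct {
	DisableRetries bool
}

type usersClient struct {
	BaseClient baseClient
}

func (c usersClient) get(id string) string {
	return fmt.Sprintf("GET /users/%s (retries disabled: %v)", id, c.BaseClient.DisableRetries)
}

// readWithoutRetries copies the client by value and disables retries only on
// the copy; the shared instance is never mutated, so concurrent reads cannot
// interfere with each other's setting.
func readWithoutRetries(shared usersClient, id string) string {
	local := shared
	local.BaseClient.DisableRetries = true
	return local.get(id)
}

func main() {
	shared := usersClient{}
	fmt.Println(readWithoutRetries(shared, "00000000-0000-0000-0000-000000000000"))
	fmt.Println("shared client still has retries enabled:", !shared.BaseClient.DisableRetries)
}

Whether this is feasible depends on how the real SDK client is structured (for example, whether it holds state that must not be copied), so treat it purely as a sketch of the idea.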

drdamour avatar Nov 28 '23 23:11 drdamour

This concurrency issue continues to block us from upgrading the azuread provider :(

drdamour avatar Jan 13 '24 16:01 drdamour

I was using v2.47.0, which worked fine at first but started timing out as soon as I added a second data source of the same type. The first one finishes, and then the second one just keeps on "Still reading...".

Downgrading to v2.37.0 fixes this behaviour.

jangrewe avatar Feb 28 '24 10:02 jangrewe

Ha, I almost thought I was going crazy over here. Thanks @jangrewe for confirming... I REALLY want to upgrade the azuread provider.

drdamour avatar Feb 29 '24 03:02 drdamour

This is still an issue in v2.51.0; any chance of getting this fixed? Ping @manicminer 😉

jangrewe avatar Jun 12 '24 13:06 jangrewe