google-cloud-rust icon indicating copy to clipboard operation
google-cloud-rust copied to clipboard

Implement an OCC loop to update IAM policy bindings

Open coryan opened this issue 1 year ago • 6 comments

When working with IAM policies, applications typically want to do something like "add this member to that role". Because the IAM only support "change all the policies", applications need make multiple RPCs, something like:

  1. Call get_iam_policy(...) to get the current IAM policy
  2. Update the local copy of the policy
  3. Apply the change with set_iam_policy(...).

The IAM policy may change between steps 1 and 3, without some pre-condition that sequence of steps may overwrite the other changes. So the code becomes:

  1. Call get_iam_policy(..) to get the current IAM policy
  2. Update the local copy of the policy.
  3. Apply the change with set_iam_policy(...) and with the etag obtained in (1) to avoid overwrites
  4. If it succeeds: 🎉 we are done.
  5. If it fails with ABORTED that indicates a separate change, go back to step (1)
  6. If it fails with a different error, return this error.

coryan avatar Dec 24 '24 19:12 coryan

I would like to work on this.

nvnmandadhi avatar Sep 04 '25 14:09 nvnmandadhi

FYI: @nvnmandadhi

This will involve changes to the generator, which is in the librarian repo: https://github.com/googleapis/librarian/

You will find this guide useful: https://github.com/googleapis/google-cloud-rust/blob/main/doc/contributor/howto-guide-generated-code-maintenance.md#making-changes-to-sidekick


This is what the OCC loop looks like in C++: https://github.com/googleapis/google-cloud-cpp/blob/c06d686e32bd0fba5ab36f8fdedd729402428b94/google/cloud/spanner/admin/instance_admin_client.cc#L373-L406

I recommend manually implementing the OCC polling loop in one of the generated clients, with integration tests. When you have that, I think you should share the design here.

Then we can think about how to generalize the loop in the google-cloud-gax layer. And after that, we can think about how to generate the code.

dbolduc avatar Sep 04 '25 15:09 dbolduc

Due to the code changes required to other codebases, not going to work on this.

nvnmandadhi avatar Sep 04 '25 15:09 nvnmandadhi

OCC Loop Implementation

@dbolduc @coryan

I have completed the manual implementation of the OCC loop for IAM policy updates in the Storage client.

Complete OCC loop with etag-based conflict detection

Full Design Document

OCC Loop Implementation for IAM Policy Bindings

This implementation provides a manual OCC loop for safe IAM policy updates in the Storage client, following the approach outlined in the issue:

  • Prevents race conditions when multiple processes update IAM policies concurrently
  • Uses etag-based conflict detection with automatic retry
  • Follows the C++ reference implementation pattern
  • Ready for feedback before moving to google-cloud-gax layer
  1. (Complete): Manual implementation in Storage client
  2. (Next): Generalize in google-cloud-gax layer
  3. (Future): Auto-generate via librarian

The implementation follows the OCC pattern with etag-based concurrency control:

1. Get current policy with etag (get_iam_policy)
2. Apply user modifications via callback
3. Set updated policy with etag (set_iam_policy)
4. If ABORTED -> retry with exponential backoff
5. If other error -> immediate failure
6. If success -> done

Type Definitions

/// Callback function for updating an IAM policy.
///
/// Returns:
/// - `Ok(Some(Policy))` - updated policy to apply
/// - `Ok(None)` - cancel operation (not an error)
/// - `Err(Error)` - error during update
pub type IamUpdater<'a> = Box<dyn FnMut(Policy) -> Result<Option<Policy>> + Send + 'a>;

Design decisions:

  • FnMut for stateful updates (captures can be mutated)
  • Send for multi-threaded safety
  • Option<Policy> return for cancellation support
  • Sync (not async)

OccConfig

#[derive(Clone, Debug)]
pub struct OccConfig {
    /// Maximum number of retry attempts (default: 10)
    pub max_attempts: u32,
    
    /// Maximum total time for retry loop (default: 30s)
    pub max_duration: Duration,
    
    /// Backoff policy for delays between retries
    /// (default: exponential with 100ms initial, 32s max)
    pub backoff_policy: Arc<dyn BackoffPolicy>,
}
  • Sensible defaults aligned with other GCP libraries
  • Configurable for different use cases
  • Uses existing gax::backoff_policy::BackoffPolicy

Public Functions

  1. Core OCC Function
pub async fn update_iam_policy_with_occ(
    client: &StorageControl,
    resource: impl Into<String>,
    updater: IamUpdater<'_>,
    config: OccConfig,
) -> Result<Policy>
  1. Convenience Methods on StorageControl
impl StorageControl {
    /// Simple API with default configuration
    pub async fn update_iam_policy_with_retry<F>(
        &self,
        resource: impl Into<String>,
        updater: F,
    ) -> Result<Policy>
    where
        F: FnMut(Policy) -> Result<Option<Policy>> + Send + 'static;
    
    /// Advanced API with custom configuration
    pub async fn update_iam_policy_with_retry_config<F>(
        &self,
        resource: impl Into<String>,
        updater: F,
        config: OccConfig,
    ) -> Result<Policy>
    where
        F: FnMut(Policy) -> Result<Option<Policy>> + Send + 'static;
}

Examples

Add IAM Member

use google_cloud_storage::client::StorageControl;
use google_cloud_iam_v1::model::Binding;

let client = StorageControl::builder().build().await?;

let policy = client
    .update_iam_policy_with_retry(
        "projects/_/buckets/my-bucket",
        |mut policy| {
            policy.bindings.push(
                Binding::new()
                    .set_role("roles/storage.admin")
                    .set_members(["user:[email protected]"])
            );
            Ok(Some(policy))
        },
    )
    .await?;

Conditional Update

let policy = client
    .update_iam_policy_with_retry(
        "projects/_/buckets/my-bucket",
        |mut policy| {
            let member = "user:[email protected]";
            let role = "roles/storage.viewer";
            
            let exists = policy.bindings.iter().any(|b| {
                b.role == role && b.members.contains(&member.to_string())
            });
            
            if !exists {
                policy.bindings.push(
                    Binding::new()
                        .set_role(role)
                        .set_members([member])
                );
                Ok(Some(policy))
            } else {
                Ok(None)
            }
        },
    )
    .await?;

Custom Configuration

use google_cloud_storage::iam_occ::OccConfig;
use std::time::Duration;

let config = OccConfig {
    max_attempts: 5,
    max_duration: Duration::from_secs(10),
    ..Default::default()
};

let policy = client
    .update_iam_policy_with_retry_config(
        "projects/_/buckets/my-bucket",
        |mut policy| {
            // Update logic
            Ok(Some(policy))
        },
        config,
    )
    .await?;

Test Infrastructure

Implemented trait-based mocking for comprehensive testing:

/// Trait for IAM policy operations
#[cfg_attr(test, mockall::automock)]
#[async_trait::async_trait]
pub(crate) trait IamPolicyOperations {
    async fn get_iam_policy(&self, resource: &str) -> Result<Policy>;
    async fn set_iam_policy(&self, resource: &str, policy: Policy) -> Result<Policy>;
}

// Generic implementation for testability
async fn update_iam_policy_with_occ_impl<C: IamPolicyOperations>(
    client: &C,
    resource: impl Into<String>,
    updater: IamUpdater<'_>,
    config: OccConfig,
) -> Result<Policy>

// Public API wraps generic implementation (zero breaking changes)
pub async fn update_iam_policy_with_occ(
    client: &StorageControl,
    resource: impl Into<String>,
    updater: IamUpdater<'_>,
    config: OccConfig,
) -> Result<Policy> {
    update_iam_policy_with_occ_impl(client, resource, updater, config).await
}

Design Decisions & Rationale

1. Sync Updater (not async)

Decision: FnMut(Policy) -> Result<Option<Policy>>

Rationale:

  • Simpler for users (no async closure complexity)
  • Policy modifications are typically CPU-bound (no I/O)
  • Matches C++ reference implementation pattern
  • Can always add async variant later if needed

2. Option<Policy> Return

Decision: Return Option<Policy> from updater

Rationale:

  • Some(policy) = apply changes
  • None = cancel operation without error
  • Enables conditional updates ("only if doesn't exist")

3. update_iam_policy_with_retry

Decision: Use "retry" in public method names

Rationale:

  • More intuitive for users (describes behavior, not implementation)
  • Matches naming conventions in other GCP SDKs
  • "OCC" term reserved for module/implementation details
  • Aligns with user mental model

4. Trait-Based Mocking

Decision: Create IamPolicyOperations trait with generic implementation

Rationale:

  • Enables comprehensive unit testing without GCP
  • Preview of design (will move to gax)
  • Zero breaking changes (public API unchanged)
  • Follows project patterns (mockall)

5. Exponential Backoff

Decision: Use exponential backoff (100ms initial, 32s max)

Rationale:

  • Standard pattern for distributed systems
  • Reduces server load during contention
  • Matches C++ implementation behavior
  • Configurable via OccConfig

Preview

The current implementation previews design:

// Internal trait (pub(crate))
trait IamPolicyOperations {
    async fn get_iam_policy(&self, resource: &str) -> Result<Policy>;
    async fn set_iam_policy(&self, resource: &str, policy: Policy) -> Result<Policy>;
}

// Storage-specific
impl IamPolicyOperations for StorageControl { ... }

Future

// Public trait in gax layer
pub trait IamOperations {
    type Policy: Clone + Send;
    async fn get_iam_policy(&self, resource: &str) -> Result<Self::Policy>;
    async fn set_iam_policy(&self, resource: &str, policy: Self::Policy) -> Result<Self::Policy>;
}

// Extension trait for convenience
pub trait IamOperationsExt: IamOperations {
    async fn update_iam_policy_with_retry<F>(...) -> Result<Self::Policy>;
}

// Auto-implemented for all IAM services
impl<T: IamOperations> IamOperationsExt for T {}

This approach enables:

  • Reuse across all IAM-capable services
  • Consistent API patterns
  • Easy testing with mocks

Questions for Maintainers

1. API Design

  • Is update_iam_policy_with_retry an appropriate name?
  • Should we expose OccConfig in public API or use builder pattern?
  • Is sync updater acceptable, or do we need async variant?

2. Testing

  • Is trait-based mocking approach acceptable?
  • Should we add real GCP integration tests now or defer?
  • Is 100% scenario coverage sufficient?

3. Planning

  • Should trait be IamOperations or IamPolicyOperations?
  • Should extension trait auto-implement or be explicit?
  • Any concerns about generic implementation approach?

4. Documentation

  • Is rustdoc sufficient or need additional guides?
  • Should we add migration guide from manual get/set pattern?
  • Any specific examples you'd like to see?

Requesting feedback on:

  1. API design and naming choices
  2. Testing approach (trait-based mocking)
  3. Move to gax layer

vremyavnikuda avatar Oct 11 '25 13:10 vremyavnikuda

Thanks for the detailed design. Unfortunately we are a bit busy at the moment, and really cannot provide equally detailed feedback. I hope you did not waste too much time working on this feature.

coryan avatar Oct 14 '25 15:10 coryan

I have a vision of what it should look like, if it fits with your vision, then i would like to write code )

vremyavnikuda avatar Oct 14 '25 18:10 vremyavnikuda