aws-mobile-appsync-sdk-android Implement callback to specify a custom retry policy

I am using AppSync with Cognito. The app I am building allows the user to make some mutations offline, even before authentication is complete. The user never needs to enter credentials: they are auto-generated, and the app signs up and signs in automatically.

So, when the user starts the app for the first time, offline, and makes some mutations, the mutations are sent to the cloud as soon as the device goes online, but most of them fail because authentication is usually slower.

So my question is: can I prevent the mutations from being sent until the authentication is valid?

If that's not possible, is it possible to cancel all the enqueued mutations?

Dec 05 '18 20:12 Joseph82

Hi @Joseph82 ,

Just to clarify, are you using Cognito Userpools for sign-in? If so, you may want to look at using AWSMobileClient for authentication with Userpools, because it supports blocking calls until a user is signed in. The client first has to be aware that you are signed in so that it can transition between SIGNED_IN and SIGNED_OUT_USER_POOLS_TOKENS_INVALID. The latter will pause the operation until you can sign the user in again. More documentation

Dec 05 '18 21:12 minbi

Hi @minbi , thank you for the answer.

Yes, I am using Cognito Userpools for sign-in.

I see what you mean, but if I block the user until they sign in, it means they cannot start using the app offline. Is that correct?

Dec 05 '18 21:12 Joseph82

I believe offline mutations shouldn't be blocked, since in that case the SDK shouldn't be attempting to grab credentials when offline is detected. I'll let @cbommas chime in on the exact mechanism of offline mutations at this version of the SDK.

Dec 05 '18 21:12 minbi

Indeed, when offline is detected there is no attempt to grab credentials, but as soon as online is detected, the SDK tries to sync data, executing the mutations in the queue one by one. As soon as the device is online I also start the sign-in procedure, but it is usually slower, and in the meantime all the mutations fail.

Dec 05 '18 21:12 Joseph82

So, when the SDK grabs credentials from AWSMobileClient, it will block if it needs fresher credentials, thus waiting for you to sign in, if required. Unless the mutation has a timeout that I'm not aware of.

Dec 05 '18 21:12 minbi

But right now I am using AWSAppSyncClient. Does it depend on AWSMobileClient?

Dec 05 '18 21:12 Joseph82

Ah, apologies for not being clear.

In my current proposal, AWSAppSyncClient would be using AWSMobileClient as the object that provides the Userpools tokens through the CognitoUserPoolsAuthProvider interface.

Therefore when the offline mutations are being executed they will block for credentials by calling AWSMobileClient.
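
To illustrate what "block for credentials" means here, below is a minimal sketch in plain Java. It is not AWSMobileClient's implementation; `BlockingTokenSource`, `onSignedIn` and `awaitToken` are hypothetical names, and the timeout is an assumption added so the caller cannot wait forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical holder that makes callers wait until sign-in has produced a token. */
final class BlockingTokenSource {
    private final CompletableFuture<String> token = new CompletableFuture<>();

    /** Called by the sign-in flow once an ID token is available. */
    void onSignedIn(String idToken) {
        token.complete(idToken);
    }

    /** Blocks the calling (non-UI) thread until a token arrives or the timeout expires. */
    String awaitToken(long timeout, TimeUnit unit) {
        try {
            return token.get(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting for sign-in", e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException("No token available", e);
        }
    }
}
```

A provider's getLatestAuthToken() could delegate to something like `awaitToken(...)`, so a queued mutation waits for the sign-in instead of being sent with an invalid token.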

Dec 05 '18 21:12 minbi

That sounds great, @minbi. But how do I tell AWSAppSyncClient to use AWSMobileClient?

Dec 05 '18 22:12 Joseph82

You would specify it in the builder. One way to do it would be as follows:

client = AWSAppSyncClient.builder()
        .context(context)
        .awsConfiguration(awsConfiguration)
        .cognitoUserPoolsAuthProvider(new CognitoUserPoolsAuthProvider() {
            @Override
            public String getLatestAuthToken() {
                try {
                    // Ask AWSMobileClient for the current Userpools ID token.
                    return AWSMobileClient.getInstance().getTokens().getIdToken().getTokenString();
                } catch (Exception e) {
                    Log.e("APPSYNC_ERROR", e.getLocalizedMessage());
                    // Caution: this returns the error message in place of a token.
                    return e.getLocalizedMessage();
                }
            }
        }).build();

Dec 05 '18 23:12 minbi

Hmm, interesting. However, if the token is not yet available I will still have the same problem, because that call doesn't seem to be blocking: it just returns a String that AWSAppSyncClient will try to use as the token.

Dec 05 '18 23:12 Joseph82

Hi @minbi, I think you gave me the right tip. I am now overriding getLatestAuthToken, but instead of getting the idToken from AWSMobileClient, I use the following for signUp and signIn:

CognitoUserPool.signUp() instead of CognitoUserPool.signUpInBackground()

CognitoUser.getSession() instead of CognitoUser.getSessionInBackground()

since getLatestAuthToken() is already executed in a different (non-UI) thread.

When I get the session, then I just call cognitoUserSession.getIdToken().getJWTToken()

This seems to solve my problem. There is still the possibility, though, that because of a temporary connection leak I am not able to get a valid session. In that case the problem still stands, since the mutation will fail and I have no idea how to recover it.

Dec 06 '18 16:12 Joseph82

I checked the implementation of AWSAppSyncClient and I saw that OkHttpClient is using an interceptor for handling failure and retry: RetryInterceptor

Apparently this interceptor only reschedules the request when the response status code is in the range 500-599 or equal to 429. This means that there is no retry attempt when the request fails because of authentication issues (401 Unauthorized).
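
The rule as I read it can be sketched in a few lines. This is not the SDK's code, just the condition described above; `RetryRule` and `isRetriable` are names made up for illustration.

```java
/** Hypothetical helper mirroring the rule described above; not the SDK's RetryInterceptor. */
final class RetryRule {
    private RetryRule() {}

    /** Returns true for server errors (500-599) and for 429 Too Many Requests. */
    static boolean isRetriable(int statusCode) {
        return (statusCode >= 500 && statusCode < 600) || statusCode == 429;
    }
}
```

Under this rule a 401 falls through to "do not retry", which is why a mutation sent before the sign-in completes is attempted only once.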

Is there a way to retry when this kind of failure happens?

Dec 09 '18 15:12 Joseph82

The PersistentMutationsCallback will notify you of failures in offline mutation queue. You can attach it to the AppSync client when building it. See here

Dec 10 '18 21:12 minbi

Thanks for replying @minbi

That's actually what I do, but in case of failure the callback just returns a PersistentMutationsError object, which has only the following three methods:

getRecordIdentifier()
getMutationClassName()
getException()

I thought getRecordIdentifier could be useful for understanding which mutation went wrong, but apparently it isn't.

Dec 10 '18 21:12 Joseph82

OK, so from my understanding there is currently no way to prevent a mutation from getting lost after an authentication failure. As I said, none of the information returned in PersistentMutationsError is useful for retrying the mutation.

I am just wondering if it is something that might be implemented in the near future. Otherwise I might try to write a custom OkHttp Interceptor.

Dec 13 '18 07:12 Joseph82

@Joseph82 This is something that I am looking into.

One idea/question that I'd like your thoughts on: In the retry strategy, we do not retry on 401 or 403 errors as we don't deem them as being re-triable. The idea I have is to retry if we have a 401 or 403 error, but restrict it to a maximum of 1 retry. Would that work in your case?
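
As a rough sketch of that idea (not the SDK's actual code): `AuthAwareRetryPolicy`, `shouldRetry` and the `maxServerRetries` parameter are hypothetical names, and the server-error branch assumes the existing rule of retrying on 5xx and 429.

```java
/** Hypothetical policy: one extra attempt for 401/403, bounded retries for 5xx and 429. */
final class AuthAwareRetryPolicy {
    private static final int MAX_AUTH_RETRIES = 1; // the proposed limit for 401/403
    private final int maxServerRetries;            // assumed configurable limit for 5xx/429

    AuthAwareRetryPolicy(int maxServerRetries) {
        this.maxServerRetries = maxServerRetries;
    }

    /** Decides whether to retry, given the last status code and the retries already made. */
    boolean shouldRetry(int statusCode, int retriesSoFar) {
        if (statusCode == 401 || statusCode == 403) {
            return retriesSoFar < MAX_AUTH_RETRIES;
        }
        if (statusCode == 429 || (statusCode >= 500 && statusCode < 600)) {
            return retriesSoFar < maxServerRetries;
        }
        return false; // other client errors would fail again with the same request
    }
}
```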

Dec 19 '18 02:12 scb01

Also, can you give v2.7.1 of the SDK a try? It contains a number of fixes to the threading model aimed at fixing race conditions. Please see if that fixes or reduces the temporary connection leak that you referenced earlier in this thread.

Dec 19 '18 04:12 scb01

Hi @cbommas, I think the retry strategy defined in the RetryInterceptor makes sense, because normally if the client sends a wrong request (e.g. MalformedQueryString), there is no reason to retry (the same request is going to fail again). So the idea is to retry when the failure depends on something external (hopefully a temporary issue), like a 503 ServiceUnavailable.

However, to get the authentication token the client often depends on an HTTP connection (unless the token is still available locally), which of course can fail for external, transient reasons.

So giving 1 more retry attempt will surely reduce the risk of failure, but I think the problem still stands. It is maybe an acceptable risk, because even with a normal mutation there is no 100% guarantee that it is going to be executed (I think the retry limit is currently set to 12).

I was wondering if it would be possible to delegate to the developer the decision of whether the cause of the failure was external or not, either by transforming:

public interface CognitoUserPoolsAuthProvider {
    public String getLatestAuthToken();
}

into something like:

public interface CognitoUserPoolsAuthProvider {
    public String onSuccess(String authToken);
    public void onFailure();
}

so that if I can't get the token due to a temporary server error, I can call onFailure() and the mutation can be stopped and retried. Of course this puts the responsibility in the developer's hands, because I could theoretically call onFailure() even if the Cognito HTTP error is due to a client mistake.

Or by throwing a specific Exception from getLatestAuthToken() that would have the same meaning as onFailure().
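
A minimal sketch of the exception variant, in plain Java rather than SDK code. Everything here is hypothetical: `TemporaryAuthFailureException`, `TokenProvider` (a stand-in for the provider interface) and `MutationQueueSketch` (a stand-in for the mutation queue, with the network call reduced to a comment).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical signal that the token could not be fetched for a temporary reason. */
final class TemporaryAuthFailureException extends RuntimeException {
    TemporaryAuthFailureException(String message) {
        super(message);
    }
}

/** Stand-in for the provider interface, reduced to the one method discussed here. */
interface TokenProvider {
    String getLatestAuthToken();
}

/** Stand-in for the mutation queue: stops draining when the provider signals a temporary failure. */
final class MutationQueueSketch {
    private final Deque<String> pending = new ArrayDeque<>();

    void enqueue(String mutation) {
        pending.addLast(mutation);
    }

    int size() {
        return pending.size();
    }

    /** Sends mutations in order and returns how many were sent; unsent ones keep their place. */
    int drain(TokenProvider provider) {
        int sent = 0;
        while (!pending.isEmpty()) {
            try {
                provider.getLatestAuthToken();
                // A real client would send pending.peekFirst() with the token here.
            } catch (TemporaryAuthFailureException e) {
                break; // keep the current mutation at the head of the queue for a later attempt
            }
            pending.removeFirst();
            sent++;
        }
        return sent;
    }
}
```

The point of the sketch is that the failed mutation is neither dropped nor moved: a later call to `drain` resumes from the same position.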

And thanks for the new 2.7.1, it seems indeed to address all the scenarios I was worrying about. Good job!

Dec 21 '18 09:12 Joseph82

@cbommas any feedback?

Jan 21 '19 12:01 Joseph82

@Joseph82

Sorry for the delay in my response. Here are some of my thoughts:

1. I am a little reluctant to make changes to the interface at the credentials provider level, as it will have some backward compatibility impact. A way to achieve additional resiliency could be for developers to retry in their implementation of the credentials provider, or to inspect the token returned and refresh it if it is going to expire in the next few seconds, etc.
2. If we move outside of the Credentials Provider space, a way would be to provide a shim in the RetryInterceptor to specify a custom retry policy to continue the retries.
3. The fallback mechanism would be that the developer can always retry the mutation by looking at the error returned in the onError callback; however, that would mean that the mutation will lose its place in the mutation queue and go to the back of the queue.
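
The first option above, retrying inside the provider implementation, could look roughly like the sketch below. It is plain Java and makes assumptions: `RetryingTokenFetcher` is a hypothetical name, the `Supplier` stands in for whatever blocking call fetches the token, and it retries on any RuntimeException, whereas a real implementation would retry only on transient failures.

```java
import java.util.function.Supplier;

/** Hypothetical wrapper that retries a blocking token fetch with exponential backoff. */
final class RetryingTokenFetcher {
    private final Supplier<String> fetch; // stand-in for the blocking sign-in or token call
    private final int maxAttempts;
    private final long initialDelayMillis;

    RetryingTokenFetcher(Supplier<String> fetch, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.fetch = fetch;
        this.maxAttempts = maxAttempts;
        this.initialDelayMillis = initialDelayMillis;
    }

    /** Returns the token, or rethrows the last failure once all attempts are used up. */
    String getLatestAuthToken() {
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                try {
                    Thread.sleep(delay); // acceptable only off the UI thread
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
                delay *= 2; // double the wait before the next attempt
            }
        }
        throw last;
    }
}
```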

What are your thoughts on the first two options?

Jan 24 '19 02:01 scb01

Hi @cbommas

I would avoid the third option, because the order of the mutations in the queue is most of the time crucial for correct synchronization.

In my opinion a custom retry policy for the RetryInterceptor would be a good option.
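
To make the request concrete, here is a rough sketch of what such a hook might look like. None of this exists in the SDK: `RetryPolicy` and `RetryLoop` are hypothetical names, the request is reduced to an `IntSupplier` returning a status code, and the waiting between attempts that a real interceptor would do is left out.

```java
import java.util.function.IntSupplier;

/** Hypothetical callback letting the developer decide whether a failed request is retried. */
interface RetryPolicy {
    boolean shouldRetry(int statusCode, int retriesSoFar);
}

/** Stand-in for the interceptor's loop: repeats the request while the policy allows it. */
final class RetryLoop {
    private RetryLoop() {}

    /** Runs the request, consulting the policy after each attempt; returns the final status code. */
    static int execute(IntSupplier request, RetryPolicy policy) {
        int retries = 0;
        int status = request.getAsInt();
        while (policy.shouldRetry(status, retries)) {
            retries++;
            status = request.getAsInt();
        }
        return status;
    }
}
```

With a hook like this, an app that signs in asynchronously could pass a policy such as `(status, retries) -> status == 401 && retries < 5`, while the default would keep the current behavior.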

Jan 28 '19 08:01 Joseph82

Thanks for the feedback, @Joseph82. I will mark this thread as a feature request and get back to you once I have a timeline.

Jan 28 '19 15:01 scb01

Hi team, any update on this feature request?

Oct 01 '23 12:10 ds24449