cpp_client_telemetry PauseTransmission and asynchronous HTTPS

The present behavior of PauseTransmission (and FlushAndTeardown) prevents the SDK from starting new HTTPS requests, though since uploads start asynchronously the exact point at which the pause takes effect is undefined.

As a result, we may get callbacks from an outstanding HTTP request some considerable (unbounded) time after the call to PauseTransmission. I would like to amend the API so that:

we can explicitly ask that the SDK discard responses that arrive while we are paused. This is also useful for FlushAndTeardown, where a response could arrive after we have freed heap that we would have accessed while processing the callback.
we can get a promise (or similar structure) to inform us when the pause is complete and no new responses will arrive.
we can abort pending requests if the HttpClient implementation supports abort.

I'm not an iOS expert, but we have seen watchdog SIGKILL in iOS around background execution and suspend. iOS expects that an application in the background will cooperate with the operating system to schedule its execution and minimize activity that would consume battery life.

Dec 21 '20 19:12 larvacea

@lalitb looked at somewhat related PR with a different goal in mind: to allow the scrub of pending HTTP requests. That is in case if user exercises the right to be forgotten and says "Delete my telemetry now" (in Edge browser scenario). But we still have some HTTP requests pending, in-flight.. So, ideally, we need to cancel / abort whatever we can cancel.

I'd be good to ensure that all HTTP client implementations have a provision for abort, that:

aborts;
if it can't abort, e.g. just finished uploading without aborting, then it returns the actual state of what happened to pending HTTP request (e.g. can't abort just because can't abort, or can't abort because already completed the upload).

For iOS specifically, and maybe for other OS - is there some flag we can use to tell that we are being shutdown fast (e.g. abnormal termination), and in that case we skip the wait, abort all pending threads, and try to proceed with fast shutdown. I think @bliptec was asking for this "leaky but fast and safe" shutdown option. This is what would've worked for iOS Pause+Flush+suspend scenarios as well..

Basically getting SDK in suspended state. Maybe something that is doing LogManager::Suspend rather than Teardown

Dec 21 '20 21:12 maxgolov

The iOS teams client is crashing at 0.07% of sessions, likely with a related problem. The most recent stack trace shows an upload job crashing with use-after-free because the config object has torn down during app shutdown. So in this case, we need to have some way to wait until transmission quiesces to avoid this race. Happy days!

My current plan of record: implement a Suspend as Max suggests, as this would work for both this crash scenario and the iOS application suspend scenario.

Jan 19 '21 17:01 larvacea

Please also note that while imperfect - there is a change that now ensures that the config object is static, and remains in scope for the entire process lifecycle. There was a regression where config object was going out-of-scope (alongside with potentially all items in it, including all const char * values). This is resolved in this commit.

Right now on Windows we reliably test / wait for all requests termination on FlushAndTeardown path. The logic for that lives in TPM here . I think how we may need to manage this - is to extract this code into separate routine, then invoke it from handlePause ... Maybe not only pausing, but actively aborting / canceling whatever uploads have been scheduled / in progress. That is to ensure that the PauseTransmission API is rather fast to return. Because if we don't abort, we do not know beforehand how long it'd take to ack / consolidate all pending HTTP requests. This may be up to ~20 seconds on slow internet connection.

Jan 19 '21 20:01 maxgolov

Before we go with Suspend implementation - is it possible for Apple HTTP client to cancel pending requests? Basically the onPause on mobile may need to have a handler that is not only preventing new, but canceling all active pending HTTP requests. That way PauseTransmission becomes blocking call that leaves no pending HTTP requests behind. And it should be safe to suspend in that state. We handle it like this for the FlushAndTeardown on Windows: give up to N sec to complete, then cancel / abort all other pending. For Linux (libcurl) we also abort by force-closing underlying socket connection. There should be something like this in Apple HTTP stack: https://developer.apple.com/documentation/foundation/urlsessiontask/1411591-cancel - cancel all on pause. We can immediately (without waiting for N seconds) cancel all pending requests if PauseTransmission is invoked on Mobile platforms. That should solve it?

Jan 21 '21 07:01 maxgolov

This is also useful for FlushAndTeardown, where a response could arrive after we have freed heap that we would have accessed while processing the callback.

It was a problem before. And we fully solved this in Windows HTTP client, by cancelling all pending requests, and waiting for their cancellation callbacks to come until we get zero pending requests. Then we shutdown safely. If this issue happens on iOS on FlushAndTeardown path, then we need to engage the URLSessionTask.cancel() API in Apple iOS stack implementation. Need to ensure that PauseTransmission behaves just like FlushAndTeardown with respect to cleaning up all pending HTTP requests on iOS.

we can abort pending requests if the HttpClient implementation supports abort.

+1 to abort pending requests , which is called cancel() on iOS.

Jan 21 '21 07:01 maxgolov

I'm looking more at this, and we already have a cancellation logic here in onPause handler:

https://github.com/microsoft/cpp_client_telemetry/blob/09b1e283b74e238c9c498e95948776ab804495d5/lib/system/TelemetrySystem.cpp#L131

Unless there is a scenario where request cannot be canceled, e.g. HTTP status callback is about to come? And what maybe happens - we get a callback from Apple HTTP stack from request we thought is cancelled, and we fall thru to destroy the object, then the callback gets serviced on object that's already freed. The way how it's handled in WinInet client: we not only cancel the request. But even if it's about to trip some response - we promptly unregister the listener, that way the framework can no longer call us anymore on that session. If that's not done in Apple HTTP client, this needs to be fixed in there.

Jan 21 '21 08:01 maxgolov

I'm not familiar with the iOS HTTP stack.

On Android, process exit (and not much else) can reliably cancel the thread executing the synchronous HTTP call, so the implementation is built with mutex locking to ensure that either the callback or the cancel will win any race, and callbacks that arrive (well, synchronous threads that continue after blocking, and potentially turn into callbacks) after cancel will abort before touching the object that cancel might have freed. The intention is to play well with the Windows cancel behavior, despite the very different implementation.

Jan 24 '21 20:01 larvacea