Support creating hundreds of simultaneous Bigtable Table clients
In some of their tests our customers need to do things like this:
void F() {
  std::string instance_id = "test-instance";
  std::string project_id = "test-project";
  std::mutex mu;
  std::vector<std::shared_ptr<cbt::DataClient>> clients;
  std::vector<std::thread> tasks;
  for (int i = 0; i != 200; ++i) {
    tasks.push_back(std::thread{[&] {
      std::lock_guard lk(mu);
      clients.push_back(CreateDefaultDataClient(project_id, instance_id, ...));
    }});
    tasks.push_back(std::thread{[&](int i) {
      std::this_thread::sleep_for( /* random time */);
      std::unique_lock lk(mu);
      if (clients.size() < static_cast<std::size_t>(i) + 1) return;
      auto data_client = clients[i];
      lk.unlock();
      cbt::Table table(data_client, "table_id");
      table.Apply( /* stuff */);
    }, i});
  }
  for (auto& t : tasks) t.join();
}
They are reporting deadlocks in this code (or code equivalent to it).
Googlers can look at cl/358000434 for more details.
We will revisit in the future.
Problem Diagnosis
So this isn't a deadlock, and the issue has nothing to do with the threads. What is actually happening is that a DataClient's destruction may get held up waiting for the result of the channel refresh operation: https://github.com/googleapis/google-cloud-cpp/blob/18937be27e09b01ada592e090014e5b239aae674/google/cloud/bigtable/internal/connection_refresh_state.cc#L76-L78
Note that this timeout is 10s long: https://github.com/googleapis/google-cloud-cpp/blob/18937be27e09b01ada592e090014e5b239aae674/google/cloud/bigtable/internal/connection_refresh_state.cc#L30
So we are seeing delays that could be as high as 10s * N, as we wait for each client to go out of scope. If N is large enough this starts looking like a deadlock.
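To put a number on it: with the 200 clients from the snippet above, the worst case is roughly 200 x 10s, about 33 minutes before F() returns.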
Simpler Repro
We can write a simpler test that exposes this behavior:
// This test will take longer than 10s as we are guaranteed to hit a refresh
// that will not complete. Ideally it would only take a second or so.
TEST(DeadlocksTest, DataClientNoThreads) {
  auto constexpr kRefreshPeriod = std::chrono::milliseconds(500);
  auto options =
      Options{}
          .set<GrpcCredentialOption>(grpc::InsecureChannelCredentials())
          .set<GrpcNumChannelsOption>(1)
          .set<MaxConnectionRefreshOption>(kRefreshPeriod);
  auto dc = MakeDataClient("p", "i", options);
  (void)dc->Channel();
  std::this_thread::sleep_for(kRefreshPeriod * 1.5);
}
Other Observations
One thing to notice is that DataConnection does not have this problem: swapping out the DataClient for a DataConnection does not lead to timeouts. The key difference is that we pass around a temporary shared_ptr<CompletionQueue> when we create the Stubs: https://github.com/googleapis/google-cloud-cpp/blob/18937be27e09b01ada592e090014e5b239aae674/google/cloud/bigtable/internal/bigtable_stub_factory.cc#L60
... whereas in the DataClient we hold onto it for the lifetime of the object: https://github.com/googleapis/google-cloud-cpp/blob/18937be27e09b01ada592e090014e5b239aae674/google/cloud/bigtable/internal/common_client.h#L188
This means the DataConnection never refreshes its channels, which is a bug: the refresh always backs out because there are no more references to the CQ: https://github.com/googleapis/google-cloud-cpp/blob/18937be27e09b01ada592e090014e5b239aae674/google/cloud/bigtable/internal/connection_refresh_state.cc#L74-L75
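To illustrate the mechanism, here is a generic sketch, assuming (as the linked lines suggest) that the refresh callback holds something like a weak_ptr to the CQ. FakeCompletionQueue and MaybeRefresh are stand-ins, not library types:

#include <iostream>
#include <memory>

struct FakeCompletionQueue {};

// The refresh only proceeds if someone still owns a shared_ptr to the CQ.
void MaybeRefresh(std::weak_ptr<FakeCompletionQueue> weak) {
  if (auto cq = weak.lock()) {
    std::cout << "CQ still alive: schedule the next refresh\n";
  } else {
    std::cout << "CQ already gone: back out, never refresh again\n";
  }
}

int main() {
  // DataConnection-like case: only a temporary shared_ptr ever existed,
  // so the weak_ptr is already expired and the refresh backs out.
  std::weak_ptr<FakeCompletionQueue> w = std::make_shared<FakeCompletionQueue>();
  MaybeRefresh(w);

  // DataClient-like case: the owner keeps the shared_ptr alive for its
  // lifetime, so the refresh keeps running (and destruction must wait on it).
  auto owner = std::make_shared<FakeCompletionQueue>();
  MaybeRefresh(owner);
}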
We can "fix" the DataConnection problem by using a shared_ptr<internal::CompletionQueueImpl> (owned by the background threads) for channel refreshing. By "fix", I mean the DataConnection might also need to wait 10s for the CQ work to be done before it can be destroyed. Yay?
Paths Forward
There are a few things we can do.
My Suggestion
I think we can simplify the connection refresh logic to periodically call grpc::Channel::GetState(true).
https://github.com/dbolduc/google-cloud-cpp/commit/d3a9a4182a5751f6c2723eae41e8298e69db9120
The "gRPC Connectivity Semantics and API" doc says calling GetState(true) only matters if the channel is currently IDLE. The channel will continue trying to connect on its own if it is in CONNECTING or TRANSIENT_FAILURE. If it is READY or SHUTDOWN, we are not affecting it. So making one call has the same effect as what we are doing now.
The only difference between this suggestion and the current code is a log entry on why the channel couldn't refresh: https://github.com/googleapis/google-cloud-cpp/blob/501fa37eca2002f0c917984e8a14f6bf61591664/google/cloud/bigtable/internal/connection_refresh_state.cc#L82-L83
I do not think the log entry is worth the extra complication that comes with it; we continue scheduling the refreshes regardless of the outcome.
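For concreteness, a minimal sketch of what the simplified refresh could look like (this is not the code in the linked commit; ScheduleChannelRefresh is a hypothetical name): a CQ timer that pokes the channel and reschedules itself, with nothing to block on at destruction.

#include "google/cloud/completion_queue.h"
#include <grpcpp/grpcpp.h>
#include <chrono>
#include <memory>

// Periodically nudge the channel. GetState(true) asks gRPC to connect if the
// channel is IDLE and is a no-op in every other state. There is no pending
// wait, so destroying the client never blocks on this work.
void ScheduleChannelRefresh(google::cloud::CompletionQueue cq,
                            std::shared_ptr<grpc::Channel> channel,
                            std::chrono::milliseconds period) {
  cq.MakeRelativeTimer(period).then([cq, channel, period](auto f) {
    if (!f.get()) return;  // the timer was cancelled, e.g. the CQ shut down
    (void)channel->GetState(/*try_to_connect=*/true);
    ScheduleChannelRefresh(cq, channel, period);
  });
}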
Pros:
- No waiting on gRPC
- Simpler code
- Lets us deprecate and delete CQ::AsyncWaitConnectionReady.
Cons:
- Less information in the logs
  - ... sort of. We could always log the current state if we wanted to.
- I do not have an integration test for refresh behavior. We are trusting logic and reading comprehension.
Alternatives Considered
Do Nothing
We could do nothing. The Google team who reported the bug has a vector of clients, protected by a mutex. They want to test that their data access is thread safe. They can just disable channel refreshing.
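If I remember the option's semantics correctly, setting MaxConnectionRefreshOption to zero disables the background refresh entirely, so their test could do something like:

  namespace cbt = ::google::cloud::bigtable;
  // A zero refresh period disables the channel refresh (per the option docs).
  auto options = google::cloud::Options{}.set<cbt::MaxConnectionRefreshOption>(
      std::chrono::milliseconds(0));
  auto data_client =
      cbt::MakeDataClient("test-project", "test-instance", options);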
Make kConnectionReadyTimeout a configurable option.
If anyone needs to make hundreds of clients, they can reduce this timeout to have the application exit considerably faster.
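A hypothetical sketch of what that could look like; ConnectionReadyTimeoutOption does not exist today and is only for illustration, following the library's usual option pattern:

  // Hypothetical option: the value type mirrors kConnectionReadyTimeout.
  struct ConnectionReadyTimeoutOption {
    using Type = std::chrono::milliseconds;
  };

  // An application creating hundreds of clients could then shrink the wait.
  auto options = google::cloud::Options{}.set<ConnectionReadyTimeoutOption>(
      std::chrono::seconds(1));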
Make AsyncConnectionReadyFuture cancellable. Specifically this operation: https://github.com/googleapis/google-cloud-cpp/blob/f779bb99395bbdfe3aac51864ff2abf2a0b6dffc/google/cloud/internal/async_connection_ready.cc#L75-L78
I did get this to work. My idea was to break up the 10s deadline into smaller pieces, e.g. 1s each. Then, if the operation was cancelled, we could back out early without spending the full 10s.
https://github.com/dbolduc/google-cloud-cpp/commit/b7f1fbcada97146a5121a985bcb20eff66ed7b24
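Roughly the idea, as a standalone, blocking sketch that assumes we can check a cancellation flag between slices. This is not the code in the linked commit, which does the same thing asynchronously on the library's CompletionQueue:

#include <grpcpp/grpcpp.h>
#include <atomic>
#include <chrono>

// Wait for the channel to become ready, but in 1-second slices, checking a
// cancellation flag between slices instead of blocking for the full deadline.
bool WaitForConnectionReady(grpc::Channel& channel, grpc::CompletionQueue& cq,
                            std::atomic<bool> const& cancelled) {
  auto const deadline =
      std::chrono::system_clock::now() + std::chrono::seconds(10);
  auto state = channel.GetState(/*try_to_connect=*/true);
  while (state != GRPC_CHANNEL_READY) {
    if (state == GRPC_CHANNEL_SHUTDOWN) return false;
    if (cancelled || std::chrono::system_clock::now() >= deadline) return false;
    auto const slice =
        std::chrono::system_clock::now() + std::chrono::seconds(1);
    channel.NotifyOnStateChange(state, slice, &cq, nullptr);
    void* tag;
    bool ok;
    cq.Next(&tag, &ok);  // returns when the state changes or the slice expires
    state = channel.GetState(/*try_to_connect=*/true);
  }
  return true;
}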
Pros:
- This CQ method is in our public API, so cancelling is technically a feature.
  - ... but we should probably deprecate it. It is too low level, and I doubt it is in use.
Cons:
- It adds complexity
- More work is done on the background threads
The problem with the proposed approach of just calling channel->GetState(true) is that the state of the channel might not get updated, unless it is being actively polled (by something like channel->NotifyOnStateChange()).
And to be explicit, channel->NotifyOnStateChange() is not a cancellable operation, which is why we would have to manually cancel it in one of the alternatives considered.
Closing this issue, as the original motivation no longer applies.