Support creating hundreds of simultaneous Bigtable Table clients
In some of their tests our customers need to do things like this:
void F() {
  std::string instance_id = "test-instance";
  std::string project_id = "test-project";
  std::mutex mu;
  std::vector<std::shared_ptr<cbt::DataClient>> clients;
  std::vector<std::thread> tasks;
  for (int i = 0; i != 200; ++i) {
    tasks.push_back(std::thread{[&] {
      std::lock_guard lk(mu);
      clients.push_back(CreateDefaultDataClient(project_id, instance_id, ...));
    }});
    tasks.push_back(std::thread{[&](int i) {
      std::this_thread::sleep_for( /* random time */);
      std::unique_lock lk(mu);
      if (clients.size() < static_cast<std::size_t>(i) + 1) return;
      auto data_client = clients[i];
      lk.unlock();
      cbt::Table table(data_client, "table_id");
      table.Apply( /* stuff */);
    }, i});
  }
  for (auto& t : tasks) t.join();
}
They are reporting deadlocks in this code (or code equivalent to it).
Googlers can look at cl/358000434 for more details.
We will revisit in the future.
Problem Diagnosis
So this isn't a deadlock, and the issue has nothing to do with the threads. What is actually happening is that a DataClient's destruction may get held up waiting for the result of the channel refresh operation: https://github.com/googleapis/google-cloud-cpp/blob/18937be27e09b01ada592e090014e5b239aae674/google/cloud/bigtable/internal/connection_refresh_state.cc#L76-L78
Note that this timeout is 10s long: https://github.com/googleapis/google-cloud-cpp/blob/18937be27e09b01ada592e090014e5b239aae674/google/cloud/bigtable/internal/connection_refresh_state.cc#L30
So we are seeing delays that could be as high as 10s * N, as we wait for each client to go out of scope. If N is large enough this starts looking like a deadlock.
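To put a number on it: with the 200 clients from the snippet above, the worst case is roughly 200 x 10s, about 33 minutes before F() returns.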
Simpler Repro
We can write a simpler test that exposes this behavior:
// This test will take longer than 10s as we are guaranteed to hit a refresh
// that will not complete. Ideally it would only take a second or so.
TEST(DeadlocksTest, DataClientNoThreads) {
  auto constexpr kRefreshPeriod = std::chrono::milliseconds(500);
  auto options =
      Options{}
          .set<GrpcCredentialOption>(grpc::InsecureChannelCredentials())
          .set<GrpcNumChannelsOption>(1)
          .set<MaxConnectionRefreshOption>(kRefreshPeriod);
  auto dc = MakeDataClient("p", "i", options);
  (void)dc->Channel();
  std::this_thread::sleep_for(kRefreshPeriod * 1.5);
}
Other Observations
One thing to notice is that DataConnection does not have this problem: swapping out the DataClient for a DataConnection does not lead to timeouts. The key difference is that we pass around a temporary shared_ptr<CompletionQueue> when we create the Stubs: https://github.com/googleapis/google-cloud-cpp/blob/18937be27e09b01ada592e090014e5b239aae674/google/cloud/bigtable/internal/bigtable_stub_factory.cc#L60
... whereas in the DataClient we hold onto it for the lifetime of the object: https://github.com/googleapis/google-cloud-cpp/blob/18937be27e09b01ada592e090014e5b239aae674/google/cloud/bigtable/internal/common_client.h#L188
This means the DataConnection never refreshes its channels, which is a bug: the refresh always backs out because there are no more references to the CQ: https://github.com/googleapis/google-cloud-cpp/blob/18937be27e09b01ada592e090014e5b239aae674/google/cloud/bigtable/internal/connection_refresh_state.cc#L74-L75
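To illustrate the mechanism, here is a generic sketch, assuming (as the linked lines suggest) that the refresh callback holds something like a weak_ptr to the CQ. FakeCompletionQueue and MaybeRefresh are stand-ins, not library types:

#include <iostream>
#include <memory>

struct FakeCompletionQueue {};

// The refresh only proceeds if someone still owns a shared_ptr to the CQ.
void MaybeRefresh(std::weak_ptr<FakeCompletionQueue> weak) {
  if (auto cq = weak.lock()) {
    std::cout << "CQ still alive: schedule the next refresh\n";
  } else {
    std::cout << "CQ already gone: back out, never refresh again\n";
  }
}

int main() {
  // DataConnection-like case: only a temporary shared_ptr ever existed,
  // so the weak_ptr is already expired and the refresh backs out.
  std::weak_ptr<FakeCompletionQueue> w = std::make_shared<FakeCompletionQueue>();
  MaybeRefresh(w);

  // DataClient-like case: the owner keeps the shared_ptr alive for its
  // lifetime, so the refresh keeps running (and destruction must wait on it).
  auto owner = std::make_shared<FakeCompletionQueue>();
  MaybeRefresh(owner);
}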
We can "fix" the DataConnection problem by using a shared_ptr<internal::CompletionQueueImpl> (owned by the background threads) for channel refreshing. By "fix", I mean the DataConnection might also need to wait 10s for the CQ work to be done before it can be destroyed. Yay?
Paths Forward
There are a few things we can do.
My Suggestion
I think we can simplify the connection refresh logic to periodically call grpc::Channel::GetState(true).
https://github.com/dbolduc/google-cloud-cpp/commit/d3a9a4182a5751f6c2723eae41e8298e69db9120
The "gRPC Connectivity Semantics and API" doc says calling GetState(true) only matters if the channel is currently IDLE. The channel will continue trying to connect on its own if it is in CONNECTING or TRANSIENT_FAILURE. If it is READY or SHUTDOWN, we are not affecting it. So making one call has the same effect as what we are doing now.
The only difference between this suggestion and the current code is a log entry on why the channel couldn't refresh: https://github.com/googleapis/google-cloud-cpp/blob/501fa37eca2002f0c917984e8a14f6bf61591664/google/cloud/bigtable/internal/connection_refresh_state.cc#L82-L83
I do not think the log entry is worth the extra complication that comes with it; we continue scheduling the refreshes regardless of the outcome.
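For concreteness, a minimal sketch of what the simplified refresh could look like (this is not the code in the linked commit; ScheduleChannelRefresh is a hypothetical name): a CQ timer that pokes the channel and reschedules itself, with nothing to block on at destruction.

#include "google/cloud/completion_queue.h"
#include <grpcpp/grpcpp.h>
#include <chrono>
#include <memory>

// Periodically nudge the channel. GetState(true) asks gRPC to connect if the
// channel is IDLE and is a no-op in every other state. There is no pending
// wait, so destroying the client never blocks on this work.
void ScheduleChannelRefresh(google::cloud::CompletionQueue cq,
                            std::shared_ptr<grpc::Channel> channel,
                            std::chrono::milliseconds period) {
  cq.MakeRelativeTimer(period).then([cq, channel, period](auto f) {
    if (!f.get()) return;  // the timer was cancelled, e.g. the CQ shut down
    (void)channel->GetState(/*try_to_connect=*/true);
    ScheduleChannelRefresh(cq, channel, period);
  });
}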
Pros:
- No waiting on gRPC
- Simpler code
- Lets us deprecate and delete CQ::AsyncWaitConnectionReady.
Cons:
- Less information in the logs
  - ... sort of. We could always log the current state if we wanted to.
- I do not have an integration test for refresh behavior. We are trusting logic and reading comprehension.
Alternatives Considered
Do Nothing
We could do nothing. The Google team who reported the bug has a vector of clients, protected by a mutex. They want to test that their data access is thread safe. They can just disable channel refreshing.
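If I remember the option's semantics correctly, setting MaxConnectionRefreshOption to zero disables the background refresh entirely, so their test could do something like:

  namespace cbt = ::google::cloud::bigtable;
  // A zero refresh period disables the channel refresh (per the option docs).
  auto options = google::cloud::Options{}.set<cbt::MaxConnectionRefreshOption>(
      std::chrono::milliseconds(0));
  auto data_client =
      cbt::MakeDataClient("test-project", "test-instance", options);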
Make kConnectionReadyTimeout a configurable option.
If anyone needs to make hundreds of clients, they can reduce this timeout to have the application exit considerably faster.
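A hypothetical sketch of what that could look like; ConnectionReadyTimeoutOption does not exist today and is only for illustration, following the library's usual option pattern:

  // Hypothetical option: the value type mirrors kConnectionReadyTimeout.
  struct ConnectionReadyTimeoutOption {
    using Type = std::chrono::milliseconds;
  };

  // An application creating hundreds of clients could then shrink the wait.
  auto options = google::cloud::Options{}.set<ConnectionReadyTimeoutOption>(
      std::chrono::seconds(1));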
Make AsyncConnectionReadyFuture cancellable. Specifically this operation: https://github.com/googleapis/google-cloud-cpp/blob/f779bb99395bbdfe3aac51864ff2abf2a0b6dffc/google/cloud/internal/async_connection_ready.cc#L75-L78
I did get this to work. My idea was to break up the 10s deadline into smaller pieces, e.g. 1s each. Then, if the operation was cancelled, we could back out early without spending the full 10s.
https://github.com/dbolduc/google-cloud-cpp/commit/b7f1fbcada97146a5121a985bcb20eff66ed7b24
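Roughly the idea, as a standalone, blocking sketch that assumes we can check a cancellation flag between slices. This is not the code in the linked commit, which does the same thing asynchronously on the library's CompletionQueue:

#include <grpcpp/grpcpp.h>
#include <atomic>
#include <chrono>

// Wait for the channel to become ready, but in 1-second slices, checking a
// cancellation flag between slices instead of blocking for the full deadline.
bool WaitForConnectionReady(grpc::Channel& channel, grpc::CompletionQueue& cq,
                            std::atomic<bool> const& cancelled) {
  auto const deadline =
      std::chrono::system_clock::now() + std::chrono::seconds(10);
  auto state = channel.GetState(/*try_to_connect=*/true);
  while (state != GRPC_CHANNEL_READY) {
    if (state == GRPC_CHANNEL_SHUTDOWN) return false;
    if (cancelled || std::chrono::system_clock::now() >= deadline) return false;
    auto const slice =
        std::chrono::system_clock::now() + std::chrono::seconds(1);
    channel.NotifyOnStateChange(state, slice, &cq, nullptr);
    void* tag;
    bool ok;
    cq.Next(&tag, &ok);  // returns when the state changes or the slice expires
    state = channel.GetState(/*try_to_connect=*/true);
  }
  return true;
}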
Pros:
- This CQ method is in our public API, so cancelling is technically a feature.
  - ... but we should probably deprecate it. It is too low level, and I doubt it is in use.
Cons:
- It adds complexity
- More work is done on the background threads
The problem with the proposed approach of just calling channel->GetState(true) is that the state of the channel might not get updated, unless it is being actively polled (by something like channel->NotifyOnStateChange()).
And to be explicit, channel->NotifyOnStateChange() is not a cancellable operation, which is why we would have to manually cancel it in one of the alternatives considered.
Closing this issue, as the original motivation no longer applies.