[DocDB] Fix race that allows duplicate snapshot schedules

Open joe-maley opened this issue 2 years ago • 0 comments

Jira Link: DB-3151

Description

Commit 8393a7c81b89cb0c668315205517c68708d17870 introduced a check in CreateSchedule that fails a snapshot schedule creation request if a schedule already exists in the same keyspace. The check holds mutex_ while reading schedules_, but must release it before invoking SynchronizedWrite to avoid a deadlock. As a result, the existence check and the write are not atomic.

{
  std::lock_guard<std::mutex> lock(mutex_);
  const auto& existing_schedule = FindSnapshotSchedule(namespace_name, namespace_type);
  ...
}
...
RETURN_NOT_OK(SynchronizedWrite(std::move(write_batch), leader_term, deadline, &context_));

We have the following race if concurrent requests attempt to create a snapshot schedule in the same keyspace. Both requests pass the check, and both schedules are written:

Thread-1: Takes the mutex_, verifies that a schedule does not exist for ysql.joe. Releases the mutex_.
Thread-2: Takes the mutex_, verifies that a schedule does not exist for ysql.joe. Releases the mutex_.
Thread-1: Invokes SynchronizedWrite()
Thread-2: Invokes SynchronizedWrite()

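One possible fix would be to introduce a pending_keyspaces_ structure that tracks the keyspaces for which a snapshot schedule write is in flight. The check under mutex_ would then fail a request if the keyspace appears in either schedules_ or pending_keyspaces_.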
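
A minimal sketch of that idea, not the actual YugabyteDB code: `ScheduleRegistry`, `TryReserve`, `Finish`, `NumSchedules`, and the `write` callback (standing in for SynchronizedWrite) are all hypothetical names, and keyspaces are reduced to plain strings. The point it illustrates is that the existence check and the reservation happen under the same lock, while the write itself still runs with the lock released.

```cpp
#include <cstddef>
#include <mutex>
#include <set>
#include <string>

// Hypothetical stand-in for the state guarded by mutex_; names are illustrative.
class ScheduleRegistry {
 public:
  // Checks for an existing or in-flight schedule and reserves the keyspace,
  // all under one lock. Returns false if the request must be rejected.
  bool TryReserve(const std::string& keyspace) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (schedules_.count(keyspace) != 0 || pending_keyspaces_.count(keyspace) != 0) {
      return false;
    }
    pending_keyspaces_.insert(keyspace);
    return true;
  }

  // Drops the reservation; records the schedule only if the write succeeded.
  void Finish(const std::string& keyspace, bool write_succeeded) {
    std::lock_guard<std::mutex> lock(mutex_);
    pending_keyspaces_.erase(keyspace);
    if (write_succeeded) {
      schedules_.insert(keyspace);
    }
  }

  std::size_t NumSchedules() {
    std::lock_guard<std::mutex> lock(mutex_);
    return schedules_.size();
  }

 private:
  std::mutex mutex_;
  std::set<std::string> schedules_;
  std::set<std::string> pending_keyspaces_;
};

// `write` is a placeholder for the SynchronizedWrite call; it runs without
// the lock held, so the deadlock concern from the description still applies.
template <class WriteFn>
bool CreateSchedule(ScheduleRegistry& registry, const std::string& keyspace, WriteFn&& write) {
  if (!registry.TryReserve(keyspace)) {
    return false;  // a schedule exists or is already being written
  }
  const bool ok = write();
  registry.Finish(keyspace, ok);
  return ok;
}
```

With this shape, Thread-2 in the interleaving above would fail at `TryReserve` because Thread-1 has already reserved ysql.joe. A real implementation would also need to release the reservation on every error path (for example with a scope guard), which this sketch only does for a failing `write` return value.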

cc @sanketkedia

joe-maley, Aug 10 '22 18:08