# Table Metadata Lock by long populate calls
## Describe the bug

- **User 1** executes a long-running populate call (e.g. spikesorting on a long interval with limited compute resources)
  - When populate begins, datajoint starts a transaction with this query (see the sketch after this list): `self.query("START TRANSACTION WITH CONSISTENT SNAPSHOT")`
  - This creates a shared metadata lock on referenced tables to ensure consistency during the process, in particular on highly referenced tables like `AnalysisNwbfile`
  - These locks persist through the populate call
    - They do let reads (and apparently inserts) happen, since those don't change the table
    - They don't allow altering the table with new foreign key references
- **User 2** attempts to declare a compute table with a reference to `AnalysisNwbfile`
  - This attempts to get a metadata lock on `AnalysisNwbfile` to alter it with the new fk-ref
    - This lock is incompatible with User 1's shared lock
    - It stalls with a pending lock until the populate finishes executing
- **User 3+** attempt to access data via `fetch_nwb`
  - Requires a lock on `AnalysisNwbfile`
    - Would be compatible with populate's lock
    - However, the request is stuck in queue behind User 2's pending lock for the table declaration
  - The `fetch_nwb` call stalls until User 2's pending lock for the table declaration times out and exits the queue
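Below is a minimal sketch of this lock interaction at the SQL level, assuming two separate datajoint connections (credentials taken from `dj.config`) and using a no-op `ALTER TABLE` to stand in for the alter issued by User 2's table declaration:

```python
import datajoint as dj

# Two separate connections to mimic User 1 and User 2 (assumes database
# credentials are already set in dj.config).
conn1 = dj.Connection(
    dj.config["database.host"],
    dj.config["database.user"],
    dj.config["database.password"],
)
conn2 = dj.Connection(
    dj.config["database.host"],
    dj.config["database.user"],
    dj.config["database.password"],
)

# User 1: populate opens a consistent-snapshot transaction; every table it
# reads acquires a shared metadata lock held until COMMIT/ROLLBACK.
conn1.query("START TRANSACTION WITH CONSISTENT SNAPSHOT")
conn1.query("SELECT * FROM common_nwbfile.analysis_nwbfile LIMIT 1")

# User 2: DDL on the same table needs an exclusive metadata lock, so this
# call hangs until conn1 commits (or lock_wait_timeout expires). Any new
# SELECT from a third connection then queues *behind* this pending request,
# which is why fetch_nwb stalls for User 3+.
conn2.query("ALTER TABLE common_nwbfile.analysis_nwbfile COMMENT = ''")

conn1.query("COMMIT")  # releases the shared lock; conn2 can proceed
```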
## Symptoms

- Multiple users reporting inability to access data with `fetch_nwb`
- Stall/timeout when declaring a new table
## Diagnostic Tools

This error was most effectively debugged at the MySQL level. Useful queries are provided here for future reference.
### Watch active SQL processes

Requires SQL admin privileges to view other users' processes.

```python
from time import sleep
from IPython.display import clear_output
import datajoint as dj

# Poll the server and print all active connections every 2 seconds
for i in range(100):
    clear_output(wait=True)
    for process in dj.conn().query("SHOW FULL PROCESSLIST;").fetchall():
        print(process)
    sleep(2)
```
### Get information on processes holding/pending locks on a table

```python
import datajoint as dj


def execute_query_and_fetch_results(query):
    """
    Execute the given query and fetch the results.

    Parameters
    ----------
    query : str
        The SQL query to execute.

    Returns
    -------
    list of tuples
        The results of the query.
    """
    results = dj.conn().query(query).fetchall()
    return results


# Check if the performance schema is enabled
performance_schema_query = "SHOW VARIABLES LIKE 'performance_schema';"
performance_schema_status = execute_query_and_fetch_results(performance_schema_query)
print("Performance Schema Status:")
for status in performance_schema_status:
    print(status)

# Replace 'common_nwbfile' and 'analysis_nwbfile' with your actual database
# and table name
database_name = 'common_nwbfile'  # the datajoint schema name
table_name = 'analysis_nwbfile'  # the table name

# Query metadata locks along with owner and user information
metadata_locks_query = f"""
SELECT
    ml.OBJECT_TYPE,
    ml.OBJECT_SCHEMA,
    ml.OBJECT_NAME,
    ml.LOCK_TYPE,
    ml.LOCK_STATUS,
    ml.OWNER_THREAD_ID,
    t.PROCESSLIST_ID,
    t.PROCESSLIST_USER,
    t.PROCESSLIST_HOST,
    t.PROCESSLIST_DB,
    t.PROCESSLIST_COMMAND,
    t.PROCESSLIST_TIME,
    t.PROCESSLIST_STATE,
    t.PROCESSLIST_INFO
FROM
    performance_schema.metadata_locks AS ml
JOIN
    performance_schema.threads AS t
ON
    ml.OWNER_THREAD_ID = t.THREAD_ID
WHERE
    ml.OBJECT_SCHEMA = '{database_name}'
    AND ml.OBJECT_NAME = '{table_name}';
"""
metadata_locks = execute_query_and_fetch_results(metadata_locks_query)

# Display metadata locks with additional information
print("Metadata Locks with Owner and User Information:")
for lock in metadata_locks:
    print(lock)
```
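Once a blocking or pending connection is identified, an admin can clear the queue by killing it. A minimal sketch, assuming admin privileges; `12345` is a placeholder for the `PROCESSLIST_ID` returned by the query above:

```python
import datajoint as dj

# Placeholder: substitute the PROCESSLIST_ID of the blocking or pending
# connection found with the metadata-locks query above.
blocking_process_id = 12345

# KILL terminates that connection, releasing its (pending) metadata lock so
# requests queued behind it can proceed. Requires admin privileges.
dj.conn().query(f"KILL {blocking_process_id};")
```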
## Looking Forward

This issue arose due to a particularly long (~24+ hour) populate call on a single row. However, factors that increase the likelihood of such an event are:

- A growing number of foreign key references to `AnalysisNwbfile` in user custom tables (the current `AnalysisNwbfile` has ~180 children), increasing the odds of conflicting lock requests
- An increase in users running many parallelized populate calls, which increases total table lock time
## Potential Redresses

### 1. Precompute results

- Datajoint begins the shared lock of a transaction within `populate` prior to calling `make`
  - We already intercept the `populate` call in the mixin class to improve parallelization functionality (#1001)
- For tables with long-running make functions, we could move the analysis from the `make` function to a new `compute` function
  - In the intercepted populate call, "pre-compute" the results prior to beginning the transaction and calling `make`, to avoid long-running transactions
- Results could be stored in a `TempAnalysisNwbfile` table. `make()` would need to be passed a key to this table and would be responsible for moving the data into `AnalysisNwbfile` (a sketch follows this list)
- This solution would synergize with ideas for a "recompute" functionality for memory management (#917)
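A hypothetical sketch of this `compute`/`make` split; `TempAnalysisNwbfile`, `compute()`, and `run_expensive_analysis()` are illustrative names rather than existing spyglass APIs, and the intercepted `populate()` is assumed to call `compute()` before opening the transaction:

```python
import datajoint as dj

schema = dj.schema("example_pipeline")  # hypothetical schema name


@schema
class TempAnalysisNwbfile(dj.Manual):
    """Hypothetical staging table for results computed outside a transaction."""

    definition = """
    staged_file_name: varchar(255)
    """


@schema
class LongAnalysis(dj.Computed):
    definition = """
    analysis_id: int
    ---
    analysis_file_name: varchar(255)
    """

    def compute(self, key):
        # Hypothetical hook: the intercepted populate() would call this
        # *before* starting the transaction, so no shared metadata lock on
        # AnalysisNwbfile is held during the expensive analysis.
        staged_file = run_expensive_analysis(key)  # hypothetical helper
        TempAnalysisNwbfile.insert1({"staged_file_name": staged_file})
        key["staged_file_name"] = staged_file
        return key

    def make(self, key):
        # Runs inside datajoint's transaction: only quick bookkeeping, so
        # locks are held for seconds rather than hours.
        staged_file = key.pop("staged_file_name")
        # ... move the staged data into AnalysisNwbfile here ...
        self.insert1({**key, "analysis_file_name": staged_file})
        (TempAnalysisNwbfile & {"staged_file_name": staged_file}).delete()
```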
### 2. Database scheduling

- Socially define that long-running populate calls (or, alternatively, table declarations) should happen at certain times (e.g. evenings, weekends, etc.)
- Harder to manage with a growing number of users
### 3. Reduce foreign key references to a single table

- Similar to #963, the issue arises from the centrality of some tables in the network
- Could alleviate by having multiple `AnalysisNwbfile_{pipeline}` tables (a sketch follows this list)
  - Make `AnalysisNwbfile` a mixin class, and implement a table of this class for each pipeline
  - Since users don't interact with this `AnalysisNwbfile` as directly, this solution might avoid the concerns raised about similar ideas for `IntervalList` (#943)
  - Would require something like a merge table to group the different sources together for `AnalysisNwbfileKachery`
- Could consider other SQL partitioning strategies (e.g. sharding of high-centrality tables)
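A hypothetical sketch of the per-pipeline mixin idea; all class, schema, and attribute names below are illustrative rather than existing spyglass code:

```python
import datajoint as dj


class AnalysisNwbfileMixin:
    """Hypothetical mixin holding the shared AnalysisNwbfile definition
    (and, in practice, its helper methods), so that each pipeline gets
    its own physical table and therefore its own metadata locks."""

    definition = """
    analysis_file_name: varchar(64)
    ---
    analysis_file_description = "": varchar(2000)
    """


# One concrete table per pipeline: a long populate in spikesorting then
# only locks the spikesorting table, leaving other pipelines unaffected.
spike_schema = dj.schema("spikesorting_example")  # hypothetical schema


@spike_schema
class AnalysisNwbfileSpikesorting(AnalysisNwbfileMixin, dj.Manual):
    pass


lfp_schema = dj.schema("lfp_example")  # hypothetical schema


@lfp_schema
class AnalysisNwbfileLfp(AnalysisNwbfileMixin, dj.Manual):
    pass
```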