
Table Metadata Lock by long populate calls

samuelbray32 opened this issue on Jul 12, 2024 · 3 comments

Describe the bug

  • User 1 executes a long-running populate call (e.g. spikesorting on a long interval with limited compute resources)

  • When populate begins it starts a transaction with this query in datajoint: self.query("START TRANSACTION WITH CONSISTENT SNAPSHOT")

    • This creates a shared lock on referenced tables to ensure consistency during the process.
      • In particular, highly referenced tables like AnalysisNwbfile
    • These locks persist through the populate call
    • They do allow reads and, apparently, inserts, since those operations don't change the table definition
    • They don't allow altering the table with new foreign key references
  • User 2 attempts to declare a compute table with reference to AnalysisNwbfile

    • This attempts to get a metadata lock on AnalysisNwbfile to alter with new fk-ref
    • This lock is incompatible with User 1's shared lock
    • Stalls with pending lock until populate finishes executing
  • User 3+ attempts to access data via fetch_nwb

    • Requires lock on AnalysisNwbfile
    • Would be compatible with populate's lock
    • However, request is stuck in queue behind User 2's pending lock for table declaration
    • fetch_nwb call stalls until User 2's pending lock request for the table declaration times out and exits the queue (a reproduction sketch follows this list)
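
The queueing behavior can be reproduced against a scratch MySQL server. The sketch below mirrors the three users above; pymysql, the credentials, the table name, and the column names are assumptions for illustration, and each session must run in a separate process, since Session 2 blocks.

import pymysql  # any DB-API driver works; pymysql is an assumption

def connect():
    # placeholder credentials for a scratch server
    return pymysql.connect(host="localhost", user="root", password="...")

# --- Session 1 (User 1): an open transaction that has read the table holds
# a SHARED_READ metadata lock until it commits
conn1 = connect()
with conn1.cursor() as c:
    c.execute("START TRANSACTION WITH CONSISTENT SNAPSHOT")
    c.execute("SELECT * FROM common_nwbfile.analysis_nwbfile LIMIT 1")
# ... lock is held until conn1.commit(), i.e. for the whole "populate" ...

# --- Session 2 (User 2): declaring a table with an fk-ref needs a stronger
# metadata lock on the parent; this statement stalls in state
# "Waiting for table metadata lock" until Session 1 commits
conn2 = connect()
with conn2.cursor() as c:
    c.execute("""
        CREATE TABLE common_nwbfile.my_compute_table (
            analysis_file_name varchar(64) NOT NULL,
            PRIMARY KEY (analysis_file_name),
            FOREIGN KEY (analysis_file_name)
                REFERENCES common_nwbfile.analysis_nwbfile (analysis_file_name)
        )
    """)

# --- Session 3 (User 3): a plain read is compatible with Session 1's lock,
# but it queues *behind* Session 2's pending request, so it stalls too
conn3 = connect()
with conn3.cursor() as c:
    c.execute("SELECT * FROM common_nwbfile.analysis_nwbfile LIMIT 1")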

Symptoms

  • Multiple users reporting inability to access data with fetch_nwb
  • Stall/timeout when declaring a new table

Diagnostic Tools

This error was most effectively debugged at the MySQL level. Useful queries are provided here for future reference.

Watch active SQL processes (requires SQL admin privileges to view other users' processes):
from time import sleep
from IPython.display import clear_output
import datajoint as dj
for i in range(100):
    clear_output(wait=True)
    for process in dj.conn().query("SHOW FULL PROCESSLIST;").fetchall():
        print(process)
    sleep(2)
Get information on processes holding/pending locks on a table:
def execute_query_and_fetch_results(query):
    """
    Execute the given query and fetch the results.

    Parameters
    ----------
    query : str
        The SQL query to execute.

    Returns
    -------
    list of tuples
        The results of the query.
    """
    results = dj.conn().query(query).fetchall()
    return results

# Step 1: Check that the performance schema is enabled (required for the lock query below)
performance_schema_query = "SHOW VARIABLES LIKE 'performance_schema';"
performance_schema_status = execute_query_and_fetch_results(performance_schema_query)
print("Performance Schema Status:")
for status in performance_schema_status:
    print(status)

# Replace 'common_nwbfile' and 'analysis_nwbfile' with your actual database and table names
database_name = 'common_nwbfile'  # the datajoint schema name
table_name = 'analysis_nwbfile'   # the table name

# Step 2: Query Metadata Locks along with owner and user information
metadata_locks_query = f"""
SELECT 
    ml.OBJECT_TYPE, 
    ml.OBJECT_SCHEMA, 
    ml.OBJECT_NAME, 
    ml.LOCK_TYPE, 
    ml.LOCK_STATUS, 
    ml.OWNER_THREAD_ID,
    t.PROCESSLIST_ID, 
    t.PROCESSLIST_USER, 
    t.PROCESSLIST_HOST, 
    t.PROCESSLIST_DB, 
    t.PROCESSLIST_COMMAND, 
    t.PROCESSLIST_TIME, 
    t.PROCESSLIST_STATE, 
    t.PROCESSLIST_INFO
FROM 
    performance_schema.metadata_locks AS ml
JOIN 
    performance_schema.threads AS t 
ON 
    ml.OWNER_THREAD_ID = t.THREAD_ID
WHERE 
    ml.OBJECT_SCHEMA = '{database_name}' 
    AND ml.OBJECT_NAME = '{table_name}';
"""

metadata_locks = execute_query_and_fetch_results(metadata_locks_query)

# Display metadata locks with additional information
print("Metadata Locks with Owner and User Information:")
for lock in metadata_locks:
    print(lock)
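
Once the PROCESSLIST_ID of the request clogging the queue is known (typically the pending DDL statement with LOCK_STATUS = 'PENDING'), an admin can clear the queue by killing that connection. A minimal sketch; the specific ID is hypothetical, and the killed session's open work is rolled back:

# Requires admin privileges; use with care
blocking_process_id = 12345  # replace with the PROCESSLIST_ID found above
dj.conn().query(f"KILL {blocking_process_id};")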

Looking Forward

This issue arose due to a particularly long (~24+ hour) populate call on a single row. However, factors that increase the likelihood of such an event are:

  • Growing number of foreign key references to AnalysisNwbfile in user custom tables (the current `AnalysisNwbfile` has ~180 children), increasing the odds of conflicting lock requests
  • Increase in users running many parallelized populate calls, which increases total table lock time

Potential Redresses

1. Precompute results

  • Datajoint begins the shared lock of a transaction within populate prior to calling make.
    • We already intercept the populate call in the mixin class to improve parallelization functionality #1001
  • For tables with long-running make functions we could move the analysis from the make function to a new compute function
    • In the intercepted populate call, "pre-compute" the results prior to beginning the transaction and calling make, avoiding long-running transactions (a rough sketch follows this list)
    • Results could be stored in a TempAnalysisNwbfile table. make() would need to be passed a key to this table and would be responsible for moving the data into AnalysisNwbfile
    • This solution would synergize with ideas for a "recompute" functionality for memory management #917
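
A rough sketch of what this split could look like; compute(), the staging flow, and this populate override are hypothetical, not existing spyglass or datajoint API:

import datajoint as dj

class PrecomputeMixin:
    """Hypothetical mixin: run expensive work before populate's transaction."""

    def populate(self, *restrictions, **kwargs):
        # Do the hours-long analysis with no transaction (and hence no
        # metadata locks) open, staging results keyed by `key`
        # (e.g. in a TempAnalysisNwbfile table)
        for key in (self.key_source - self).fetch("KEY"):
            self.compute(key)
        # The standard populate then wraps make() in its transaction; since
        # make() only moves staged results into AnalysisNwbfile, the
        # transaction (and its shared metadata locks) lasts seconds, not hours
        super().populate(*restrictions, **kwargs)

    def compute(self, key):
        """Heavy analysis; writes a staged result for `key`."""
        raise NotImplementedError

    def make(self, key):
        """Move the staged result into AnalysisNwbfile and insert into self."""
        raise NotImplementedError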

2. Database scheduling

  • Socially define that long-running populate calls (or alternatively table declarations) should happen at certain times (e.g. evenings, weekends, etc.)
  • Harder to manage as the number of users grows

3. Reduce foreign key references to a single table

  • Similar to #963, issue arises from centrality of some tables in the network
  • Could alleviate by having multiple AnalysisNwbfile_{pipeline} tables
    • Make AnalysisNwbfile a mixin class.
    • Implement a table of this class for each pipeline (see the sketch after this list)
    • Since users don't interact with this AnalysisNwbfile as directly, the solution might avoid the concerns raised about similar ideas for IntervalList #943
    • Would require something like a merge table to group different sources together for AnalysisNwbfileKachery
  • Could consider other SQL partitioning strategies (e.g. sharding of high-centrality tables)
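
For illustration, one way the per-pipeline tables could be declared (all names and the pared-down definition here are hypothetical):

import datajoint as dj

schema = dj.schema("common_nwbfile")

class AnalysisNwbfileMixin:
    """Shared definition for per-pipeline analysis-file tables."""

    definition = """
    analysis_file_name: varchar(64)  # name of the analysis NWB file
    ---
    analysis_file_abs_path: varchar(255)
    """

@schema
class AnalysisNwbfileSpikesorting(AnalysisNwbfileMixin, dj.Manual):
    pass

@schema
class AnalysisNwbfileLFP(AnalysisNwbfileMixin, dj.Manual):
    pass

# Compute tables in each pipeline reference only their own file table, so a
# long transaction in one pipeline no longer queues DDL or reads on a table
# that every other pipeline depends on; a merge table (as noted above) would
# reunify the sources for AnalysisNwbfileKachery.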
