MySQL connector takes a long time to start when working with large tables
Bug Description
When running some internal testing against a huge MySQL table (100M+ records) we found that before the sync actually starts, the connector hangs for a long time (tens of minutes).
To Reproduce
Steps to reproduce the behavior:
- Create a reasonably large MySQL table - for example one with 100M records of text (see the sketch after this list)
- Try to run a sync against this MySQL instance + table
- Observe logs and check how much time passes between the sync start and actual first document ingested
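For reproduction, something like the sketch below can generate such a table with aiomysql; the host, credentials, database, and table name are placeholders, and the row count can be scaled down for quicker runs.

```python
# Sketch only: populate a large test table in batches. Host, credentials,
# database and table names are placeholders, not the connector's own config.
import asyncio
import aiomysql

ROWS = 100_000_000   # target size from the report; scale down for quicker runs
BATCH = 10_000

async def populate():
    conn = await aiomysql.connect(
        host="127.0.0.1", port=3306, user="root", password="changeme", db="test"
    )
    async with conn.cursor() as cursor:
        await cursor.execute(
            "CREATE TABLE IF NOT EXISTS big_table (id INT PRIMARY KEY, body TEXT)"
        )
        for offset in range(0, ROWS, BATCH):
            rows = [(offset + i, f"text payload {offset + i}") for i in range(BATCH)]
            await cursor.executemany(
                "INSERT INTO big_table (id, body) VALUES (%s, %s)", rows
            )
            await conn.commit()
    conn.close()

asyncio.run(populate())
```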
Expected behavior
First document is ingested reasonably fast.
Preliminary investigation showed that this method is taking too long to execute:
```python
@retryable(
    retries=RETRIES,
    interval=RETRY_INTERVAL,
    strategy=RetryStrategy.EXPONENTIAL_BACKOFF,
)
async def get_column_names_for_query(self, query):
    async with self.connection.cursor(aiomysql.cursors.SSCursor) as cursor:
        await cursor.execute(query)
        return [f"{column[0]}" for column in cursor.description]
```
The query is `SELECT * FROM {table}` - does it actually try to fetch all the data into memory?
Potentially it can also cause OOMs on cloud, which would prevent multiple syncs from happening.
Nice find. Yes, this seems like a bug we should fix sooner rather than later.
CC @danajuratoni for prioritization.
Sample dataset could be taken from here: https://microsoft.github.io/msmarco/Datasets.html
We could have a "LIMIT 1" in that query and fix this.
Yup, an easy fix that'll take longer to test - we don't have a solid dataset to verify the problem.
I think we could just trust that a LIMIT 1 would be sufficient, or anything that doesn't try to fetch the whole dataset in one go. As long as we can test that it doesn't break anything on a small dataset size, worst case scenario is we have to reopen this if we get a report that it's still not fixed.
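As a minimal sketch of that idea, reusing the decorator and names from the snippet above (the method name here is illustrative, and `LIMIT 0` is used since only the column metadata is needed):

```python
# Sketch only: fetch column names without streaming any table data, assuming
# the connector built the query itself from a table name.
# LIMIT 0 returns no rows but still populates cursor.description, so the
# server never has to materialise or stream the table's contents.
@retryable(
    retries=RETRIES,
    interval=RETRY_INTERVAL,
    strategy=RetryStrategy.EXPONENTIAL_BACKOFF,
)
async def get_column_names_for_table(self, table):
    async with self.connection.cursor(aiomysql.cursors.SSCursor) as cursor:
        await cursor.execute(f"SELECT * FROM {table} LIMIT 0")
        return [column[0] for column in cursor.description]
```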
Triaged this and it doesn't seem like a big problem.
There are 2 cases to consider:
- Without advanced sync rules - here we can modify the query, because we have control over it and can optimise it
- With advanced sync rules - here the query is supplied to us, so we cannot just append `LIMIT 0` or any other variation of it to get column names; there is always potential for breakage. We can try to modify the query and fall back to the original approach, but this might not be worth it.
Are we fetching the whole table?
No. The current code runs `SELECT * FROM {table}`, but it fetches results with a server-side cursor. In essence this means we fetch only one page of data. That can still be huge and slow, but not as slow as fetching the whole table. Additionally, running the query itself can be very inefficient, but given the limitation for Advanced Sync Rules it can be really hard to fix both.
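For illustration, streaming with a server-side cursor looks roughly like this: rows come back in batches via `fetchmany` rather than being materialised all at once (the helper name and batch size are made up for the example):

```python
# Illustration of server-side cursor paging with aiomysql: SSCursor streams the
# result set, so each fetchmany() call pulls one batch instead of the whole table.
import aiomysql

async def stream_rows(connection, query, batch_size=1000):
    async with connection.cursor(aiomysql.cursors.SSCursor) as cursor:
        await cursor.execute(query)
        while True:
            rows = await cursor.fetchmany(batch_size)
            if not rows:
                break
            for row in rows:
                yield row
```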
Conclusion
We can:
- Modify the sync without advanced sync rules to run the MySQLQueries.columns method, and change that method - the current implementation relies on the user used for syncs having access to INFORMATION_SCHEMA, which is a long shot
- Modify the sync with advanced sync rules to try to run the supplied query with `LIMIT 0` (or do a regex replacement of its LIMIT clause) and try to run it; if that doesn't work, fall back to the original method
All in all it doesn't look too bad to implement both - probably a couple of hours of work; a rough sketch follows below.
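The first item is essentially the `LIMIT 0` sketch earlier in the thread; for the second, the rewrite-and-fall-back path could look roughly like this (the helper name, regex, and exact behaviour are assumptions, not the final implementation):

```python
# Sketch only: the helper name, regex, and fallback behaviour are assumptions.
import re
import aiomysql

async def column_names_for_custom_query(connection, query):
    # The query is user-supplied, so try a LIMIT 0 variant first and fall back
    # to running the original query unchanged if the rewrite breaks it.
    stripped = query.rstrip().rstrip(";")
    if re.search(r"\bLIMIT\b", stripped, flags=re.IGNORECASE):
        # Try to replace an existing trailing LIMIT clause with LIMIT 0.
        limited = re.sub(
            r"\bLIMIT\s+\d+(\s*,\s*\d+)?\s*$", "LIMIT 0", stripped, flags=re.IGNORECASE
        )
    else:
        limited = f"{stripped} LIMIT 0"
    async with connection.cursor(aiomysql.cursors.SSCursor) as cursor:
        try:
            await cursor.execute(limited)
        except Exception:
            # Fall back to the current approach: run the original query and rely
            # on the server-side cursor to avoid pulling everything at once.
            await cursor.execute(query)
        return [column[0] for column in cursor.description]
```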