
MySQL connector takes a long time to start when working with large tables

Open artem-shelkovnikov opened this issue 2 years ago • 6 comments

Bug Description

When running some internal testing against a huge MySQL table (100M+ records), we found that the connector hangs for a long time (tens of minutes) before the sync actually starts.

To Reproduce

Steps to reproduce the behavior:

  1. Create a reasonably large MySQL table - for example, one with 100M records containing text (see the sketch after this list)
  2. Try to run a sync against this MySQL instance + table
  3. Observe logs and check how much time passes between the sync start and actual first document ingested
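
For step 1, a table of that size could be generated with something like the sketch below. The connection details and the documents table are hypothetical, and the batch size is kept small enough to stay under typical max_allowed_packet limits:

    import asyncio

    import aiomysql

    async def main():
        conn = await aiomysql.connect(
            host="127.0.0.1", user="root", password="changeme", db="test"
        )
        async with conn.cursor() as cursor:
            await cursor.execute(
                "CREATE TABLE IF NOT EXISTS documents ("
                "id INT AUTO_INCREMENT PRIMARY KEY, body TEXT)"
            )
            row = ("some searchable text " * 10,)
            # 10,000 batches x 10,000 rows = 100M rows total
            for _ in range(10_000):
                await cursor.executemany(
                    "INSERT INTO documents (body) VALUES (%s)", [row] * 10_000
                )
                await conn.commit()
        conn.close()

    asyncio.run(main())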

Expected behavior

First document is ingested reasonably fast.

artem-shelkovnikov avatar Jun 26 '23 15:06 artem-shelkovnikov

Preliminary investigation showed that this method takes too long to execute:

    @retryable(
        retries=RETRIES,
        interval=RETRY_INTERVAL,
        strategy=RetryStrategy.EXPONENTIAL_BACKOFF,
    )
    async def get_column_names_for_query(self, query):
        async with self.connection.cursor(aiomysql.cursors.SSCursor) as cursor:
            await cursor.execute(query)

            return [f"{column[0]}" for column in cursor.description]

The query is SELECT * FROM {table} - it probably tries to fetch all the data into memory?
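
A minimal way to confirm where the time goes is to run the same cursor dance in isolation and time it (a sketch; connection details and the documents table are hypothetical):

    import asyncio
    import time

    import aiomysql

    async def main():
        conn = await aiomysql.connect(
            host="127.0.0.1", user="root", password="changeme", db="test"
        )
        start = time.monotonic()
        async with conn.cursor(aiomysql.cursors.SSCursor) as cursor:
            await cursor.execute("SELECT * FROM documents")
            columns = [column[0] for column in cursor.description]
        # Minutes rather than milliseconds here would point at the column-name
        # lookup as the source of the delay before the first ingested document.
        print(f"{time.monotonic() - start:.1f}s to get columns: {columns}")
        conn.close()

    asyncio.run(main())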

Potentially this can also cause OOMs in the cloud, preventing multiple syncs from happening.

artem-shelkovnikov avatar Jun 26 '23 15:06 artem-shelkovnikov

Nice find. Yes, this seems like a bug we should fix sooner than later.

CC @danajuratoni for prioritization.

seanstory avatar Jun 26 '23 16:06 seanstory

Sample dataset could be taken from here: https://microsoft.github.io/msmarco/Datasets.html

artem-shelkovnikov avatar Jun 26 '23 23:06 artem-shelkovnikov

We could add a "LIMIT 1" to that query and fix this.
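
For the default query that could be as simple as the sketch below (untested against the connector codebase, and it assumes the query has no existing LIMIT clause):

    @retryable(
        retries=RETRIES,
        interval=RETRY_INTERVAL,
        strategy=RetryStrategy.EXPONENTIAL_BACKOFF,
    )
    async def get_column_names_for_query(self, query):
        async with self.connection.cursor(aiomysql.cursors.SSCursor) as cursor:
            # Cap the result set: cursor.description is populated from the
            # result metadata even when only one row (or none) comes back.
            await cursor.execute(f"{query} LIMIT 1")

            return [f"{column[0]}" for column in cursor.description]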

erikcurrin-elastic avatar Mar 18 '25 16:03 erikcurrin-elastic

Yup, it's an easy fix that will take longer to test - we don't have a solid dataset to verify the problem with.

artem-shelkovnikov avatar Mar 18 '25 17:03 artem-shelkovnikov

I think we could just trust that a LIMIT 1 would be sufficient, or anything that doesn't try to fetch the whole dataset in one go. As long as we can test that it doesn't break anything on a small dataset size, worst case scenario is we have to reopen this if we get a report that it's still not fixed.

seanstory avatar Mar 28 '25 19:03 seanstory

Triaged this and it doesn't seem like a big problem.

There are two cases to consider:

  • Without advanced sync rules - here we can modify and optimise the query, because we have control over it
  • With advanced sync rules - here the query is supplied to us, so we cannot run LIMIT 0 or any other variation of it to get the column names; there is always potential for breakage. We could try to modify the query and fall back to the original approach, but this might not be worth it.

Are we fetching the whole table?

No. The current code runs SELECT * FROM {table}, but it fetches results with a server-side cursor. In essence this means we fetch only one page of data at a time. That can still be huge and slow, but not as slow as fetching the whole table. Additionally, running the query itself can be very inefficient, but given the limitation around advanced sync rules it can be really hard to fix both cases.
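
For reference, the difference between the two cursor types in aiomysql looks roughly like this (a sketch assuming an open connection named conn and a hypothetical documents table):

    # Buffered cursor (the default): execute() reads the whole result set
    # into client memory before returning.
    async with conn.cursor() as cursor:
        await cursor.execute("SELECT * FROM documents")

    # Unbuffered server-side cursor: execute() returns without materialising
    # the result; rows are streamed as the client fetches them.
    async with conn.cursor(aiomysql.cursors.SSCursor) as cursor:
        await cursor.execute("SELECT * FROM documents")
        first_page = await cursor.fetchmany(1000)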

Conclusion

We can:

  1. Modify the sync without advanced sync rules to use the MySQLQueries.columns method, and change that method - the current implementation relies on the user running the syncs having access to INFORMATION_SCHEMA, which is a long shot
  2. Modify the sync with advanced sync rules to try running the supplied query with LIMIT 0 (or with a regex replacement of any existing LIMIT clause) and, if that fails, fall back to the original method (sketched below)

All in all it doesn't look too bad to implement both - probably a couple of hours of work.
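
A sketch of what option 2 could look like (untested; the column_probe alias and the fallback shape are illustrative, not the actual implementation):

    async def get_column_names_for_query(self, query):
        try:
            # MySQL requires an alias for the derived table. LIMIT 0 makes the
            # server return only result metadata, which still populates
            # cursor.description.
            wrapped = f"SELECT * FROM ({query}) AS column_probe LIMIT 0"
            async with self.connection.cursor(aiomysql.cursors.SSCursor) as cursor:
                await cursor.execute(wrapped)
                return [f"{column[0]}" for column in cursor.description]
        except aiomysql.Error:
            # Advanced sync rule queries may not survive being wrapped
            # (trailing semicolons, ORDER BY quirks, etc.), so fall back to
            # running the query as-is.
            async with self.connection.cursor(aiomysql.cursors.SSCursor) as cursor:
                await cursor.execute(query)
                return [f"{column[0]}" for column in cursor.description]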

artem-shelkovnikov avatar Jun 24 '25 13:06 artem-shelkovnikov