sqlmesh icon indicating copy to clipboard operation
sqlmesh copied to clipboard

Use PandasCursor for Athena dataframes in fetchdf magic

Open williebsweet opened this issue 6 months ago • 4 comments

Only applies to the %%fetchdf magic command (although it could be expanded).

Replaces the generic pandas.read_sql_query() with PandasCursor for improved I/O performance.


With pandas.read_sql_query():

  1. Query executes in Athena and results are written to S3 as CSV files
  2. PyAthena connection fetches results row-by-row through the standard DB-API cursor interface
  3. Data comes through the AWS API using fetchall() or similar methods
  4. pandas constructs DataFrame from the row-by-row data

The key bottleneck is step 2-3: The data has to go through the PyAthena cursor's fetchall() method, which retrieves results row-by-row via AWS API calls.

With PandasCursor:

  1. Query executes in Athena and results are written to S3 as CSV files
  2. PandasCursor directly downloads the CSV file from S3
  3. pandas loads the CSV directly using optimized CSV parsing

The key advantage is step 2: Instead of going through AWS APIs row-by-row, PandasCursor downloads the entire result CSV file directly from S3 and then uses pandas' highly optimized CSV reading capabilities.


Anecdotally, I had queries that were taking 30+ min to execute that are now taking ~2 min.

williebsweet avatar Jul 30 '25 20:07 williebsweet

Oh, nice, Athena has some kind of "native" DataFrame provided by the library?

In that case, _fetch_native_df() should be implemented on AthenaEngineAdapter to return a pd.DataFrame that is backed by PandasCursor so that all fetchdf() calls can use it.

If it relies on an optional extra being available, and that extra is not available, then it can fall back to the existing/current logic from PandasNativeFetchDFSupportMixin

erindru avatar Jul 30 '25 23:07 erindru

@erindru Yeah - my testing of moving it to engine_adapter/athena.py got derailed by the Athena issue mentioned here.

I'm having issues with getting pre-commit to run. The docs, make file, CI commands, and environment don't seem to be "in sync". Is uv the default now?

williebsweet avatar Jul 31 '25 13:07 williebsweet

Nope, uv has never been the default, it just happens to be what that user was using.

Generally what we do is something like:

$ python -m venv sqlmesh-env
$ . ./sqlmesh-env/bin/activate
(sqlmesh-env) $ make install-dev 
(sqlmesh-env) $ make install-pre-commit #only have to do this once
# make some changes
(sqlmesh-env) $ git commit ....

The pre-commit hooks should run on git commit and abort the commit if they fail. You should also be able to run make style for something similar. The venv that you ran make install-dev needs to be active for all of these.

Is there a specific error you're getting?

erindru avatar Jul 31 '25 20:07 erindru

@erindru I was able to resolve my issue by not using uv.

But weirdly, make style and make py-style pass locally, but the latter failed in the last CI run.

williebsweet avatar Aug 01 '25 18:08 williebsweet