
datasette --get should efficiently handle streaming CSV

Open simonw opened this issue 3 years ago • 2 comments

It would be great if you could use datasette --get to run queries that return streaming CSV data without running out of RAM.

The current implementation looks like it loads the entire result into memory first: https://github.com/simonw/datasette/blob/f78ebdc04537a6102316d6dbbf6c887565806078/datasette/cli.py#L546-L552

simonw avatar Jun 03 '21 04:06 simonw

It's using TestClient at the moment, which is a wrapper around httpx (as of ) that uses the @async_to_sync decorator to hide its async nature.

https://github.com/simonw/datasette/blob/f78ebdc04537a6102316d6dbbf6c887565806078/datasette/utils/testing.py#L102-L156
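For context, the async-to-sync wrapping pattern that TestClient relies on looks roughly like this (a minimal stand-in, not Datasette's actual decorator): the wrapped coroutine runs to completion before the synchronous caller gets anything back, which is exactly why streaming doesn't fit this model.

```python
import asyncio
import functools

def async_to_sync(fn):
    # Minimal stand-in for the decorator pattern described above:
    # run the coroutine to completion on its own event loop and
    # hand the finished result back to the synchronous caller.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return asyncio.run(fn(*args, **kwargs))
    return wrapper

@async_to_sync
async def get(path):
    # Pretend to do async I/O; a real client would await the full response here.
    await asyncio.sleep(0)
    return f"response for {path}"

result = get("/covid.csv")
print(result)
```

Because the sync wrapper only returns once the coroutine has finished, there is no point at which partial output could be flushed to the console.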

Maybe the fix here is to switch the --get implementation to using httpx directly with https://www.python-httpx.org/async/#streaming-responses

simonw avatar Jun 03 '21 04:06 simonw

Built a research prototype:

diff --git a/datasette/app.py b/datasette/app.py
index 5c8101a..5cd3e63 100644
--- a/datasette/app.py
+++ b/datasette/app.py
@@ -1,6 +1,7 @@
 import asyncio
 import asgi_csrf
 import collections
+import contextlib
 import datetime
 import functools
 import glob
@@ -1490,3 +1491,11 @@ class DatasetteClient:
             return await client.request(
                 method, self._fix(path, avoid_path_rewrites), **kwargs
             )
+
+    @contextlib.asynccontextmanager
+    async def stream(self, method, path, **kwargs):
+        async with httpx.AsyncClient(app=self.app) as client:
+            print("async with as client")
+            async with client.stream(method, self._fix(path), **kwargs) as response:
+                print("async with client.stream about to yield response")
+                yield response
diff --git a/datasette/cli.py b/datasette/cli.py
index 3c6e1b2..3025ead 100644
--- a/datasette/cli.py
+++ b/datasette/cli.py
@@ -585,11 +585,19 @@ def serve(
     asyncio.get_event_loop().run_until_complete(check_databases(ds))
 
     if get:
-        client = TestClient(ds)
-        response = client.get(get)
-        click.echo(response.text)
-        exit_code = 0 if response.status == 200 else 1
-        sys.exit(exit_code)
+
+        async def _run_get():
+            print("_run_get")
+            async with ds.client.stream("GET", get) as response:
+                print("Got response:", response)
+                async for chunk in response.aiter_bytes(chunk_size=1024):
+                    print("   chunk")
+                    sys.stdout.buffer.write(chunk)
+                    sys.stdout.buffer.flush()
+                exit_code = 0 if response.status_code == 200 else 1
+                sys.exit(exit_code)
+
+        asyncio.get_event_loop().run_until_complete(_run_get())
         return
 
     # Start the server

But for some reason it didn't appear to stream out the response - it would print this out:

% datasette covid.db --get '/covid/ny_times_us_counties.csv?_size=10&_stream=on'
_run_get
async with as client

And then hang. I would expect it to start printing out chunks of CSV data here, but instead it looks like it waited for everything to be generated before returning anything to the console.

No idea why. I dropped this for the moment.
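One plausible explanation (not verified here) is that httpx's in-process ASGI transport collects the whole response body before handing it back, which would match the hang-then-dump behaviour above. A workaround worth exploring is to skip httpx and drive the ASGI app directly, where each http.response.body message can be written out as soon as send() delivers it. A minimal sketch, with a toy ASGI app standing in for Datasette's:

```python
import asyncio

# Toy ASGI app that streams its body in several chunks - a stand-in
# for Datasette's actual ASGI application.
async def app(scope, receive, send):
    assert scope["type"] == "http"
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/csv")]})
    for row in (b"id,name\n", b"1,Cleo\n", b"2,Pancakes\n"):
        await send({"type": "http.response.body", "body": row, "more_body": True})
    await send({"type": "http.response.body", "body": b"", "more_body": False})

# Drive the app directly: each chunk is handled the moment send()
# delivers it, with no client-side buffering in between.
async def stream_get(app, path):
    chunks = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        if message["type"] == "http.response.body":
            # In --get this would be sys.stdout.buffer.write() + flush().
            chunks.append(message.get("body", b""))

    scope = {"type": "http", "http_version": "1.1", "method": "GET",
             "scheme": "http", "path": path, "query_string": b"", "headers": []}
    await app(scope, receive, send)
    return b"".join(chunks)

body = asyncio.run(stream_get(app, "/data.csv"))
print(body.decode())
```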

simonw avatar Mar 20 '22 22:03 simonw