datasette --get should efficiently handle streaming CSV
It would be great if you could use datasette --get to run queries that return streaming CSV data without running out of RAM.
Current implementation looks like it loads the entire result into memory first: https://github.com/simonw/datasette/blob/f78ebdc04537a6102316d6dbbf6c887565806078/datasette/cli.py#L546-L552
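For contrast, a streaming approach would emit the CSV a row at a time instead of materializing the whole result as one string. A generic sketch of that pattern (not Datasette's code; the row data is made up for illustration):

```python
import csv
import io

def stream_csv(rows):
    """Yield CSV output one row at a time instead of building one big string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)
        yield buf.getvalue()
        # Reset the buffer so each yielded chunk contains only one row
        buf.seek(0)
        buf.truncate(0)

chunks = list(stream_csv([("id", "name"), (1, "alpha"), (2, "beta")]))
print("".join(chunks), end="")
```

Because each chunk is written out as soon as its row is produced, memory use stays constant no matter how many rows the query returns.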
It's using TestClient at the moment, which is a wrapper around httpx (as of ) that uses the @async_to_sync decorator to hide the async nature.
https://github.com/simonw/datasette/blob/f78ebdc04537a6102316d6dbbf6c887565806078/datasette/utils/testing.py#L102-L156
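For background, that decorator pattern works roughly like this: run the coroutine to completion on an event loop and hand back its result synchronously. A simplified, hypothetical stand-in (not the implementation in datasette/utils/testing.py):

```python
import asyncio
import functools

def async_to_sync(coro_fn):
    """Wrap an async function so it can be called from synchronous code
    by running it to completion on an event loop."""
    @functools.wraps(coro_fn)
    def wrapper(*args, **kwargs):
        return asyncio.run(coro_fn(*args, **kwargs))
    return wrapper

@async_to_sync
async def get_text(path):
    await asyncio.sleep(0)  # stand-in for the real async ASGI round-trip
    return f"response for {path}"

print(get_text("/covid.csv"))
```

The catch is that the entire coroutine must finish before the wrapper returns, so any response it fetches is fully buffered by the time the caller sees it.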
Maybe the fix here is to switch the --get implementation to using httpx directly, with https://www.python-httpx.org/async/#streaming-responses
Built a research prototype:
diff --git a/datasette/app.py b/datasette/app.py
index 5c8101a..5cd3e63 100644
--- a/datasette/app.py
+++ b/datasette/app.py
@@ -1,6 +1,7 @@
import asyncio
import asgi_csrf
import collections
+import contextlib
import datetime
import functools
import glob
@@ -1490,3 +1491,11 @@ class DatasetteClient:
return await client.request(
method, self._fix(path, avoid_path_rewrites), **kwargs
)
+
+ @contextlib.asynccontextmanager
+ async def stream(self, method, path, **kwargs):
+ async with httpx.AsyncClient(app=self.app) as client:
+ print("async with as client")
+ async with client.stream(method, self._fix(path), **kwargs) as response:
+ print("async with client.stream about to yield response")
+ yield response
diff --git a/datasette/cli.py b/datasette/cli.py
index 3c6e1b2..3025ead 100644
--- a/datasette/cli.py
+++ b/datasette/cli.py
@@ -585,11 +585,19 @@ def serve(
asyncio.get_event_loop().run_until_complete(check_databases(ds))
if get:
- client = TestClient(ds)
- response = client.get(get)
- click.echo(response.text)
- exit_code = 0 if response.status == 200 else 1
- sys.exit(exit_code)
+
+ async def _run_get():
+ print("_run_get")
+ async with ds.client.stream("GET", get) as response:
+ print("Got response:", response)
+ async for chunk in response.aiter_bytes(chunk_size=1024):
+ print(" chunk")
+ sys.stdout.buffer.write(chunk)
+ sys.stdout.buffer.flush()
+ exit_code = 0 if response.status_code == 200 else 1
+ sys.exit(exit_code)
+
+ asyncio.get_event_loop().run_until_complete(_run_get())
return
# Start the server
But for some reason it didn't appear to stream out the response - it would print this out:
% datasette covid.db --get '/covid/ny_times_us_counties.csv?_size=10&_stream=on'
_run_get
async with as client
And then hang. I would expect it to start printing out chunks of CSV data here, but instead it looks like it waited for everything to be generated before returning anything to the console.
No idea why. I dropped this for the moment.