Proposal: datasette query
I started sketching out a plugin to add a datasette query subcommand to export data from the command line. This is based on discussions in #1356 and #1605. Before I get too far down this rabbit hole, I figure it's worth getting some feedback here (unless this should happen in Discussions). Here's what I'm thinking:
At its most basic, it will write the results of a query to STDOUT.
datasette query -d data.db 'select * from data' > results.json
This isn't much of an improvement over using sqlite-utils. To make better use of Datasette and its ecosystem, run datasette query with a canned query defined in a metadata.yml file.
For example, using the metadata file from alltheplaces-datasette:
cd alltheplaces-datasette
datasette query -d alltheplaces.db -m metadata.yml count_by_spider
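For reference, count_by_spider would be a canned query defined in metadata.yml along these lines (the exact SQL and table name here are guesses, not copied from alltheplaces-datasette):

databases:
  alltheplaces:
    queries:
      count_by_spider:
        sql: select spider, count(*) as count from places group by spider order by count desc
        title: Count by spider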
That query would be more useful as CSV, and the command can auto-discover metadata and databases in the current directory:
cd alltheplaces-datasette
datasette query count_by_spider -f csv
In this case, count_by_spider is a canned query defined on the alltheplaces database. If the same query is defined on multiple databases, or it's otherwise unclear which database the query should use, pass the -d or --database option:
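datasette query count_by_spider -d alltheplaces -f csv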
If a query takes parameters, I can pass them in at runtime using the --param or -p option:
datasette query -d data.db -p value something 'select * from neighborhoods where some_column = :value'
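To make the parameter plumbing concrete, here's a minimal sketch of how repeated -p name value pairs could be collected with click and bound to :name placeholders. The standalone command and the direct sqlite3 connection are just assumptions for illustration, not how the plugin would actually hook into Datasette:

import json
import sqlite3

import click

@click.command()
@click.argument("sql")
@click.option("-d", "--database", "db_path", required=True)
@click.option("-p", "--param", "params", nargs=2, multiple=True, help="e.g. -p value something")
def query(sql, db_path, params):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    # click collects repeated -p name value pairs as a tuple of 2-tuples;
    # dict() turns that into the mapping sqlite3 expects for :name placeholders
    rows = conn.execute(sql, dict(params))
    click.echo(json.dumps([dict(row) for row in rows]))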
I'm very interested in feedback on this, including whether it should be a plugin or in Datasette core. (I don't have a strong opinion about this, but I'm prototyping it as a plugin to start.)
If we do this I'm keen to have it be more than just an alternative to the existing sqlite-utils command - especially since if I add sqlite-utils as a dependency of Datasette in the future that command will be installed as part of pip install datasette anyway.
My best thought on how to differentiate them so far is plugins: if Datasette plugins that provide alternative outputs - like .geojson and .yml and suchlike - also work for the datasette query command that would make a lot of sense to me.
One way that could work: a --fmt geojson option to this command which uses the plugin that was registered for the specified extension.
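One rough sketch of that lookup, assuming the command ends up holding a mapping of extension to renderer built from the register_output_renderer hook (the mapping and the error handling here are assumptions, not existing Datasette internals):

import click

def resolve_renderer(renderers, fmt):
    # renderers: {"geojson": {"render": ..., "can_render": ...}, ...},
    # assembled by calling register_output_renderer on each plugin
    try:
        return renderers[fmt]
    except KeyError:
        raise click.ClickException(f"No renderer registered for .{fmt}")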
> My best thought on how to differentiate them so far is plugins: if Datasette plugins that provide alternative outputs - like .geojson and .yml and suchlike - also work for the datasette query command that would make a lot of sense to me.
That's my thinking, too. It's really the thing I've been wanting since writing datasette-geojson, since I'm always exporting with datasette --get. The workflow I'm always looking for is something like this:
cd alltheplaces-datasette
datasette query dunkin_in_suffolk -f geojson -o dunkin_in_suffolk.geojson
I think this probably needs either a new plugin hook separate from register_output_renderer or a way to use that without going through the HTTP stack. Or maybe a render mode that writes to a stream instead of a response. Maybe there's a new key in the dictionary that register_output_renderer returns that handles CLI exports.
A render mode for that plugin hook that writes to a stream is exactly what I have in mind:
- #1062
So maybe register_output_renderer returns something like this:
@hookimpl
def register_output_renderer(datasette):
    return {
        "extension": "geojson",
        "render": render_geojson,
        "stream": stream_geojson,
        "can_render": can_render_geojson,
    }
And stream gets an iterator instead of a list of rows, so it can handle large queries efficiently. Maybe it also gets passed a destination stream, or maybe it returns an iterator; I'm not sure which makes more sense. Either way, that might cover both CLI exports and streaming responses.
And just to think this through a little more, here's what stream_geojson might look like:
async def stream_geojson(datasette, columns, rows, database, stream):
    db = datasette.get_database(database)
    for row in rows:
        feature = await row_to_geojson(row, db)
        stream.write(feature + "\n")  # just assuming newline mode for now
Alternatively, that could be an async generator, like this:
async def stream_geojson(datasette, columns, rows, database):
    db = datasette.get_database(database)
    for row in rows:
        feature = await row_to_geojson(row, db)
        yield feature
Not sure which makes more sense, but I think this pattern would open up a lot of possibilities. If you had your stream_indented_json function, you could delegate to stream_indented_json(rows, 2) and be on your way. (One wrinkle: yield from isn't allowed inside async generators, so the delegation has to be an explicit loop, as in the sketch below.)
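Assuming stream_indented_json is a regular (sync) generator that yields chunks of serialized JSON (that signature is a guess from your writeup), a hypothetical JSON streamer could delegate like this:

async def stream_json(datasette, columns, rows, database):
    # yield from isn't valid inside an async generator, so loop explicitly
    for chunk in stream_indented_json(rows, 2):
        yield chunk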
Was looking through old issues and realized a bunch of this got discussed in #1101 (including by me!), so sorry to rehash all this. Happy to help with whatever piece of it I can. Would be very excited to be able to use format plugins with exports.