Prototype for Datasette on PostgreSQL
I thought this would never happen, but now that I'm deep in the weeds of running SQLite in production for Datasette Cloud I'm starting to reconsider my policy of only supporting SQLite.
Some of the factors making me think PostgreSQL support could be worth the effort:
- Serverless. I'm getting increasingly excited about writable-database use-cases for Datasette. If it could talk to PostgreSQL then users could easily deploy it to Heroku or other serverless providers and have it talk to a managed RDS-style PostgreSQL database.
- Existing databases. Plenty of organizations have PostgreSQL databases. They can export to SQLite using db-to-sqlite but that's a pretty big barrier to getting started - being able to run `datasette postgresql://connection-string` and start trying it out would be a massively better experience.
- Data size. I keep running into use-cases where I want to run Datasette against many GBs of data. SQLite can do this but PostgreSQL is much more optimized for large data, especially given the existence of tools like Citus.
- Marketing. Convincing people to trust their data to SQLite is potentially a big barrier to adoption. Even if I've convinced myself it's trustworthy I still have to convince everyone else.
- It might not be that hard? If this required a ground-up rewrite it wouldn't be worth the effort, but I have a hunch that it may not be too hard - most of the SQL in Datasette should work on both databases since it's almost all portable SELECT statements. If Datasette did DML this would be a lot harder, but it doesn't.
- Plugins! This feels like a natural surface for a plugin - at which point people could add MySQL support and suchlike in the future.
The above reasons feel strong enough to justify a prototype.
I'm excited about https://github.com/MagicStack/asyncpg for this - it's a true async PostgreSQL library (my SQLite queries run in a threadpool right now) with extremely impressive performance benchmarks, from the team behind uvloop.
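For reference, a minimal asyncpg round trip looks something like this (the connection string and query are just illustrative):

```python
import asyncio
import asyncpg

async def main():
    # Illustrative DSN - point it at any PostgreSQL database
    conn = await asyncpg.connect("postgresql://localhost/simonwillisonblog")
    try:
        # fetch() returns a list of Record objects that can be indexed by column name
        rows = await conn.fetch(
            "SELECT table_name FROM information_schema.tables WHERE table_schema = $1",
            "public",
        )
        print([row["table_name"] for row in rows])
    finally:
        await conn.close()

asyncio.run(main())
```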
A couple of things I'd like to support:
- The same datasette instance can have both PostgreSQL and SQLite databases attached to it, and both types will be listed on the homepage.
- The full test suite runs against both SQLite and PostgreSQL, with as few changes as possible (maybe a few `skipIf()` decorators for features not available on both). Probably easiest to do this with some kind of flag - e.g. `pytest --database=sqlite` vs. `pytest --database=postgresql`
I can implement that with this in conftest.py:
```python
def pytest_addoption(parser):
    parser.addoption("--database", action="store", default="sqlite")
```
See https://stackoverflow.com/questions/40880259/how-to-pass-arguments-in-pytest-by-command-line
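From there, a fixture can hand the flag to individual tests, which can skip themselves on backends they don't support yet - a rough sketch (fixture and test names are illustrative, not Datasette's actual test suite):

```python
import pytest

@pytest.fixture(scope="session")
def database_backend(request):
    # "sqlite" (the default) or "postgresql", from the --database flag above
    return request.config.getoption("--database")

def test_sqlite_only_feature(database_backend):
    if database_backend != "sqlite":
        pytest.skip("Not implemented for PostgreSQL yet")
    ...
```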
The biggest difference between the two will be around introspection. I searched the codebase for potential introspection queries, defined as `select ... from sqlite_master` and `pragma ...` statements.
- 7 results for `sqlite_master` in `database.py`
- 3 results in `utils/__init__.py`
- (Ignoring the 4 in `inspect.py`)
- 1 result for `pragma table_info` in `app.py`
- 2 in `utils/__init__.py`
- 2 results for `pragma foreign_key_list` in `utils/__init__.py`
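`pragma table_info` has a close equivalent in `information_schema.columns` (explored below). For `pragma foreign_key_list`, something along these lines should work - an untested sketch using asyncpg, joining the `information_schema` constraint views:

```python
async def foreign_keys(conn, table):
    # Untested sketch: rough equivalent of "pragma foreign_key_list(table)" -
    # one row per foreign key column, with the table/column it references.
    return await conn.fetch(
        """
        SELECT
            kcu.column_name,
            ccu.table_name AS other_table,
            ccu.column_name AS other_column
        FROM information_schema.table_constraints AS tc
        JOIN information_schema.key_column_usage AS kcu
            ON tc.constraint_name = kcu.constraint_name
        JOIN information_schema.constraint_column_usage AS ccu
            ON ccu.constraint_name = tc.constraint_name
        WHERE tc.constraint_type = 'FOREIGN KEY'
            AND tc.table_name = $1
        """,
        table,
    )
```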
First page to get working: the database view page, which shows a list of tables.
In the code this is entirely implemented with calls to the Database class, which is a good starting point - I can drop in a new version of that class that talks to PostgreSQL instead:
https://github.com/simonw/datasette/blob/0091dfe3e5a3db94af8881038d3f1b8312bb857d/datasette/views/database.py#L25-L66
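A very rough sketch of the shape that class could take - the method names, pool sizing and queries here are illustrative, not Datasette's actual Database interface:

```python
import asyncpg

class PostgresDatabase:
    """Sketch of a PostgreSQL-backed stand-in for the Database class -
    just enough for the database view page (tables and row counts)."""

    def __init__(self, name, dsn):
        self.name = name
        self.dsn = dsn
        self.pool = None

    async def connect(self):
        # Pool sizes are arbitrary illustrative choices
        self.pool = await asyncpg.create_pool(self.dsn, min_size=1, max_size=4)

    async def table_names(self):
        rows = await self.pool.fetch(
            "SELECT table_name FROM information_schema.tables "
            "WHERE table_schema = 'public' ORDER BY table_name"
        )
        return [row["table_name"] for row in rows]

    async def table_counts(self):
        counts = {}
        for table in await self.table_names():
            # Table names come from introspection rather than user input,
            # though real code should still quote them defensively
            row = await self.pool.fetchrow(f'SELECT count(*) FROM "{table}"')
            counts[table] = row["count"]
        return counts
```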
Introspecting columns in PostgreSQL:
```
In [3]: sql = """SELECT *
...: FROM information_schema.columns
...: WHERE table_schema = 'public'
...: AND table_name = 'blog_blogmark'
...: ;"""
```
Each column looks like this:
```
{'table_catalog': 'simonwillisonblog',
'table_schema': 'public',
'table_name': 'blog_blogmark',
'column_name': 'id',
'ordinal_position': 1,
'column_default': "nextval('blog_blogmark_id_seq'::regclass)",
'is_nullable': 'NO',
'data_type': 'integer',
'character_maximum_length': None,
'character_octet_length': None,
'numeric_precision': 32,
'numeric_precision_radix': 2,
'numeric_scale': 0,
'datetime_precision': None,
'interval_type': None,
'interval_precision': None,
'character_set_catalog': None,
'character_set_schema': None,
'character_set_name': None,
'collation_catalog': None,
'collation_schema': None,
'collation_name': None,
'domain_catalog': None,
'domain_schema': None,
'domain_name': None,
'udt_catalog': 'simonwillisonblog',
'udt_schema': 'pg_catalog',
'udt_name': 'int4',
'scope_catalog': None,
'scope_schema': None,
'scope_name': None,
'maximum_cardinality': None,
'dtd_identifier': '1',
'is_self_referencing': 'NO',
'is_identity': 'NO',
'identity_generation': None,
'identity_start': None,
'identity_increment': None,
'identity_maximum': None,
'identity_minimum': None,
'identity_cycle': 'NO',
'is_generated': 'NEVER',
'generation_expression': None,
 'is_updatable': 'YES'}
```
Finding out the primary keys for a table: https://wiki.postgresql.org/wiki/Retrieve_primary_key_columns
```
In [12]: await conn.fetch("""SELECT a.attname, format_type(a.atttypid, a.atttypmod) AS data_type
...: FROM pg_index i
...: JOIN pg_attribute a ON a.attrelid = i.indrelid
...: AND a.attnum = ANY(i.indkey)
...: WHERE i.indrelid = 'blog_blogmark'::regclass
...: AND i.indisprimary;""")
Out[12]: [<Record attname='id' data_type='integer'>]
```
Challenge: what's the equivalent for PostgreSQL of opening a database in read only mode? Will I have to talk users through creating read only credentials? Can I do this at runtime somehow? Can I detect if the connection has write permission and disable the arbitrary query feature?
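One partial answer worth exploring: asyncpg can set server settings at connection time, so the session could default to read-only transactions - though a client can still switch a transaction back to read-write, so this is a convenience rather than a security boundary, and genuinely read-only credentials would still be needed for untrusted SQL:

```python
import asyncpg

async def connect_read_only(dsn):
    # default_transaction_read_only makes new transactions read-only by default,
    # but it can be overridden per-transaction, so it is not a substitute for
    # connecting with a genuinely read-only database role.
    return await asyncpg.connect(
        dsn,
        server_settings={"default_transaction_read_only": "on"},
    )
```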
Got the database page working! It lists tables, their columns and their row count.
Got the table page partially working! It can list rows.
It can't apply filters yet (because PostgreSQL's `$1` parameters differ from SQLite's `:named` parameters) and faceting doesn't work because PostgreSQL requires that subqueries have an alias:
`subquery in FROM must have an alias HINT: For example, FROM (SELECT ...) [AS] foo.`
Still a pretty promising start though!
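A rough sketch of how the named-to-numbered parameter rewrite could work - the function name and regex are illustrative, and a real version would also need to handle `:names` appearing inside string literals:

```python
import re

def convert_params(sql, params):
    """Rewrite :name placeholders into asyncpg's $1, $2, ... style and
    build the matching positional argument list."""
    args = []

    def replace(match):
        args.append(params[match.group(1)])
        return f"${len(args)}"

    # Negative lookbehind skips PostgreSQL ::type casts
    converted = re.sub(r"(?<!:):([A-Za-z_][A-Za-z0-9_]*)", replace, sql)
    return converted, args

# convert_params("select * from blog_blogmark where id = :id", {"id": 1})
# -> ("select * from blog_blogmark where id = $1", [1])
```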
I'm currently using PostgREST to serve OpenAPI APIs off PostgreSQL databases. I would like to try out Datasette once this becomes available on PostgreSQL.
> Challenge: what's the equivalent for PostgreSQL of opening a database in read only mode? Will I have to talk users through creating read only credentials?
It looks like the answer to this is yes - I'll need users to set up read-only credentials. Here's a TIL about that: https://til.simonwillison.net/postgresql/read-only-postgresql-user
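For anyone following along, the shape of that setup looks roughly like this (role and database names are illustrative, and it's expressed via asyncpg only for consistency with the rest of this thread - the TIL has the actual recipe):

```python
import asyncpg

async def create_read_only_role(admin_dsn):
    # Connect with an admin/owner account, then grant SELECT-only access
    conn = await asyncpg.connect(admin_dsn)
    try:
        await conn.execute("CREATE ROLE datasette_ro LOGIN PASSWORD 'change-me'")
        await conn.execute("GRANT CONNECT ON DATABASE simonwillisonblog TO datasette_ro")
        await conn.execute("GRANT USAGE ON SCHEMA public TO datasette_ro")
        await conn.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO datasette_ro")
    finally:
        await conn.close()
```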
I've also been investigating serving PostgreSQL databases over PostgREST. I like the idea of hosting some static HTML + JS on GitHub, but having it backed by datasets I can update and control on the database server. I started from SQLite + Datasette but would like to host larger datasets (with smaller materialized views exposed publicly). I think the PostgREST model, where all the authorization and ownership is defined in database role grants, is really powerful. But I really miss being able to define an ad-hoc query in SQL and then instantly link to a JSON representation of it, like Datasette does.
P.S.: I've been sort of following along as you pop up in hacker news here and there. It's been great! Thanks for doing this all out in the open!
Worth mentioning here: I've been doing a ton of research around running Datasette-like functionality against PostgreSQL in my https://github.com/simonw/django-sql-dashboard project - which will definitely inform the Datasette implementation.
Is there any working version of Datasette on PostgreSQL?
Any progress on this? It would be very helpful on my end as well. Thanks!