datasette icon indicating copy to clipboard operation
datasette copied to clipboard

Complete refactor of TableView and table.html template

Open simonw opened this issue 3 years ago • 45 comments

Split from #878. The current TableView class is by far the most complex part of Datasette, and the most difficult to work on: https://github.com/simonw/datasette/blob/0.59.2/datasette/views/table.py

In #878 I started exploring a new pattern for building views. In doing so it became clear that TableView is the first beast that I need to slay - if I can refactor that into something neat the pattern for building other views will emerge as a natural consequence.

I've been trying to build this as a register_routes() plugin, as originally suggested in #870 - though unfortunately it looks like those plugins can't replace existing Datasette default views at the moment, see #1517. [UPDATE: I was wrong about this, plugins can over-ride default views just fine]

I also know that I want to have a fully documented template context for table.html as a major step on the way to Datasette 1.0, see #1510.

All of this adds up to the TableView factor being a major project that will unblock a whole flurry of other things - so I'm going to work on that in this separate issue.

simonw avatar Nov 19 '21 02:11 simonw

Here's where I got to with my hacked-together initial plugin prototype - it managed to render the table page with some rows on it (and a bunch of missing functionality such as filters): https://gist.github.com/simonw/281eac9c73b062c3469607ad86470eb2

fixtures__roadside_attractions__4_rows_and__11__Liked___Twitter

simonw avatar Nov 19 '21 02:11 simonw

Ideally I'd like to execute the existing test suite against the new implementation - that would require me to solve this so I can replace the view with the plugin version though:

  • #1517

simonw avatar Nov 19 '21 02:11 simonw

I was wrong about that, you CAN over-ride default routes already.

simonw avatar Nov 19 '21 03:11 simonw

A (likely incomplete) list of features on the table page:

  • [ ] Display table/database/instance metadata
  • [ ] Show count of all results
  • [ ] Display table of results
    • [ ] Special table display treatment for URLs, numbers
    • [ ] Allow plugins to modify table cells
    • [ ] Respect ?_col= and ?_nocol=
  • [ ] Show interface for filtering by columns and operations
  • [ ] Show search box, support executing FTS searches
  • [ ] Sort table by specified column
  • [ ] Paginate table
  • [ ] Show facet results
  • [ ] Show suggested facets
  • [ ] Link to available exports
  • [ ] Display schema for table
    • [ ] Maybe it should show the SQL for the query too?
  • [ ] Handle various non-obvious querystring options, like ?_where= and ?_through=

simonw avatar Nov 19 '21 03:11 simonw

My goal is to break up a lot of this functionality into separate methods. These methods can be executed in parallel by asyncinject, but more importantly they can be used to build a much better JSON representation, where the default representation is lighter and ?_extra=x options can be used to execute more expensive portions and add them to the response.

So the HTML version itself needs to be re-written to use those JSON extras.

simonw avatar Nov 19 '21 03:11 simonw

Right now the HTML version gets to cheat - it passes through objects that are not JSON serializable, including custom functions that can then be called by Jinja.

I'm interested in maybe removing this cheating - if the HTML version could only request JSON-serializable extras those could be exposed in the API as well.

It would also help cleanup the kind-of-nasty pattern I use in the current BaseView where everything returns both a bunch of JSON-serializable data AND an awaitable function that then gets to add extra things to the HTML context.

simonw avatar Nov 19 '21 03:11 simonw

... and while I'm doing all of this I can rewrite the templates to not use those cheating magical functions AND document the template context at the same time, refs:

  • #1510.

simonw avatar Nov 19 '21 03:11 simonw

Very confused by this piece of code here: https://github.com/simonw/datasette/blob/1c13e1af0664a4dfb1e69714c56523279cae09e4/datasette/views/table.py#L37-L63

I added it in https://github.com/simonw/datasette/commit/754836eef043676e84626c4fd3cb993eed0d2976 - in the new world that should probably be replaced by pure JSON.

Aha - this comment explains it: https://github.com/simonw/datasette/issues/521#issuecomment-505279560

I think the trick is to redefine what a "cell_row" is. Each row is currently a list of cells:

https://github.com/simonw/datasette/blob/6341f8cbc7833022012804dea120b838ec1f6558/datasette/views/table.py#L159-L163

I can redefine the row (the cells variable in the above example) as a thing-that-iterates-cells (hence behaving like a list) but that also supports __getitem__ access for looking up cell values if you know the name of the column.

The goal was to support neater custom templates like this:

{% for row in display_rows %}
  <h2 class="scientist">{{ row["First_Name"] }} {{ row["Last_Name"] }}</h2>
  ...

This may be an argument for continuing to allow non-JSON-objects through to the HTML templates. Need to think about that a bit more.

simonw avatar Nov 19 '21 17:11 simonw

I'm going to try leaning into the asyncinject mechanism a bit here. One method can execute and return the raw rows. Another can turn that into the default minimal JSON representation. Then a third can take that (or take both) and use it to inflate out the JSON that the HTML template needs, with those extras and with the rendered cells from plugins.

simonw avatar Nov 19 '21 17:11 simonw

This may be an argument for continuing to allow non-JSON-objects through to the HTML templates. Need to think about that a bit more.

I can definitely support this using pure-JSON - I could make two versions of the row available, one that's an array of cell objects and the other that's an object mapping column names to column raw values.

simonw avatar Nov 19 '21 18:11 simonw

Two new requirements inspired by work on the datasette-table (and datasette-notebook) projects:

  • #1533
  • #1534

simonw avatar Nov 28 '21 21:11 simonw

I'm also going to use the new datasette-table Web Component to help guide the design of the new API, which relates directly to this issue too:

  • #1532

simonw avatar Nov 28 '21 21:11 simonw

Aside: is there any reason this work can't complete the long-running goal of merging the TableView and QueryView, such that most of the features available for tables become available for arbitrary queries too?

I had already mentally committed to implementing facets for queries, but I just realized that filters could work too - using either a CTE or a nested query.

Pagination is the one holdout here, since table pagination uses keyset pagination over a known order. But maybe arbitrary queries can only be paginated off you order them first?

simonw avatar Nov 28 '21 23:11 simonw

(I could experiment with merging the two tables by adding a temporary undocumented ?_sql= parameter to the in-progress table view that sets an alternative query instead of select cols from table - added bonus, this will force me to use introspection against the returned columns rather than mixing in the known columns for the specified table)

simonw avatar Nov 28 '21 23:11 simonw

If I break this up into @inject methods, what methods could I have and what would they do?

  • resolve_path: Use request path to resolve the database and table. Could handle hash URLs too (if I don't manage to extract those to a plugin) - would be nice if this could raise a redirect, but I think that will instead have to be one of the things it returns
  • build_sql: Builds the SQL query based on the querystring (and some DB introspection)
  • execute_count: Execute the count(*)
  • execute_rows: Execute the limit 101 to fetch the rows
  • execute_facets: Execute all requested facets (could this do its own asyncio.gather() to run facets in parallel?)
  • suggest_facets: Execute facet suggestions

Are there any plugin hooks that would make sense to execute in parallel? Actually there might be: I don't think extra_template_vars, extra_css_urls, extra_js_urls, extra_body_script depend on each other so it might be possible to execute them in a parallel chunk (at least any of them that return awaitables).

simonw avatar Dec 10 '21 20:12 simonw

I have a hunch that the conclusion of this experiment may end up being that the asyncinject trick is kinda neat but the code will be easier to maintain (while still executing in parallel) if it's written using asyncio.gather directly instead.

It's possible asyncinject will end up being neat enough that I'll want to keep it though.

simonw avatar Dec 12 '21 01:12 simonw

Rebuilding TableView from the ground up is proving not to be much fun. I'm going to explore starting the refactor of the existing code by separating out the bit that generates the SQL query from the rest of it.

simonw avatar Dec 12 '21 02:12 simonw

The tests for TableView are currently mixed in with everything else in tests/test_api.py and tests/html.py - might be good to split those out into test_table_html.py and test_table_api.py since they're such a key part of how Datasette works.

simonw avatar Dec 12 '21 02:12 simonw

I don't think this code is necessary any more: https://github.com/simonw/datasette/blob/492f9835aa7e90540dd0c6324282b109f73df71b/datasette/views/table.py#L396-L399

That dates back from when Datasette was built on top of Sanic and Sanic didn't preserve those query parameters the way I needed it to:

https://github.com/simonw/datasette/blob/1f69269fe93e4cd42e56890126cc0dbcf719c6cb/datasette/views/table.py#L202-L206

simonw avatar Dec 12 '21 03:12 simonw

No, removing that gave me the following test failure:

tests/test_table_api.py::test_table_filter_queries[/fixtures/simple_primary_key.json?content__exact=-expected_rows2] FAILED                                       [100%]

=============================================================================== FAILURES ================================================================================
______________________________________ test_table_filter_queries[/fixtures/simple_primary_key.json?content__exact=-expected_rows2] ______________________________________

app_client = <datasette.utils.testing.TestClient object at 0x10d45d2d0>, path = '/fixtures/simple_primary_key.json?content__exact=', expected_rows = [['3', '']]

    @pytest.mark.parametrize(
        "path,expected_rows",
        [
            ("/fixtures/simple_primary_key.json?content=hello", [["1", "hello"]]),
            (
                "/fixtures/simple_primary_key.json?content__contains=o",
                [
                    ["1", "hello"],
                    ["2", "world"],
                    ["4", "RENDER_CELL_DEMO"],
                ],
            ),
            ("/fixtures/simple_primary_key.json?content__exact=", [["3", ""]]),
            (
                "/fixtures/simple_primary_key.json?content__not=world",
                [
                    ["1", "hello"],
                    ["3", ""],
                    ["4", "RENDER_CELL_DEMO"],
                    ["5", "RENDER_CELL_ASYNC"],
                ],
            ),
        ],
    )
    def test_table_filter_queries(app_client, path, expected_rows):
        response = app_client.get(path)
>       assert expected_rows == response.json["rows"]
E       AssertionError: assert [['3', '']] == [['1', 'hello'],\n ['2', 'world'],\n ['3', ''],\n ['4', 'RENDER_CELL_DEMO'],\n ['5', 'RENDER_CELL_ASYNC']]
E         At index 0 diff: ['3', ''] != ['1', 'hello']
E         Right contains 4 more items, first extra item: ['2', 'world']
E         Full diff:
E           [
E         -  ['1',
E         -   'hello'],
E         -  ['2',
E         -   'world'],
E            ['3',
E             ''],
E         -  ['4',
E         -   'RENDER_CELL_DEMO'],
E         -  ['5',
E         -   'RENDER_CELL_ASYNC'],
E           ]

/Users/simon/Dropbox/Development/datasette/tests/test_table_api.py:511: AssertionError

simonw avatar Dec 12 '21 03:12 simonw

Idea: in JSON output include a warnings block listing any _ parameters that were not recognized.

simonw avatar Dec 12 '21 22:12 simonw

Built a new plugin to help with this work by improving the display of ?_trace=1 output: https://datasette.io/plugins/datasette-pretty-traces

image

simonw avatar Dec 13 '21 19:12 simonw

Useful old comment here: https://github.com/simonw/datasette/issues/617#issuecomment-552253893

As noted in #621 (comment) a common pattern in this method is blocks of code that append new items to the where_clauses, params and extra_human_descriptions arrays. This is a useful refactoring opportunity.

Code that fits this pattern:

  • The code that builds based on the filters: where_clauses, params = filters.build_where_clauses(table) and human_description_en = filters.human_description_en(extra=extra_human_descriptions)
  • Code that handles ?_where=: where_clauses.extend(request.args["_where"]) - though note that this also appends to a extra_wheres_for_ui array which nothing else uses
  • The _through= code, see Syntax for ?_through= that works as a form field #621 for details
  • The code that deals with ?_search= FTS

The keyset pagination code modifies where_clauses and params too, but I don't think it's quite going to work with the same abstraction that would cover the above examples.

simonw avatar Dec 13 '21 23:12 simonw

  • table_actions should be an extra.

simonw avatar Dec 14 '21 17:12 simonw

Maybe a better way to approach this would be to focus on the JSON side of things - try to get a basic JSON version with ?_extra= support working, then eventually build that up to the point where it can power the HTML version.

simonw avatar Dec 14 '21 21:12 simonw

There are actually four forms of SQL query used by the table page:

  • from_sql - just the from table_name where ...
  • sql_no_order_no_limit - used for faceting, "select {select_all_columns} from {table_name} {where}"
  • sql - the above but with order and limit clauses: "select {select_specified_columns} from {table_name} {where}{order_by} limit {page_size}{offset}"
  • count_sql used for the count, built out of from_sql: "select count(*) {from_sql}"

I'm tempted to encapsulate those in a Query class.

simonw avatar Dec 14 '21 22:12 simonw

Should facets really not be displayed on pages past page one (where ?_next= is set)? That made sense to me at the time, but I'm now having second thoughts about it.

I guess it's a useful performance tweak for when crawlers keep hitting the ?_next= link.

Actually it looks like facets DO display on subsequent pages, e.g. on https://global-power-plants.datasettes.com/global-power-plants/global-power-plants?_next=200 - but facet suggestions do not, thanks to this code: https://github.com/simonw/datasette/blob/2c07327d23d9c5cf939ada9ba4091c1b8b2ba42d/datasette/views/table.py#L777-L785

simonw avatar Dec 16 '21 21:12 simonw

A fundamental operation of this view is to construct the SQL query and accompanying human description based on the incoming query string parameters.

The human description is the bit at the top of https://latest.datasette.io/fixtures/searchable?_search=dog&_sort=pk&_facet=text2&text2=sara+weasel that says:

1 row where search matches "dog" and text2 = "sara weasel" sorted by pk

(Also used in the page <title>).

The code actually gathers three things:

  • Fragments of the where clause, for example "text2" = :p0
  • Parameters, e.g. {"p0": "sara weasel"}
  • Human description components, e.g. text2 = "sara weasel"

Some operations such as ?_where= don't currently provide an extra human description component.

_where= also doesn't populate a parameter, but maybe it could? Would be neat if in the future ?_where=foo+=+:bar worked and added a bar input field to the screen, as seen with custom queries.

simonw avatar Dec 16 '21 21:12 simonw

I added a ton of comments to the data() method which really helps get a better feel for how this all works: https://github.com/simonw/datasette/blob/0663d5525cc41e9260ac7d1f6386d3a6eb5ad2a9/datasette/views/table.py#L322

simonw avatar Dec 16 '21 21:12 simonw

Is there an opportunity to refactor things using a new plugin hook here? Maybe the register_filters hook from #473, where the hook becomes responsible for building where clauses (and human descriptions of them) based on the incoming query string.

That version dealt with Filter classes, but those might be a bit too low-level for this.

?_spatial_within=GEOJSON was an interesting idea attached to that issue.

simonw avatar Dec 16 '21 22:12 simonw