stacks-blockchain-api
Document the Postgres DB columns
**Is your feature request related to a problem? Please describe.** As part of our Stacks grant-winning product, the Novum team will need to perform more advanced data querying over and above what is available via the API endpoints. This will mean running SQL queries directly on the database which underlies the API node.
**Describe the solution you'd like** Per-table, per-column documentation of the data that gets stored in the underlying Postgres DB. This would allow users to interact with the rich dataset with much more confidence and fully understand what the data means. This is especially true for the binary DB fields, whose contents cannot be easily inspected with DB explorer software (e.g. Sequel Pro or TablePlus).
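For illustration, this is the kind of manual decoding currently needed for binary fields. A minimal sketch using node-postgres, where the `txs.tx_id` table/column names are assumptions made for the example:

```ts
// Minimal sketch: inspecting a bytea column with node-postgres ("pg").
// The txs.tx_id table/column names are assumptions for this example.
import { Client } from 'pg';

async function inspectBinaryColumn(): Promise<void> {
  const client = new Client({ connectionString: process.env.PG_CONNECTION_URI });
  await client.connect();
  // node-postgres returns bytea values as Node.js Buffers.
  const res = await client.query('SELECT tx_id FROM txs LIMIT 5');
  for (const row of res.rows) {
    const txId: Buffer = row.tx_id;
    console.log(`0x${txId.toString('hex')}`); // hex-encode for readability
  }
  await client.end();
}

inspectBinaryColumn().catch(console.error);
```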
**Additional context** We are building an exhaustive data dashboard for the Stacks blockchain, complete with a vast array of usage, health, and trading signals. Grant application (successful): https://github.com/stacksgov/Stacks-Grants/issues/138
Hi @albydeca, a couple thoughts around this request:
The postgres schema is ever-evolving, and it's effectively an emergent property of the sql migration pattern used in this repo. This means that it would be somewhat challenging to document today's schema and ensure it stays updated as the codebase is updated. It would at least require writing some new types of tests where a bootstrapped postgres db can be tested against the documented schema, or possibly something where documentation could be autogenerated from the fully bootstrapped schema.
Here's an example to illustrate how sql migration patterns "define" a schema:
- `migration-file-4.ts` defines a new column storing hex-encoded zonefiles as a `string` type.
- A couple months later, `migration-file-7.ts` alters that column to a `bytea` (binary) format for performance purposes, and has some logic to perform the change on existing data.
- Another couple months later, `migration-file-12.ts` drops that column entirely and moves all the existing data into a new table.
- Another couple months later, the last migration decision is deemed problematic, and rolled back in `migration-file-19.ts`.
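To make that concrete, here's a hypothetical sketch of what the second step might look like, assuming node-pg-migrate (the migration lib used in this repo). The file, table, and column names are invented for illustration:

```ts
// Hypothetical migration sketch (node-pg-migrate): converting a
// hex-encoded string column to bytea, decoding existing rows in place.
// Table/column names are invented for illustration.
import { MigrationBuilder } from 'node-pg-migrate';

export async function up(pgm: MigrationBuilder): Promise<void> {
  pgm.alterColumn('zonefiles', 'zonefile', {
    type: 'bytea',
    // `using` maps to Postgres's ALTER COLUMN ... USING clause,
    // which rewrites the existing string data as binary.
    using: "decode(zonefile, 'hex')",
  });
}

export async function down(pgm: MigrationBuilder): Promise<void> {
  // Roll back to the hex-string representation.
  pgm.alterColumn('zonefiles', 'zonefile', {
    type: 'text',
    using: "encode(zonefile, 'hex')",
  });
}
```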
As you can imagine, this is challenging enough to deal with internally. Attempts to maintain docs and deal with external usage (compatibility, release notices, etc.) would add unacceptable overhead.
Second, the API is structured so that the backing datastore is (at least theoretically) db-agnostic. There's certainly a possible future where the `DataStore` interface could be storing data in a different db, or even in multiple services like redis+postgres.
Most importantly, the postgres datastore is strictly an implementation that is internal to this repo, and I'm not comfortable making its schema a public interface that is intended to be consumed by other projects, maintaining additional documentation for the interface, dealing with breaking changes affecting other projects, etc.
I haven't looked at your project in detail, but here are a few possible solutions:
A. Depending on what you want to query from the API, create PRs with new sql queries and associated endpoints for the data you want. Just make sure to discuss them here first so we can make sure they aren't out-of-scope for this repo.
B. If what you need is out-of-scope, you can fork this repo for your own usages. Or even implement an alternative event-observer for the purposes of your project. The hardest part is handling re-orgs, but depending on your requirements, it may not be prohibitively difficult.
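For a sense of what option B involves, here's a bare-bones event-observer sketch in TypeScript. The `/new_block` path and `block_hash` field are assumptions based on the stacks-node event dispatcher and should be verified against its docs; re-org handling, the hard part, is omitted entirely:

```ts
// Bare-bones event-observer sketch using Node's built-in http module.
// The stacks-node event dispatcher POSTs JSON payloads to configured
// endpoints; /new_block and the block_hash field are assumptions here.
// Re-org handling (the hard part) is deliberately omitted.
import * as http from 'http';

const server = http.createServer((req, res) => {
  if (req.method === 'POST' && req.url === '/new_block') {
    let body = '';
    req.on('data', chunk => (body += chunk));
    req.on('end', () => {
      const payload = JSON.parse(body);
      console.log('received block:', payload.block_hash);
      // TODO: persist whatever parts of the payload your project needs.
      res.writeHead(200);
      res.end();
    });
  } else {
    // Acknowledge other event paths so the node isn't blocked.
    res.writeHead(200);
    res.end();
  }
});

server.listen(3700, () => console.log('event observer listening on :3700'));
```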
I'm happy to help answer any questions around the postgres columns, and I'm certainly in favor of adding more code comments around the postgres migration files and sql queries for the purpose of improving the development experience within this repo. But I'm not sold on attempting to document the postgres schema (for public usage at least) and establishing it as something to be depended on as you have described.
cc @CharlieC3 @wileyj @rafaelcr for more opinions on the issue.
I don't think my or @wileyj's opinion on this matter should weigh as much as yours or @rafaelcr's, but I may be able to offer a bit more perspective.
There are already quite a few DB documentation tools that can auto-generate schema docs as HTML and PDFs. For example, SchemaSpy is a pretty popular open-source one that supports Postgres. Sample db report here.
Thus, anyone should be able to generate their own db docs on-the-fly pretty quickly. They would just need to launch the API first so the DB gets created, then run a DB doc tool on it.
@albydeca are you able to do this to unblock yourself?
Additionally, this seems pretty attainable for us to configure via GitHub Actions, where a new commit to `master` would use something like SchemaSpy to generate updated docs and post them to the GH Pages for this repo, similar to the API docs. Once configured, I don't think it would incur any additional overhead unless something breaks (which is probably rare).
I think we would just have to make it clear that the API's DB schema is rapidly evolving, and that anyone wishing to create new projects depending on the API's DB schema is fully responsible for ensuring compatibility for their service, and as such should write and maintain tests for their service which validate against the API's DB schema.
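Such a compatibility test could be fairly lightweight. For example, a sketch using node-postgres and `information_schema` (table/column names are illustrative) that fails when an expected column disappears:

```ts
// Sketch of a downstream schema-compatibility test using information_schema.
// Table/column names are illustrative; adapt to whatever your service reads.
import { Client } from 'pg';

const EXPECTED: Record<string, string[]> = {
  txs: ['type_id', 'status', 'canonical'],
  mempool_txs: ['pruned', 'origin_hash_mode'],
};

async function checkSchema(): Promise<void> {
  const client = new Client({ connectionString: process.env.PG_CONNECTION_URI });
  await client.connect();
  for (const [table, columns] of Object.entries(EXPECTED)) {
    const res = await client.query(
      'SELECT column_name FROM information_schema.columns WHERE table_name = $1',
      [table]
    );
    const actual = new Set(res.rows.map(r => r.column_name));
    for (const col of columns) {
      if (!actual.has(col)) {
        throw new Error(`expected column ${table}.${col} is missing`);
      }
    }
  }
  await client.end();
}

checkSchema().catch(err => { console.error(err); process.exit(1); });
```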
I'm not opposed to @CharlieC3's suggestion if people would find that helpful -- it looks like SchemaSpy uses pg column comments to generate docs, and that appears to be supported in the pg migration lib we're using:
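A minimal sketch of that approach, using node-pg-migrate's raw-SQL escape hatch (`pgm.sql`); the comment text and column choice below are just illustrative:

```ts
// Sketch: adding a column comment from a node-pg-migrate migration.
// COMMENT ON COLUMN is plain Postgres DDL; pgm.sql just executes it.
// SchemaSpy and similar tools pick these comments up when generating docs.
// The comment text below is illustrative, not authoritative.
import { MigrationBuilder } from 'node-pg-migrate';

export async function up(pgm: MigrationBuilder): Promise<void> {
  pgm.sql(`COMMENT ON COLUMN txs.origin_hash_mode IS
    'Hash mode of the tx sender''s spending condition (e.g. single-sig vs multi-sig)'`);
}

export async function down(pgm: MigrationBuilder): Promise<void> {
  pgm.sql(`COMMENT ON COLUMN txs.origin_hash_mode IS NULL`);
}
```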
@albydeca does that work for you, given the above comments about maintenance? If so, I can add some initial comments for the binary columns and columns that you have explicit questions about.
Hi all, thanks for your suggestions.
I totally understand that the db schema will be constantly changing, and agree that it makes sense to hold off creating official documentation if you do not wish for it to be depended on publicly. However, the DB itself holds a lot of extra interesting data that could benefit our and other projects' analyses, and ultimately provide a great resource for the growing dev community.
As for our project, we are looking to implement both a web application and a series of charts/reports surrounding the Stacks blockchain and ecosystem, to be published to our social media. We can deliver all the data and charts currently present on the mockup linked from the GitHub issue of our project, but we could potentially build more advanced charts if we (and I include my non-technical colleagues) understood the meaning of some columns, including:
- `asset_event_type_id` in `stx_events`, `ft_events`, `nft_events`
- `pruned`, `type_id`, `status`, `origin_hash_mode` in `mempool_txs`
- `type_id`, `status`, `origin_hash_mode`, `event_count` in `txs`
- `canonical`
@zone117x I'd be more than happy to help you with contributions to the docs should you guys decide to go with SchemaSpy, which I think is a great idea, or any other tool really :)
I think next steps here are:
- @CharlieC3 to help with configuring SchemaSpy within GitHub Actions to auto-generate and publish some HTML (probably to a sub-directory in the GitHub Pages docs).
- @zone117x to add some initial column comments for the above-mentioned columns.
hi guys, is there an update about this?
@albydeca No updates at the moment. No need to wait for this to be closed though, feel free to give it a shot if you have the time!