graph-node
New DB representation for data sources
The DB storage of dynamic data sources needs to change to reflect that graph-node is now multichain and to fulfill the requirements of file data sources.
Dynamic data sources are currently stored in `subgraphs.dynamic_ethereum_contract_data_source`. This proposes that each subgraph will gain a `sgd*.data_sources$` table under its own DB namespace. The proposed schema for this table is:
```sql
create table sgd*.data_sources$ (
    vid integer generated by default as identity primary key,
    block_range int4range not null,
    access_group integer generated by default as identity,
    manifest_idx integer not null,
    parent integer references sgd*.data_sources$,
    id bytea,
    param bytea,
    context jsonb
);
```
Each field is explained and justified below. This is a good time to bikeshed on all of their names.
vid
: Just a primary key, has no particular meaning.
block_range
: The use of a block range is meant to allow an efficient implementation of data source removal, which is not currently a feature but quickly becomes necessary for use cases such as replacing the IPFS file associated with a user profile. Since we anticipate that the query "which data sources are live at this block" will be relevant, a range makes sense compared to separate start and end block fields.
access_group
: Works as described here in the 'data source independence' section. This will also be added to entity tables. We let the DB handle generating new access groups from a sequence when necessary, since these need not be deterministic.
manifest_idx
: Serves the purpose that `name` currently serves: to look up the definition in the manifest. This more efficient representation stores the position in the manifest rather than the name.
id
: An optional user-given id, in case the user needs to refer to a specific data source after creation, such as for removal.
param
: The stored representation of the `params` argument passed on creation. Examples of stored data: address for Ethereum, account id for NEAR, file CID for file data sources. A `bytea` representation is optimal for current uses, and we can always add a `jsonb` alternative if required in the future.
context
: Optional data source context, as it exists today.
parent
: The data source which triggered the creation of this data source. Tracking parent-child relationships will likely be required for features such as IPFS data sources created from other data sources. It's assumed in this design that each data source has a single parent.
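To make the `block_range` semantics concrete, here is a sketch of the two operations it enables, assuming an illustrative schema `sgd1` and made-up block numbers and ids. The `int4range` type and the `@>` containment operator are Postgres built-ins; the removal pattern below is only a sketch of the intended behaviour, not an implemented API:

```sql
-- Which data sources are live at block 1234567?
select vid, manifest_idx, param
  from sgd1.data_sources$
 where block_range @> 1234567;

-- "Remove" a data source at block 1234567 by closing its range
-- rather than deleting the row, so reverts remain possible.
update sgd1.data_sources$
   set block_range = int4range(lower(block_range), 1234567)
 where id = '\xabcdef'::bytea;
```

A GiST index on `block_range` would make the containment query efficient, which is the motivation for using a range over separate start and end columns.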
Unblocks: #3072 #3202
Ping @lutter @maoueh for review.
This looks good. A few small comments:
- the table should be called `data_source$` since `$` is legal in PG identifiers and can't clash with names coming from GraphQL (otherwise the table name could conflict with an entity type `DataSource`)
- does `access_group` need to be an `int8` or could we get away with an `int4`?
- similarly, could `manifest_idx` be an `int2`?
- will `id` be used to search for data sources? Could we get away with a `bytea` for it?
The `access_group` space needs to be as large as the data source space. But I think we should make `vid`, `access_group` and `parent` all `int4`; data sources are all kept in memory, so it seems unrealistic to support billions of them, even if we're aiming to support millions. Making `manifest_idx` an `int2` is a small saving, and it's not unthinkable that someone would generate a manifest with over 32k data sources.
About `id`, this is a DX question since we're choosing between `String` or `Bytes` when we eventually have an AS interface for setting it. Using `Bytes` does seem like it may be a common case, for storing CIDs or contract addresses, and strings can always be converted to bytes, so I agree on making it `bytea`.
That's pretty good, how will migrations be handled?
My current thinking is to not migrate existing subgraphs, so graph node will continue to support the existing schema.
Another option for migration would be to do it during `start_subgraph_deployment`: if the `data_sources` table does not exist yet, create it and populate it from the current table. It might be that not having to accommodate looking in two different storage schemes helps keep the code simpler.
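Such a per-deployment migration could be sketched roughly as below. The legacy column names (`address`, `name`) and the `manifest_positions` lookup are purely illustrative, not the actual schema of `subgraphs.dynamic_ethereum_contract_data_source`:

```sql
-- Run once per deployment inside start_subgraph_deployment,
-- within a transaction, only if sgd1.data_sources$ was just created.
insert into sgd1.data_sources$ (block_range, manifest_idx, param, context)
select d.block_range,
       m.idx,        -- position of the data source template in the manifest
       d.address,    -- hypothetical legacy column holding the contract address
       d.context
  from subgraphs.dynamic_ethereum_contract_data_source d
  join manifest_positions m on m.name = d.name;  -- hypothetical name-to-index lookup
```

The key step is mapping the legacy `name` to a `manifest_idx`, since the new table stores the position in the manifest rather than the name.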
It would make the code cleaner, though it adds risk to the migration itself. One thing to consider is that when doing the migration we need to block all indexing that could be happening concurrently, and make sure the index nodes are updated before continuing. And the downtime this migration could impose on hosted, due to the number of subgraphs, might not be acceptable.
That's why I suggested doing it in `start_subgraph_deployment`, since you know indexing isn't happening at that point. Yes, it will slow down startup; you are right that it'll be hard to predict what the load on the system will be, and we'd have to figure out a way to quantify that. I don't think it'll introduce a lot of downtime, since the writes that need to happen are not that different from what a subgraph does when it writes 'normal' entities.
Ah, I see now. Then a remaining issue would be whether we should have code ready to revert the migration, or whether we should rely on the integration testing to give us enough confidence to migrate without the option of going back.
My concern is about backwards compatibility with previous graph node versions. It's been a while since we've made a change that breaks the possibility of downgrading, and it's dangerous if downgrading corrupts the DB. The migration doesn't really need to be reverted since it's not destructive; it only copies data to a new table. One way to maintain backwards compatibility would be to write ethereum data sources both to the legacy `subgraphs.dynamic_ethereum_contract_data_source` table and to the new table.
I think this data is only relevant during indexing, so could this be handled by apiVersioning? (file data sources will necessitate a new apiVersion)
This change can be made without changing any existing behaviour, so it doesn't need a new api version.
Yes, but the functionality which requires it (file data sources) will require a new api version. If we want to maintain backwards compatibility, either the new Graph Node versions write to the legacy table, or versioning determines which table is written to for a given subgraph (older Graph Node versions won't be able to index the new apiVersion anyway, so we don't need to worry about rolling back). I don't feel strongly, just felt like this was an alternative path. cc @dizzyd who I was talking to yesterday about the various versions in Graph Node
That's possible, but makes it less likely that we'll ever remove the code path which supports the legacy table.
After discussions with @lutter, the latest plan is:
- Both schema versions will be supported in graph node, at least until all of hosted is migrated and the migration has been included in a release.
- This means supporting both schemas for both indexing and grafting subgraphs.
- The migration in hosted will be triggered on a per-shard configuration, so we have a chance to restore from a backup if something goes very wrong.
@leoyvens, can our issue be related to this one?
Deploy same subgraph to multiple chains using single database #3580
This is implemented for new deployments. The remaining work tracked by this issue would be to migrate existing deployments to the new schema.
It's been a year and no particular need to actively move subgraphs to the new schema has been identified, so I'm closing.