graph-node
New DB representation for data sources
The DB storage of dynamic data sources needs to change to reflect that graph-node is now multichain and to fulfill the requirements of file data sources.
Dynamic data sources are currently stored in `subgraphs.dynamic_ethereum_contract_data_source`. This proposes that each subgraph will gain a `sgd*.data_sources$` table under its own DB namespace. The proposed schema for this table is:
```sql
create table sgd*.data_sources$ (
    vid integer generated by default as identity primary key,
    block_range int4range not null,
    access_group integer generated by default as identity,
    manifest_idx integer not null,
    parent integer references sgd*.data_sources$,
    id bytea,
    param bytea,
    context jsonb
);
```
Each field is explained and justified below. This is a good time to bikeshed on all of their names.
vid
: Just a primary key, has no particular meaning.
block_range
: The use of a block range is meant to allow an efficient implementation of data source removal, which is not currently a feature but quickly becomes necessary for use cases such as replacing the IPFS file associated with a user profile. Since we anticipate that the query "which data sources are live at this block" will be relevant, a range makes sense compared to separate start and end block fields.
access_group
: Works as described here in the 'data source independence' section. This will also be added to entity tables. We let the DB handle generating new access groups from a sequence when necessary, since these need not be deterministic.
manifest_idx
: Serves the purpose that `name` currently serves: to look up the definition in the manifest. This more efficient representation stores the position in the manifest rather than the name.
id
: An optional user-given id, in case the user needs to refer to a specific data source after creation, such as for removal.
param
: The stored representation of the `params` argument passed on creation. Examples of stored data: address for Ethereum, account id for NEAR, file CID for file data sources. A `bytea` representation is optimal for current uses, and we can always add a `jsonb` alternative if required in the future.
context
: Optional data source context, as it exists today.
parent
: The data source which triggered the creation of this data source. Tracking parent-child relationships will likely be required for features such as IPFS data sources created from other data sources. It's assumed in this design that each data source has a single parent.
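To make the `block_range` semantics concrete, here is a sketch of the two operations it enables, assuming an illustrative schema `sgd1` and made-up block numbers and ids. The `int4range` type and the `@>` containment operator are Postgres built-ins; the removal pattern below is only a sketch of the intended behaviour, not an implemented API:

```sql
-- Which data sources are live at block 1234567?
select vid, manifest_idx, param
  from sgd1.data_sources$
 where block_range @> 1234567;

-- "Remove" a data source at block 1234567 by closing its range
-- rather than deleting the row, so reverts remain possible.
update sgd1.data_sources$
   set block_range = int4range(lower(block_range), 1234567)
 where id = '\xabcdef'::bytea;
```

A GiST index on `block_range` would make the containment query efficient, which is the motivation for using a range over separate start and end columns.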
Unblocks: #3072 #3202
Ping @lutter @maoueh for review.
This looks good. A few small comments:
- the table should be called `data_source$` since `$` is legal in PG identifiers and can't clash with names coming from GraphQL (otherwise the table name could conflict with an entity type `DataSource`)
- does `access_group` need to be an `int8` or could we get away with an `int4`?
- similarly, could `manifest_idx` be an `int2`?
- will `id` be used to search for data sources? Could we get away with a `bytea` for it?
The `access_group` space needs to be as large as the data source space. But I think we should make `vid`, `access_group` and `parent` all `int4`; data sources are all kept in memory, so it seems unrealistic to support billions of them, even if we're aiming to support millions. Making `manifest_idx` an `int2` is a small saving, and it's not unthinkable that someone would generate a manifest with over 32k data sources.
About `id`, this is a DX question since we're choosing between `String` or `Bytes` when we eventually have an AS interface for setting it. Using `Bytes` does seem like it may be a common case, for storing CIDs or contract addresses, and strings can always be converted to bytes, so I agree on making it `bytea`.
That's pretty good, how will migrations be handled?
My current thinking is to not migrate existing subgraphs, so graph node will continue to support the existing schema.
Another option for migration would be to do it during `start_subgraph_deployment`: if the `data_sources` table does not exist yet, create it and populate it from the current table. It might be that not having to accommodate looking in two different storage schemes helps keep the code simpler.
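Such a per-deployment migration could be sketched roughly as below. The legacy column names (`address`, `name`) and the `manifest_positions` lookup are purely illustrative, not the actual schema of `subgraphs.dynamic_ethereum_contract_data_source`:

```sql
-- Run once per deployment inside start_subgraph_deployment,
-- within a transaction, only if sgd1.data_sources$ was just created.
insert into sgd1.data_sources$ (block_range, manifest_idx, param, context)
select d.block_range,
       m.idx,        -- position of the data source template in the manifest
       d.address,    -- hypothetical legacy column holding the contract address
       d.context
  from subgraphs.dynamic_ethereum_contract_data_source d
  join manifest_positions m on m.name = d.name;  -- hypothetical name-to-index lookup
```

The key step is mapping the legacy `name` to a `manifest_idx`, since the new table stores the position in the manifest rather than the name.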
It would make the code cleaner, though it adds risk to the migration itself. One thing to consider is that when doing the migration we need to block all indexing that could be happening concurrently, and make sure the index nodes are updated before continuing. And the downtime this migration could impose on hosted, due to the number of subgraphs, might not be acceptable.
That's why I suggested doing it in `start_subgraph_deployment`, since you know indexing isn't happening at that point. Yes, it will slow down startup; you are right that it'll be hard to predict what the load on the system will be, and we'd have to figure out a way to quantify that. I don't think it'll introduce a lot of downtime, since the writes that need to happen are not that different from what a subgraph does when it writes 'normal' entities.
Ah, I see now. Then a remaining issue would be whether we should have code ready to revert the migration, or whether we should rely on the integration testing to give us enough confidence to migrate without the option of going back.
My concern is about backwards compatibility with previous graph node versions. It's been a while since we've made a change that breaks the possibility of downgrading, and it's dangerous if downgrading corrupts the DB. The migration doesn't really need to be reverted since it's not destructive; it only copies data to a new table. One way to maintain backwards compatibility would be to write ethereum data sources both to the legacy `subgraphs.dynamic_ethereum_contract_data_source` table and to the new table.
I think this data is only relevant during indexing, so could this be handled by apiVersioning? (file data sources will necessitate a new apiVersion)
This change can be made without changing any existing behaviour, so it doesn't need a new api version.
Yes, but the functionality which requires it (file data sources) will require a new api version. If we want to maintain backwards compatibility, either the new Graph Node versions write to the legacy table, or versioning determines which table is written to for a given subgraph (older Graph Node versions won't be able to index the new apiVersion anyway, so we don't need to worry about rolling back). I don't feel strongly, just felt like this was an alternative path. cc @dizzyd who I was talking to yesterday about the various versions in Graph Node
That's possible, but makes it less likely that we'll ever remove the code path which supports the legacy table.
After discussions with @lutter, the latest plan is:
- Both schema versions will be supported in graph node, at least until all of hosted is migrated and the migration has been included in a release.
- This means supporting both schemas for both indexing and grafting subgraphs.
- The migration in hosted will be triggered on a per-shard configuration, so we have a chance to restore from a backup if something goes very wrong.
@leoyvens, can our issue be related to this one?
Deploy same subgraph to multiple chains using single database #3580
This is implemented for new deployments. The remaining work tracked by this issue would be to migrate existing deployments to the new schema.
It's been a year and no particular need to actively move subgraphs to the new schema has been identified, so I'm closing.