graph-node
Multiple data sources
Summary
So far we only support Ethereum contracts as data sources and event handlers as mappings. As a consequence, the code that wires up subgraphs for indexing is kind of special-cased (e.g. everything is based on the block stream). We aim to extend subgraphs to support other data sources like IPFS files and make it easier to add more data source types over time.
This issue describes the requirements for supporting multiple data source types and the changes proposed to implement them.
Requirements
- Need to support multiple data source types (e.g. ethereum/contract, ipfs/file etc.).
- Supported data source types have some fields in common (like kind, name and mapping); other fields are unique to the data source type (sketched below).
- Data sources of all types are started and stopped with the subgraph deployment.
- Data sources require their own state.
- The overall progress of a subgraph needs to be calculated from all of its data sources.
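For illustration, the shared fields could be factored into a struct that every concrete data source type embeds; a minimal Rust sketch where BaseDataSource, Mapping and the field types are invented names, not existing code:

// Illustrative only: shared manifest fields in one struct that each
// concrete data source type embeds.
struct Mapping; // stand-in for the existing mapping definition

struct BaseDataSource {
    kind: String, // e.g. "ethereum/contract" or "ipfs/file"
    name: String,
    mapping: Mapping,
}

// Type-specific fields live next to the shared ones.
struct EthereumContractDataSource {
    base: BaseDataSource,
    address: String,
}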
Proposed Changes
I propose that the implementation take place in two phases. First, implement block streaming, state management and progress information per data source rather than per subgraph. Then, extend the system to support generic (potentially different) data sources.
Phase 1
- Make SubgraphInstanceManager and SubgraphInstance instantiate a DataSourceInstance for every data source.
- Add a DataSourceIndexer trait that is responsible for indexing a particular type of data source (see the sketch after this list). Add an EthereumContractInstance version that operates based on a block stream.
- Add a Store helper to read the state of a data source of a subgraph as a From<serde_json::Value> or similar.
- Change block pointers to be per data source instead of per subgraph.
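A rough sketch of what the DataSourceIndexer trait and the Store state helper could look like; the method names and the serde_json-based state helper are assumptions, not settled API:

use std::error::Error;

// Hypothetical trait: one implementor per data source type.
trait DataSourceIndexer {
    // Start indexing: an Ethereum implementation would drive a block
    // stream, an IPFS file implementation would read a file stream.
    fn start(&mut self) -> Result<(), Box<dyn Error>>;
    // Stop indexing when the deployment is removed or unassigned.
    fn stop(&mut self);
}

// Hypothetical Store extension: read per-data-source state back out
// of a stored JSON value.
trait DataSourceStateStore {
    fn data_source_state<T: From<serde_json::Value>>(
        &self,
        subgraph_id: &str,
        data_source_name: &str,
    ) -> Option<T>;
}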
Phase 2
- Extend the GraphQL schema for the subgraph of subgraphs to support data sources of different types, either through interfaces or unions.
- Turn data sources and mappings in SubgraphManifest into an enum, or come up with a builder-style pattern for creating DataSourceInstances from data sources.
- Make SubgraphRegistrar write different types of data sources to the store.
Does each EthereumContractInstance have its own block stream? I see two options, given the requirement that events are processed block by block, and within each block in a specific data source order.
- What we do right now, which can be described as:

for each block in blockstream:
    for each datasource in blockstream.datasources:
        process(block, datasource)
- If each data source holds a block stream, we might have to flip to:

for each datasource in datasources:
    let block = datasource.blockstream.next()
    process(block, datasource)
What we have right now with the single block stream seems better.
@Zerim What's your latest thinking about dependencies between data sources, like order and overlap in entities etc.?
I think even if entities are disjoint between data sources, it would still be very weird if data source A was at block 5,000,000 and data source B was at block 3,000,000. So you're right @leodasvacas, they can't be independent (like I was thinking). We can think of the block stream as a way of synchronizing data sources that originate from the same blockchain (kinda like a master clock).
The only tricky part here is that we need to make sure to re-scan the current block and handle new events when a dynamic data source is added (see #719). That's a problem we can solve though.
I still propose DataSourceInstance to encapsulate the processing logic for each data source type in a dedicated type. I'd even go one step further and have the default DataSourceInstance trait be stateless, extending it with a StatefulDataSourceInstance. That way some data sources can have their own state (like IPFS file streams) and others will purely rely on their blockchain adapter's stream of information.
Let me back out of the StatefulDataSourceInstance suggestion. I think it's abstracting too early.
Here's my current thinking of how we can rework the current codebase:
- Add the following types:

struct BlockStreamBuilders {
    ethereum: Arc<BlockStreamBuilder<EthereumBlockStream>>,
    ...
}

struct DataSourceAdapters {
    ethereum: Arc<EthereumAdapter>,
    ipfs: Arc<IpfsAdapter>, // once we add IPFS bulk import
}

enum DataSourceInstance {
    EthereumContract(EthereumContractDataSourceInstance),
    IpfsFile(IpfsFileDataSourceInstance), // once we add IPFS bulk import
}

The data source under SubgraphManifest also becomes an enum with the same kind of variants.
- Make SubgraphInstanceManager take a BlockStreamBuilders struct and a DataSourceAdapters struct (in addition to the Store etc.).
- When a subgraph is created (a rough sketch of this wiring follows the list):
  - Create a new SubgraphInstance with a SubgraphInstanceBuilder. Pass the builders and adapters to it, as well as a Cancelable for when the subgraph deployment is removed/unassigned.
  - Instantiate all its data sources.
  - Spawn all IPFS file data sources as standalone, cancelable Tokio tasks.
  - Collect Ethereum block stream information from data sources and, if not empty, build and spawn an Ethereum block stream, passing all events on to the SubgraphInstance.
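A minimal sketch of that creation flow, with a stub DataSourceInstance enum standing in for the real instance types:

// Illustrative only: stub enum standing in for the real instance types.
enum DataSourceInstance {
    EthereumContract, // would carry an EthereumContractDataSourceInstance
    IpfsFile,         // would carry an IpfsFileDataSourceInstance
}

fn start_subgraph(data_sources: Vec<DataSourceInstance>) {
    let mut ethereum_sources = Vec::new();
    for ds in data_sources {
        match ds {
            // IPFS file data sources are spawned as standalone,
            // cancelable Tokio tasks.
            DataSourceInstance::IpfsFile => { /* tokio::spawn(...) */ }
            // Ethereum data sources are collected so that, if any exist,
            // one shared block stream can be built and spawned below,
            // forwarding all events to the SubgraphInstance.
            DataSourceInstance::EthereumContract => ethereum_sources.push(ds),
        }
    }
    if !ethereum_sources.is_empty() {
        // build and spawn the shared Ethereum block stream here
    }
}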
Subgraph of subgraphs
In the subgraph of subgraphs, I propose the following change: Instead of
subgraphDeployment(id: ...) {
latestEthereumBlockNumber
latestEthereumBlockHash
totalEthereumBlockCount
}
organize the progress information per stream, and add a dataSources field for data sources that have their own state:
subgraphDeployment(id: ...) {
streams {
ethereum {
latestBlockNumber
latestBlockHash
totalBlockCount
}
}
dataSources { # a [DataSourceState!]! value
id
... on IPFSFileDataSource {
bytesRead
totalBytes
}
}
}
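For completeness, the server-side counterpart of that dataSources union could be modeled as a Rust enum; a minimal sketch with invented names:

// Illustrative mirror of the proposed per-data-source progress data.
enum DataSourceState {
    IpfsFile {
        id: String,
        bytes_read: u64,
        total_bytes: u64,
    },
    // further variants for other stateful data source types
}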
How to add new data source types
- Add a new variant to the data source enum under SubgraphManifest; implement writing this variant to the database as an entity.
- Implement a new DataSourceInstance enum variant for instances of the new data source type.
- If necessary, add a new BlockStreamBuilder for the new blockchain / data source type.
- In SubgraphInstanceManager, wire up data sources of the new type (illustrated below).
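As a hypothetical illustration of the first two steps, adding a made-up substrate/contract type would add one variant to each enum; all SubstrateContract names here are invented for the example:

// Stub payload types for the illustration.
struct EthereumContractDataSource;
struct IpfsFileDataSource;
struct SubstrateContractDataSource; // hypothetical new type

// 1. New variant in the manifest enum (also written to the store
//    as an entity).
enum DataSource {
    EthereumContract(EthereumContractDataSource),
    IpfsFile(IpfsFileDataSource),
    SubstrateContract(SubstrateContractDataSource),
}

// 2. A matching DataSourceInstance variant, 3. a BlockStreamBuilder for
//    the new chain if needed, and 4. wiring in SubgraphInstanceManager
//    would follow the same pattern as the Ethereum variant.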
As we were discussing on Discord, there must be a sequence of events across data sources so that a subgraph can behave deterministically. Right now this is implicitly Ethereum block numbers, but we'll need to abstract that. Before figuring out the code, it's worth abstracting the mental model.
What about this for basic principles:
- Processing within a subgraph is based on trigger events.
- All events must map to a sequence number (e.g. a timestamp, an Ethereum block number).
- A data source exposes a stream of trigger events, emitted in monotonically non-decreasing order of sequence numbers.
- Events from all data sources are processed in order of (see the sketch after this list):
  - sequence number;
  - if sequence numbers are equal, in order of data source, since data sources have a strict order implied by the manifest;
  - if both the sequence number and the data source are equal, in order of emission.
- Processing of events for a sequence number is atomic. We only commit to the store when all processing for a sequence number is done. A data source can trigger a revert to a previous sequence number, for example due to a proof-of-work reorg.
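Expressed as an ordering key, the rules above amount to a lexicographic comparison; a minimal Rust sketch with invented names:

// Sketch of the proposed total order over trigger events. The derived
// Ord compares fields in declaration order, which is exactly the
// lexicographic order described above.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct TriggerOrder {
    sequence_number: u64,   // e.g. a timestamp or Ethereum block number
    data_source_index: u32, // position of the data source in the manifest
    emission_index: u32,    // order of emission within the data source
}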
My preference would be to make data sources responsible for disjoint entity types when it comes to reading from and writing to the store inside of mappings.
I would not attempt to synchronize across data sources, nor would you need to if you restrict the mappings in the way I describe above. If the data sources represent two different blockchains, then the user should specify at query time as of which block they wish to query each data source.
One question is whether we want to support multiple transaction trigger types for the same blockchain (e.g. Solidity events, external transaction triggers, block triggers, internal transaction triggers in Ethereum) within a single data source. Synchronization is not the challenge here, since they all happen within a single block; rather, it's that the ordering of these triggers relative to one another would need to be specified.
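As a purely hypothetical illustration, specifying such an ordering could be as simple as an enum whose declaration order defines the relative priority of trigger types within a block; the particular order shown is arbitrary, not a proposal:

// Hypothetical within-block priority for Ethereum trigger types.
// The derived Ord follows declaration order.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
enum EthereumTriggerKind {
    Block,               // block triggers first
    ExternalTransaction, // then external transaction triggers
    InternalTransaction, // then internal transaction triggers
    Event,               // then Solidity events
}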
Update based on our conversation in Discord:
In the future, we could introduce the notion of compound data sources, which require that an Indexing Node interact with multiple blockchains or storage networks, and which specify an interleaving strategy.
However, in the default case, an Indexing Node should not need to index every data source in a subgraph in order to run the mappings for a single data source.