Block explorer data
Block explorer data deserves first-class support in The Graph. Not every node needs to index block explorer data, but those that want to should be able to do so efficiently. Block explorer data includes indexing:
- Blocks
- Transactions
- Accounts
- Receipts
This should be possible without having to define a subgraph specifically for this data. Rather this should be specified as a CLI argument when starting up the node.
Questions:
- Should we automatically index internal transactions?
- What options should we enable for sharding? Should we skip this for a first pass?
- How do we namespace these entity types? Should there still be GraphQL types that live in a subgraph?
@tarrencev expressed the wish to call contracts from the GraphQL client, so that he could get the latest state of a contract without having to index it. This can be done on Etherscan, so it seems reasonable to put it on the wishlist for our block explorer functionality.
Rationale / Use Cases
I think Yaniv's original description covers this. We need this for users to be able to query blockchain-intrinsic data from Ethereum (blocks, transactions, accounts, transaction receipts). We also want subgraphs to be able to reference this kind of information.
TODO: add use cases.
Requirements
- Supports at least:
  - blocks
  - transactions
  - transaction receipts
  - logs
  - accounts (with balances).
- Comes in the form of a built-in subgraph, served at /subgraphs/ethereum and /subgraphs/ethereum/graphql (GraphiQL), similar to the /subgraphs subgraph. Supports the same GraphQL API as we use for all subgraphs.
- Can be enabled with a --ethereum-subgraph command line flag for Graph Node. Disabled by default.
- All indexed Ethereum data can be referenced from other subgraphs.
- This importing approach drives subgraph indexing and replaces the block ingestor.
Proposed User Experience
Querying block explorer data
After being started with --ethereum-subgraph or ETHEREUM_SUBGRAPH=true, the Graph Node indexes the entire chain from the genesis block to the latest block, and then follows the chain as new blocks are added.
Users can access the data by going to http://localhost:8000/subgraphs/ethereum/graphql (GraphiQL) or by using http://localhost:8000/subgraphs/ethereum (or the WS alternative) in their apps. They can send queries like the following to this endpoint to query Ethereum data:
{
  blocks(where: { number_gte: 0, number_lt: 1000 }, orderBy: number) {
    hash
    transactions(orderBy: gas, orderDirection: desc) {
      hash
      receipt { ... }
    }
  }
  accounts(where: { address: "0x..." }) {
    balance
  }
}
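A client could send a query like the one above as an ordinary GraphQL-over-HTTP POST. The following is a minimal sketch that only builds the request and does not send it. It assumes the endpoint named in this proposal and a local Graph Node started with the proposed flag; the helper name `build_request` is hypothetical and not part of any Graph tooling.

```python
import json
import urllib.request

# Endpoint taken from the proposal above; assumes a local Graph Node
# started with the proposed --ethereum-subgraph flag.
ENDPOINT = "http://localhost:8000/subgraphs/ethereum"

QUERY = """
{
  blocks(where: { number_gte: 0, number_lt: 1000 }, orderBy: number) {
    hash
    transactions(orderBy: gas, orderDirection: desc) {
      hash
    }
  }
}
"""


def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build (but do not send) a GraphQL POST request. Hypothetical helper."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(QUERY)
# To actually send it against a running node:
#   response = urllib.request.urlopen(request).read()
```

The same payload shape works for any subgraph endpoint, since the proposal states that the built-in subgraph supports the same GraphQL API as all other subgraphs.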
Referencing block explorer data
From a user's perspective, whether data comes from one subgraph or another should not matter. Assuming a field owner: Account! refers to the Account entity from the built-in Ethereum subgraph, a query like
{
  domains {
    owner {
      balance
    }
  }
}
should just work™. This includes being able to introspect the Account type and any related types in the GraphQL playground.
From a subgraph developer's perspective, the main novelty is subgraph composition. Given a subgraph name or deployment ID, types from the subgraph with that name or deployment ID can be imported and referenced in the GraphQL schema using a new @import directive:
@import(
  from: {
    name: 'ethereum' # or id: 'Qm...'
  }
  as: 'Ethereum' # required prefix
)
type User @entity {
  account: Ethereum__Account!
}
Open Questions
- Should subgraph composition allow subgraph mappings to access instances of the imported entity types in the store?
  - Decision: no.
- An alternative to the @import directive would be an import comment syntax a la https://github.com/prisma/graphql-import
  - Decision: the @import directive is good.
- Referencing and querying entities across the subgraph boundary could lead to querying both subgraphs at different times/blocks. What do we do here?
  - Decision: To be solved later when we support time-travel queries.
Proposed Implementation
Graph Node
- Add a GraphQL schema for the built-in ethereum subgraph.
- Add an --ethereum-subgraph CLI flag to graph-node.
- Add an EthereumIngestor component to datasource/ethereum.
  - This component can be used as an alternative to the existing BlockIngestor, allowing our ingestion of Ethereum block explorer data to drive subgraph indexing in the same way.
  - The ingestor detects reorgs and marks blocks that are no longer part of the chain as uncled.
- When the --ethereum-subgraph CLI flag is provided, use EthereumIngestor instead of BlockIngestor.
- Because the EthereumIngestor will be writing blocks constantly when catching up, change ChainHeadUpdateListener to poll for the latest block in repeated intervals as long as the latest block we have is far enough behind the head of the chain.
- Add a new EthereumSubgraphResolver to resolve queries for the built-in ethereum subgraph.
- Update the GraphQL query execution to allow querying across different subgraphs, switching resolvers where necessary.
- Update the GraphQL servers to resolve the schema for the built-in ethereum subgraph.
- Add /subgraphs/ethereum routes to the GraphQL servers.
- Add @import validation to Graph CLI.
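The reorg handling described for the ingestor can be sketched as follows. This is an illustration, not graph-node code: `Block`, `ChainStore` and `ingest` are hypothetical names, the store is a plain in-memory dictionary, and the sketch assumes that every incoming block is the new canonical head as reported by the Ethereum node and that its ancestors have already been ingested.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Block:
    hash: str
    parent_hash: Optional[str]
    number: int
    uncled: bool = False


class ChainStore:
    """In-memory stand-in for the block tables (hypothetical)."""

    def __init__(self) -> None:
        self.blocks: Dict[str, Block] = {}
        self.head: Optional[Block] = None

    def _parent(self, block: Block) -> Optional[Block]:
        if block.parent_hash is None:
            return None
        return self.blocks.get(block.parent_hash)

    def ingest(self, block: Block) -> List[Block]:
        """Store `block` as the new head; return the blocks that were uncled."""
        self.blocks[block.hash] = block
        discarded: List[Block] = []
        old, new = self.head, block
        # Walk both branches back until they meet at the common ancestor.
        while old is not None and new is not None and old.hash != new.hash:
            if old.number >= new.number:
                # `old` is on the abandoned branch: mark it as uncled.
                old.uncled = True
                discarded.append(old)
                old = self._parent(old)
            else:
                # `new` is on the canonical branch.
                new.uncled = False
                new = self._parent(new)
        self.head = block
        return discarded
```

When a block simply extends the current head, the walk stops after one step and nothing is marked. When it arrives on a competing branch, every block between the old head and the common ancestor is marked as uncled, which matches the behaviour listed above.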
Open Questions
- When composing subgraphs, how do we make introspection work? We'd have to merge schemas basically but that may result in type conflicts.
  - Decision: use the prefix provided in the @import directive and include the whole schema using the prefix.
- Should we try piggybacking on entities for the initial pass? Or should we go straight to custom database tables? (My recommendation: try entities initially.)
  - Decision: custom tables.
- Should we go straight to custom ID fields (e.g. hash and number for blocks) or stick to id for the initial pass? (My recommendation: stick to id initially.)
  - Decision: id for the initial pass.
Proposed Documentation Updates
- Document the --ethereum-subgraph CLI flag in the graph-node README.
- Document the @import feature on https://thegraph.com/docs.
- Document the /subgraphs/ethereum endpoint and the GraphQL schema on https://thegraph.com/docs.
Proposed Tests / Acceptance Criteria
- Test the basic ingestion of all types of data using a mocked EthereumAdapter.
- Test querying the Ethereum subgraph endpoint against a store with test data.
- Test validating and building subgraphs with @import in Graph CLI.
Tasks
- [ ] Database preparation
  - [ ] Add a GraphQL schema for the built-in ethereum subgraph.
  - [ ] Define schemas for Ethereum tables.
  - [ ] Add a database migration to create the Ethereum tables.
  - [ ] Add a basic EthereumIngestor component to datasource/ethereum (no reorgs).
- [ ] Data ingestion
  - [ ] Add an --ethereum-subgraph CLI flag to graph-node.
  - [ ] Add a basic EthereumIngestor that doesn't handle reorgs yet and doesn't replace the BlockIngestor yet.
  - [ ] Allow EthereumIngestor and BlockIngestor to be used interchangeably. This may require changes to BlockStream, EthereumAdapter, BlockIngestor and ChainStore. Use EthereumIngestor instead of BlockIngestor when --ethereum-subgraph is provided.
  - [ ] Handle chain reorgs in EthereumIngestor by marking discarded blocks as uncled.
- [ ] Querying
  - [ ] Add an EthereumResolver to resolve block explorer queries.
  - [ ] Make GraphQL query execution capable of executing queries across different subgraphs, switching resolvers where necessary.
  - [ ] Load and merge schemas according to @import directives.
  - [ ] Update the GraphQL servers to resolve the schema for the built-in ethereum subgraph.
  - [ ] Add /subgraphs/ethereum routes to the GraphQL servers.
- [ ] Validation and testing
  - [ ] Add @import validation to Graph CLI, including tests.
  - [ ] Add tests for basic ingestion of all Ethereum data types using a mocked EthereumAdapter.
  - [ ] Add tests for querying block explorer data against a store with test data.
- [ ] Documentation
  - [ ] Document the --ethereum-subgraph CLI flag in the graph-node README.
  - [ ] Document the @import feature on https://thegraph.com/docs.
  - [ ] Document the /subgraphs/ethereum endpoint and the GraphQL schema on https://thegraph.com/docs.
Nice job putting this together!
My input on some of the open questions:
- Should subgraph composition allow subgraph mappings to access instances of the imported entity types in the store?
  My vote would be no. We should assume that in the long term relationships across subgraph entities will be plentiful, and requiring an Indexer to have all the subgraphs indexing (or available) in order to run the mappings for a single subgraph would be untenable. For the block explorer use case specifically, I believe we already expose much of the data in the mappings anyways.
- An alternative to the @import directive would be an import comment syntax a la https://github.com/prisma/graphql-import
  I like the directive syntax better than the comment syntax option. I presume that I could supply: from: { id: Qmsdf58... } if I wanted to reference a subgraph by ID rather than by name?
- Referencing and querying entities across the subgraph boundary could lead to querying both subgraphs at different times/blocks. What do we do here?
  I think by default all subgraphs should be queried as of the same block. Currently we only support querying as of the "latest" block but we should also support querying as of a block supplied by the user in the query. Eventually, it should be possible to specify different blocks to query as of when traversing entity relationships. Not sure what the right stop gap is for right now, but I don't think it's desirable to have state queried across different blocks w/o this intention being expressed by the user.
- When composing subgraphs, how do we make introspection work? We'd have to merge schemas basically but that may result in type conflicts.
  Could we make the as field in the @import directive supply a namespace for the entire imported subgraph schema, rather than just alias a specific type?
- Should we go straight to custom ID fields (e.g. hash and number for blocks) or stick to id for the initial pass? (My recommendation: stick to id initially.)
  I'm also a fan of making id the block hash initially, and making number a normal attribute (which could return multiple blocks if we have uncles/forked blocks in our DB).
We could make double underscore reserved in type names, so that we always have it available to prefix imported types with a namespace.
i.e.,
@import(
  from: { name: 'ethereum' }
  as: 'Eth', # Optional renaming
)
type User @entity {
  account: Eth__Account!
}
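The namespacing idea above (reserve the double underscore in local type names and use it only to prefix imported types) could be checked along these lines. This is a sketch under assumptions: none of these function names exist in Graph CLI, and schemas are reduced to plain sets of type names. Note that GraphQL itself already reserves names that begin with two underscores for introspection; a prefixed name such as Eth__Account does not begin with them.

```python
import re
from typing import Dict, Iterable, Set

SEPARATOR = "__"
# Simplified name pattern for this sketch: a letter, then letters/digits/underscores.
NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def validate_local_type(name: str) -> None:
    """Reject local type names that would collide with the import namespace."""
    if not NAME_RE.match(name):
        raise ValueError(f"invalid type name: {name!r}")
    if SEPARATOR in name:
        raise ValueError(f"{name!r}: '__' is reserved for imported types")


def namespace_types(prefix: str, imported: Iterable[str]) -> Dict[str, str]:
    """Map each imported type name to its prefixed name, e.g. Eth__Account."""
    validate_local_type(prefix)  # the prefix follows the same naming rules
    return {name: f"{prefix}{SEPARATOR}{name}" for name in imported}


def merge_type_names(local: Iterable[str], prefix: str, imported: Iterable[str]) -> Set[str]:
    """Merge local and imported type names; the prefix rules out conflicts."""
    merged: Set[str] = set()
    for name in local:
        validate_local_type(name)
        merged.add(name)
    merged.update(namespace_types(prefix, imported).values())
    return merged
```

Because local names can never contain the separator, a local Account type and an imported Eth__Account type can coexist in the merged schema without a conflict.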
Other feedback/questions:
- I'm a little bit hesitant to bake Ethereum any deeper into graph-node, with a special CLI flag, endpoint, etc. But I suppose we have a bigger refactor in store anyways, when we switch to the multi-blockchain architecture.
- Do we have an API in mind for traversing reverse relationships at query time?
{
  domains {
    owner {
      __from( name: 'ens' ) {
        ownedDomains {
          names
        }
      }
    }
  }
}
- What should the behavior be if a user runs graph-node locally and deploys a subgraph locally which references block explorer types? Should we automatically run the block explorer subgraph as well, or simply fail gracefully at query time? We will have to answer this question in the general case when we support subgraph composition as a first-class feature.
@Zerim Thanks for all the comments, I've incorporated them into the @import design and as decisions under the open questions.
About the other feedback/questions:
- I think it'll just have to be that way for now.
- I haven't thought about reverse relationships yet.
- I'd handle that gracefully by just returning null for those references.
This here causes me headaches:
Should we go straight to custom ID fields (e.g. hash and number for blocks) or stick to id for the initial pass? (My recommendation: stick to id initially.) I'm also a fan of making id the block hash initially, and make number a normal attribute (which could return multiple blocks if we have uncles/ forked blocks in our DB).
To not blow up storage, we can only store entities when they change; for time-travel queries, we need an efficient way to find the latest version of a given entity before some point in time. That's easy if we only store whatever we think the main chain is at any point in time, since blocks then have a total order. In the presence of uncled blocks, there's a bunch of detail to be worked out, and we have to carefully look at the kinds of queries we need to support for uncled blocks and see if there are simpler ways to support time-travel in the presence of uncled blocks.
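The easy case mentioned above, where blocks have a total order, can be sketched with a sorted version list and a binary search. This is only an illustration of that case and not how graph-node stores entities: `VersionedEntity` is a hypothetical name, and the sketch assumes writes arrive in increasing block order on a single main chain, with no uncled blocks.

```python
from bisect import bisect_right
from typing import Any, List, Optional


class VersionedEntity:
    """Keeps one version per change, keyed by the block number of the change."""

    def __init__(self) -> None:
        self._blocks: List[int] = []   # sorted block numbers of stored versions
        self._values: List[Any] = []   # value as of the matching block number

    def set(self, block_number: int, value: Any) -> None:
        if self._blocks and block_number < self._blocks[-1]:
            raise ValueError("writes must arrive in block order")
        if self._values and self._values[-1] == value:
            return  # unchanged since the last version: store nothing
        if self._blocks and self._blocks[-1] == block_number:
            self._values[-1] = value  # several writes within the same block
        else:
            self._blocks.append(block_number)
            self._values.append(value)

    def at(self, block_number: int) -> Optional[Any]:
        """Return the latest version at or before `block_number`, if any."""
        index = bisect_right(self._blocks, block_number)
        return self._values[index - 1] if index else None
```

With uncled blocks in the store, block numbers alone no longer identify a unique position in history, which is the part that still needs to be worked out.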
@lutter I think there are two separate questions here:
1. How to handle querying any subgraph as of a certain block (this is complicated by being able to query as of an uncle block, as you mention).
2. How to represent uncled blocks as entities in our block explorer data model.
I think we can follow my recommendation for 2, w/o it forcing a specific design on 1.
@Zerim one thing I don't understand about uncles is that they only need to have valid block headers, which means to me that that is all you can reliably query about uncled blocks. That to me means that they are not full blocks, and we should treat them as additional data attached to a block on the main chain. That wouldn't preclude us from supporting queries by block number that return information about uncles, but it does mean that uncles and blocks on the main chain are treated differently.
@lutter That's correct, unless we had seen an uncle block when it was published, or a forked block before it was reorged, we would only have header information. Which is why, for example, that's all you see on etherscan for forked blocks: https://etherscan.io/blocks_forked
There's a question as to whether if we have all the information for a block that is later forked/uncled, we should retroactively remove everything but the header to keep consistent with other uncled blocks we know of.
For very recent blocks, I'm sort of partial to keeping around as much information as possible, and then maybe pruning the remaining data when we're confident that the block will be permanently forked/uncled.
I'm concerned about the proposed implementation strategy of basically indexing all of the data contained in an archive Ethereum node. That's currently 3 TBs of data when stored as compressed RLP in key-value storage. If we store this as heavily indexed relational data, that will be over 10 TBs. That is a serious operational cost.
I'd instead suggest that this subgraph be implemented by leveraging an archive node, instead of duplicating the data from it, and expose only the queries that we can efficiently resolve through the JSON-RPC interface.
Edit: I overstated the storage because most of that probably corresponds to historical contract state, which we won't need. Still I think the tradeoffs here are worth considering, the storage required will still be an order of magnitude above even the most demanding subgraphs that currently exist.
@leoyvens I'll come up with an estimate of the storage this would occupy. A rough guess based on 10M blocks with 150 transactions each would involve maybe
- 10M block entities
- 300M transaction and transaction receipt entities
- 2M account entities (probably less)
That doesn't feel too excessive. Having this data available in the local database would mean
- better indexing performance (don't need eth_getLogs, assuming all past blocks are already ingested)
- block explorer data queries (don't have to go to Ethereum nodes)
- efficient query-time composition (don't have to go to Ethereum nodes)
The GraphQL API built into geth will take a while to make it into Parity (if it ever will). It would help with query and query-composition performance, but we can't wait for it. Unless block explorer data requires TBs of data, IMHO the benefits we get from ingesting all this data outweigh the storage cost.
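As a quick check of the arithmetic behind the rough estimate above, the snippet below multiplies out the quoted figures. The block count and the 150 transactions per block are the numbers from the comment; how the "300M" figure was meant to be split between transactions and receipts is not stated, so both readings are shown as assumptions.

```python
BLOCKS = 10_000_000        # "10M blocks" from the estimate
TXS_PER_BLOCK = 150        # "150 transactions each" from the estimate
QUOTED_ENTITIES = 300_000_000  # "300M transaction and transaction receipt entities"

# Taking the quoted per-block average literally:
transactions = BLOCKS * TXS_PER_BLOCK
receipts = transactions  # every transaction has exactly one receipt

# Per-block average implied by the quoted 300M figure, under two readings:
implied_if_transactions_only = QUOTED_ENTITIES / BLOCKS
implied_if_transactions_plus_receipts = QUOTED_ENTITIES / 2 / BLOCKS

print(f"{transactions:,} transactions at {TXS_PER_BLOCK} per block")
print(f"{implied_if_transactions_only:.0f} per block if 300M counts transactions only")
print(f"{implied_if_transactions_plus_receipts:.0f} per block if 300M counts transactions plus receipts")
```

Taken literally, 150 transactions per block gives 1.5 billion transactions, so the 300M figure corresponds to a much lower average per block. The resulting row counts differ by roughly a factor of five to ten depending on which number is closer to reality.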
I agree that blocks and accounts are something we should just ingest and have great query performance for. However I'd like to make a point that transaction receipts may be a step too far.
Right now we have all transaction receipts loaded in our Graph nodes, storing a total of 330GB. We don't need to do this and I intend to get rid of virtually all of those by doing what is described in this comment, 'Proposed Implementation' section.
This opens the question of whether we should have transaction receipts as part of the subgraph discussed in this issue. First, I'd like to separate the concept of a full block explorer subgraph from a blessed 'Ethereum subgraph' for subgraphs to compose with.
For a full block explorer, entities such as Log and Receipt are a requirement, and I believe we should eventually make it possible to build a block explorer subgraph that includes receipts, for the nodes brave enough to run those.
For the Ethereum subgraph to be widely composed with, I agree it should be featureful and fast to query, but we also need to keep indexing costs down for there to be a good supply of index nodes and low query prices. I believe the entities Account, Block and Transaction will cover 99% of the use cases, while adding receipts and logs would cover the remaining 1% of use cases for 99% of the cost.
By not including receipts, we could have every index node sync this data by default, allowing us to assume and leverage the data in the internals of graph-node.
@leoyvens I agree with that, although I expect a few of the transaction receipts fields to be crucial enough so that we have to pull them in (thinking about the gas info for instance, which is split across the tx and the tx receipt).
@leoyvens There is one argument for storing logs though: almost every subgraph today defines entities that correspond to the event types and stores the events almost 1:1.
If we can allow developers to just reference already existing event entities in the block explorer data, then that would save everyone a ton of time and work.
One aspect that slightly weakens this argument is that subgraphs typically only store a subset of events as entities, not all of them.
@Jannis Having the spent gas is fine, my concern is the logs.
The logs in a block explorer are not decoded, so having all logs is not the same thing as having a subgraph that ingests events with a proper schema.
Revised plan without subgraph composition.
Requirements
- Supports at least:
  - blocks
  - transactions
  - transaction receipts
  - internal transactions
  - logs
  - contracts (with balances)
  - accounts (with balances).
- Comes in the form of a built-in subgraph, served at /subgraphs/ethereum/mainnet and /subgraphs/ethereum/mainnet/graphql (GraphiQL), similar to the /subgraphs subgraph. Supports the same GraphQL API as we use for all subgraphs.
- Can be enabled with a --network-subgraphs ethereum/mainnet ... command line flag for Graph Node. Disabled by default.
- All indexed Ethereum data can be referenced from other subgraphs.
- This importing approach drives subgraph indexing and replaces the block ingestor.
Proposed User Experience
Querying block explorer data
After being started with --network-subgraphs ethereum/mainnet, the Graph Node indexes the entire chain from the genesis block to the latest block, and then follows the chain as new blocks are added.
Users can access the data by going to http://localhost:8000/subgraphs/ethereum/mainnet/graphql (GraphiQL) or by using http://localhost:8000/subgraphs/ethereum/mainnet (or the WS alternative) in their apps. They can send queries like the following to this endpoint to query Ethereum data:
{
  blocks(where: { number_gte: 0, number_lt: 1000 }, orderBy: number) {
    hash
    transactions(orderBy: gas, orderDirection: desc) {
      hash
      receipt { ... }
    }
  }
  accounts(where: { address: "0x..." }) {
    balance
  }
}
Referencing block explorer data
From a user's perspective, whether data comes from one subgraph or another should not matter. Assuming a field owner: Account! refers to the Account entity from the built-in Ethereum subgraph, a query like
{
  domains {
    owner {
      balance
    }
  }
}
should just work™. This includes being able to introspect the Account type and any related types in the GraphQL playground. From a subgraph developer's perspective, the main novelty is subgraph composition. This work is tracked in #1202.
Proposed Implementation
Graph Node
- Add a GraphQL schema for built-in ethereum network subgraphs.
- Add a --network-subgraphs CLI flag to graph-node.
- Add a NetworkIndexer component trait to graph.
- Add an EthereumNetworkIndexer component to datasource/ethereum.
  - This component can be used as an alternative to the existing BlockIngestor, allowing our ingestion of Ethereum block explorer data to drive subgraph indexing in the same way.
  - The indexer detects reorgs and marks blocks that are no longer part of the chain as uncled.
- For every network name passed to --network-subgraphs, an EthereumNetworkIndexer is created instead of a BlockIngestor.
- Because the EthereumNetworkIndexer will be writing blocks constantly when catching up, change ChainHeadUpdateListener to poll for the latest block in repeated intervals as long as the latest block we have is far enough behind the head of the chain.
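The polling rule in the last item can be sketched as a small loop. This is an illustration only: the function names, the `max_lag` threshold of 50 blocks and the callback signatures are all hypothetical, and what the listener does once the indexer has caught up is left out.

```python
import time
from typing import Callable


def is_catching_up(latest_ingested: int, chain_head: int, max_lag: int = 50) -> bool:
    """True while the latest ingested block is far enough behind the chain head."""
    return chain_head - latest_ingested > max_lag


def poll_until_caught_up(
    get_latest_ingested: Callable[[], int],
    get_chain_head: Callable[[], int],
    on_update: Callable[[int], None],
    interval_secs: float = 1.0,
    max_lag: int = 50,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll at a fixed interval while catching up; return the number of polls."""
    polls = 0
    while True:
        latest = get_latest_ingested()
        if not is_catching_up(latest, get_chain_head(), max_lag):
            return polls
        # One update per interval, no matter how many blocks were written since.
        on_update(latest)
        polls += 1
        sleep(interval_secs)
```

Polling at a fixed interval bounds the number of chain head updates during the initial sync, regardless of how quickly the indexer writes blocks.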
Open Questions
- Should we go straight to custom ID fields (e.g. hash and number for blocks) or stick to id for the initial pass? (My recommendation: stick to id initially.)
  - Decision: id for the initial pass.
Proposed Documentation Updates
- Document the --network-subgraphs CLI flag in the graph-node README.
- Document the /subgraphs/ethereum/mainnet endpoints and the GraphQL schema on https://thegraph.com/docs.
Proposed Tests / Acceptance Criteria
- Test the basic ingestion of all types of data using a mocked EthereumAdapter.
- Test querying the Ethereum subgraph endpoint against a store with test data.
Tasks
The plan is to implement this in different phases:
- Basic: The EthereumNetworkIndexer runs alongside BlockIngestor and only ingests blocks. The --network-subgraphs flag is supported and subgraphs can be queried via e.g. /subgraphs/ethereum/mainnet. The indexer handles reorgs by marking blocks as no longer being on the chain when they are reorg-ed out of the chain.
- Transactions: The EthereumNetworkIndexer also indexes the following data: transactions, transaction receipts and logs.
- Accounts: The EthereumNetworkIndexer also indexes the following data: internal transactions, accounts, contracts, balances.
- Ingestion: The EthereumNetworkIndexer acts as a drop-in replacement for BlockIngestor. This is activated for any network passed to --network-subgraphs.
The estimates below are in days.
Phase 1 (Basic) [~7 days, target: Nov 27]
- [x] Add a GraphQL schema for Ethereum network subgraphs (0.5)
- [x] Add --network-subgraphs CLI flag (0.5)
- [x] Add NetworkIndexer trait to graph (0.5)
- [x] Add basic EthereumNetworkIndexer without reorg support (1)
- [x] Enable EthereumNetworkIndexer for all requested networks (0.5)
- [x] Add reorg support to EthereumNetworkIndexer (1)
- [x] Add tests for indexing Ethereum blocks using a mocked EthereumAdapter (1)
- [x] Add tests for reorg handling using a mocked EthereumAdapter (1)
- [ ] Document the --network-subgraphs CLI flag in README (0.5)
- [ ] Document the --network-subgraphs and network subgraph endpoints in the docs. (0.5)
Phase 2 (Transactions) [~4.5 days, target: Dec 4]
- [ ] Add a way to specify network config files (0.5)
  - e.g. --network-subgraphs ethereum/kovan:/path/to/genesis.json
- [ ] Index transactions for the genesis block (1)
- [ ] Index transaction receipts and logs (0.5)
- [ ] Add tests for indexing transactions, receipts and logs using a mocked EthereumAdapter (1)
- [ ] Add tests for reorg handling involving transactions, receipts and logs using a mocked EthereumAdapter (1)
- [ ] Document how to pass in network config files over the CLI (0.5)
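The argument format used in the example above (a chain/network name, optionally followed by a colon and a config file path) could be parsed roughly as follows. This is a sketch, not the graph-node implementation: `NetworkSubgraph` and `parse_network_subgraph` are hypothetical names, and the exact validation rules are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NetworkSubgraph:
    chain: str                         # e.g. "ethereum"
    network: str                       # e.g. "mainnet" or "kovan"
    config_path: Optional[str] = None  # e.g. "/path/to/genesis.json"

    @property
    def route(self) -> str:
        """HTTP route under which the network subgraph would be served."""
        return f"/subgraphs/{self.chain}/{self.network}"


def parse_network_subgraph(arg: str) -> NetworkSubgraph:
    """Parse 'chain/network' or 'chain/network:/path/to/config'."""
    # Split on the first colon only, so the path itself may contain colons.
    name, colon, path = arg.partition(":")
    chain, slash, network = name.partition("/")
    if not chain or not slash or not network or "/" in network:
        raise ValueError(f"expected <chain>/<network>, got {name!r}")
    if colon and not path:
        raise ValueError(f"missing config file path in {arg!r}")
    return NetworkSubgraph(chain, network, path or None)
```

For example, parsing ethereum/kovan:/path/to/genesis.json yields the chain ethereum, the network kovan and the given config path, and the route property gives /subgraphs/ethereum/kovan.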
Phase 3 (Accounts) [~12.5 days, target: Dec 20]
- [ ] Index internal transactions in EthereumNetworkIndexer (3)
- [ ] Index accounts and contracts in EthereumNetworkIndexer (without balances) (2)
- [ ] Update balances of accounts and contracts based on transactions and internal transactions (3)
- [ ] Add tests for ingesting internal transactions using a mocked EthereumAdapter (1)
- [ ] Add tests for ingesting accounts and contracts using a mocked EthereumAdapter (1)
- [ ] Add tests for updating account/contract balances (including internal transactions) (2)
- [ ] Document how to query internal transactions, accounts, contracts, balances (0.5)
Phase 4 (Ingestion) [~5 days, target: Dec 31]
Details tbd.
@Jannis from that plan, it sounds like we're back to having raw logs in the standard Ethereum subgraph, is that the case?
@Jannis Has this project been tabled/abandoned? Right now I'm working on a project to retrieve all Uniswap transaction history for taxes, and having the ability to automatically query gas fees paid on each transaction would be incredibly valuable.
If so, is there a workaround or alternate path forward to retrieve the gas used on each transaction without having to also make separate calls to Etherscan or another API?
Thanks!
@Jannis Has this project been tabled/abandoned.
It has been paused but not abandoned permanently. Adding transaction indexing to the existing codebase wouldn't actually be that hard. The difficult part (more difficult than it looks in the plan) is having accurate account balances, because it likely will require replaying block rewards and internal transactions, which will make things extremely slow.
Thanks for the quick reply @Jannis. For our use case (and really all accounting applications), having access to the fees paid across all subgraphs is necessary.
Is there a path forward to retrieve gas fees for transactions on a given subgraph within the graph ecosystem right now?
@Jannis Are you able to share what this has been paused in favor of? This seems fundamental to the success of TheGraph. Transactions are where the action happens on chain. TheGraph is being compared as the "Google of Blockchains". Keeping with the analogy, returning only blocks is akin to Google only returning domains instead of actual webpages.
@Jannis Wanting to follow up here if there is an ETA or if there is any workaround for retrieving transaction gas fees via The Graph. I did come across this sub-graph which claims to return fees but the transactions don't return anything when querying: https://thegraph.com/explorer/subgraph/sistemico/eth-gas
Thanks!
Very disappointing we can't even get a response to a simple question here. Does not instill a lot of trust in the graph ecosystem. We will be pursuing alternatives.
Will this allow querying transactions for a specific address, with a given data parameter? Something like this:
{
  transactions(where: { to: "0x...", data: "0x1234abcd" }) {
    value
    from
    hash
  }
}
@Jannis Has this project been tabled/abandoned.
It has been paused but not abandoned permanently. Adding transaction indexing to the existing codebase wouldn't actually be that hard. The difficult part (more difficult than it looks in the plan) is having accurate account balances, because it likely will require replaying block rewards and internal transactions, which will make things extremely slow.
You don't need to maintain the merkle-patricia tree for the state, since you are consuming the data from a trusted source. So it's just EVM plus a flat state storage. Is it really extremely slow? It will be way faster than syncing a geth node in --syncmode=full.
When is this going to be live?