graph-node Block explorer data

Block explorer data deserves first-class support in The Graph. Not every node needs to index block explorer data, but those that want to should be able to do so efficiently. Block explorer data includes indexing:

Blocks
Transactions
Accounts
Receipts

This should be possible without having to define a subgraph specifically for this data. Rather this should be specified as a CLI argument when starting up the node.

Questions:

Should we automatically index internal transactions?
What options should we enable for sharding? Should we skip this for a first pass?
How do we namespace these entity types? Should there still be GraphQL types that live in a subgraph?

Aug 26 '18 00:08 yanivtal

@tarrencev expressed the wish to call contracts from the GraphQL client, so that he could get the latest state of the contract without having to index it. This can be done on Etherscan, so it seems reasonable to put it on the wishlist for our block explorer functionality.

Sep 21 '18 21:09 leoyvens

Rationale / Use Cases

I think Yaniv's original description covers this. We need this for users to be able to query blockchain-intrinsic data from Ethereum (blocks, transactions, accounts, transaction receipts). We also want other subgraphs to be able to reference this kind of information.

TODO: add use cases.

Requirements

Supports at least:
- blocks
- transactions
- transaction receipts
- logs
- accounts (with balances).
Comes in the form of a built-in subgraph, served at /subgraphs/ethereum and /subgraphs/ethereum/graphql (GraphiQL), similar to the /subgraphs subgraph. Supports the same GraphQL API as we use for all subgraphs.
Can be enabled with a --ethereum-subgraph command line flag for Graph Node. Disabled by default.
All indexed Ethereum data can be referenced from other subgraphs.
This importing approach drives subgraph indexing and replaces the block ingestor.

Proposed User Experience

Querying block explorer data

After being started with --ethereum-subgraph or ETHEREUM_SUBGRAPH=true, the Graph Node indexes the entire chain from the genesis block to the latest block, and then follows the chain as new blocks are added.

Users can access the data by going to http://localhost:8000/subgraphs/ethereum/graphql (GraphiQL) or by using http://localhost:8000/subgraphs/ethereum (or the WS alternative) in their apps. They can send queries like the following to this endpoint to query Ethereum data:

{
  blocks(where: { number_gte: 0, number_lt: 1000 }, orderBy: number) {
    hash
    transactions(orderBy: gas, orderDirection: desc) {
      hash
      receipt { ... }
    }
  }
  accounts(where: { address: "0x..." }) {
    balance
  }
}
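
As a rough illustration of how an app might talk to this endpoint, here is a minimal Python sketch that builds a GraphQL-over-HTTP request for the query above. The endpoint URL is the one proposed in this issue and only exists if the feature is enabled; build_request is a hypothetical helper name, and the request is built but not sent.

```python
import json
import urllib.request

# Endpoint as proposed in this issue (an assumption; adjust host/port as needed).
ENDPOINT = "http://localhost:8000/subgraphs/ethereum"

QUERY = """
{
  blocks(where: { number_gte: 0, number_lt: 1000 }, orderBy: number) {
    hash
    transactions(orderBy: gas, orderDirection: desc) {
      hash
    }
  }
}
"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying a GraphQL query."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(ENDPOINT, QUERY)

# To run it against a node that actually serves this endpoint:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["data"])
```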

Referencing block explorer data

From a user's perspective, whether data comes from one subgraph or another should not matter. Assuming a field owner: Account! refers to the Account entity from the built-in Ethereum subgraph, a query like

{
  domains {
    owner {
      balance
    }
  }
}

should just work™. This includes being able to introspect the Account type and any related types in the GraphQL playground.

From a subgraph developer's perspective, the main novelty is subgraph composition. Given a subgraph name or deployment ID, types from the subgraph with that name or deployment ID can be imported and referenced in the GraphQL schema using a new @import directive:

@import(
  from: {
    name: "ethereum" # or id: "Qm..."
  }
  as: "Ethereum" # required prefix
)

type User @entity {
  account: Ethereum__Account!
}

Open Questions

Should subgraph composition allow subgraph mappings to access instances of the imported entity types in the store?
- Decision: no.
An alternative to the @import directive would be an import comment syntax a la https://github.com/prisma/graphql-import
- Decision: the @import directive is good.
Referencing and querying entities across the subgraph boundary could lead to querying both subgraphs at different times/blocks. What do we do here?
- Decision: To be solved later when we support time-travel queries.

Proposed Implementation

Graph Node

Add a GraphQL schema for the built-in ethereum subgraph.
Add an --ethereum-subgraph CLI flag to graph-node.
Add an EthereumIngestor component to datasource/ethereum.
- This component can be used as an alternative to the existing BlockIngestor, allowing our ingestion of Ethereum block explorer data to drive subgraph indexing in the same way.
- The ingestor detects reorgs and marks blocks that are no longer part of the chain as uncled.
When the --ethereum-subgraph CLI flag is provided, use EthereumIngestor instead of BlockIngestor.
Because the EthereumIngestor will be writing blocks constantly when catching up, change ChainHeadUpdateListener to poll for the latest block in repeated intervals as long as the latest block we have is far enough behind the head of the chain.
Add a new EthereumSubgraphResolver to resolve queries for the built-in ethereum subgraph.
Update the GraphQL query execution to allow querying across different subgraphs, switching resolvers where necessary.
Update the GraphQL servers to resolve the schema for the built-in ethereum subgraph.
Add /subgraphs/ethereum routes to the GraphQL servers.
Add @import validation to Graph CLI.
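
The reorg handling mentioned above (marking blocks that are no longer part of the chain as uncled) could look roughly like the following Python sketch. It is not the graph-node implementation: Block and ingest are hypothetical names, the chain is an in-memory list, and the caller is assumed to have already fetched the new branch back to the common ancestor.

```python
from dataclasses import dataclass


@dataclass
class Block:
    number: int
    hash: str
    parent_hash: str
    uncled: bool = False


def ingest(chain: list, uncled: list, branch: list) -> None:
    """Attach `branch` (oldest block first) to `chain`, uncling replaced blocks.

    `chain[i]` is the block with number i on what we currently believe is the
    main chain. The internal linkage of `branch` is not validated here.
    """
    first = branch[0]
    if first.number > len(chain):
        raise ValueError("branch leaves a gap after the known chain head")
    if first.number > 0 and chain[first.number - 1].hash != first.parent_hash:
        raise ValueError("branch does not attach to the known chain")
    # Everything at or above the branch start has been reorged out.
    for old in chain[first.number:]:
        old.uncled = True
        uncled.append(old)
    del chain[first.number:]
    chain.extend(branch)


chain, uncled = [], []
ingest(chain, uncled, [
    Block(0, "0xg", "0x0"),
    Block(1, "0xa1", "0xg"),
    Block(2, "0xa2", "0xa1"),
])
# A competing branch replaces block 2 and extends the chain to block 3.
ingest(chain, uncled, [Block(2, "0xb2", "0xa1"), Block(3, "0xb3", "0xb2")])
```

After the second call, 0xa2 is kept but flagged as uncled, and the main chain ends in 0xb3.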

Open Questions

When composing subgraphs, how do we make introspection work? We'd have to merge schemas basically but that may result in type conflicts.
- Decision: use the prefix provided in the @import directive and include the whole schema using the prefix.
Should we try piggybacking on entities for the initial pass? Or should we go straight to custom database tables? (My recommendation: try entities initially.)
- Decision: custom tables.
Should we go straight to custom ID fields (e.g. hash and number for blocks) or stick to id for the initial pass? (My recommendation: stick to id initially.)
- Decision: id for the initial pass.
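
To make the namespacing decision above more concrete (include the whole imported schema under the prefix from @import), here is a small Python sketch. Schemas are modelled as plain dicts of type name to field types; namespace_schema and merge_schemas are hypothetical helpers, not part of Graph Node or Graph CLI.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def namespace_schema(schema: dict, prefix: str) -> dict:
    """Prefix every type in `schema`, including references between its types."""
    names = set(schema)

    def rename(type_ref: str) -> str:
        # Only identifiers that name an imported type get the prefix, so
        # scalars such as Bytes or BigInt are left untouched.
        return IDENTIFIER.sub(
            lambda m: f"{prefix}__{m.group(0)}" if m.group(0) in names else m.group(0),
            type_ref,
        )

    return {
        f"{prefix}__{name}": {field: rename(ref) for field, ref in fields.items()}
        for name, fields in schema.items()
    }


def merge_schemas(local: dict, imported: dict, prefix: str) -> dict:
    """Merge an imported schema into a local one, failing on name conflicts."""
    namespaced = namespace_schema(imported, prefix)
    conflicts = set(local) & set(namespaced)
    if conflicts:
        raise ValueError(f"type name conflicts: {sorted(conflicts)}")
    return {**local, **namespaced}


ethereum = {
    "Account": {"address": "Bytes!", "balance": "BigInt!"},
    "Transaction": {"hash": "Bytes!", "from": "Account!"},
}
local = {"User": {"account": "Ethereum__Account!"}}
merged = merge_schemas(local, ethereum, "Ethereum")
```

With the prefix applied to the whole imported schema, introspection can expose one merged schema without the imported Account clashing with a local type of the same name.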

Proposed Documentation Updates

Document the --ethereum-subgraph CLI flag in the graph-node README.
Document the @import feature on https://thegraph.com/docs.
Document the /subgraphs/ethereum endpoint and the GraphQL schema on https://thegraph.com/docs.

Proposed Tests / Acceptance Criteria

Test the basic ingestion of all types of data using a mocked EthereumAdapter.
Test querying the Ethereum subgraph endpoint against a store with test data.
Test validating and building subgraphs with @import in Graph CLI.

Tasks

[ ] Database preparation
- [ ] Add a GraphQL schema for the built-in ethereum subgraph.
- [ ] Define schemas for Ethereum tables.
- [ ] Add a database migration to create the Ethereum tables.
- [ ] Add a basic EthereumIngestor component to datasource/ethereum (no reorgs).
[ ] Data ingestion
- [ ] Add an --ethereum-subgraph CLI flag to graph-node.
- [ ] Add a basic EthereumIngestor that doesn't handle reorgs yet and doesn't replace the BlockIngestor yet.
- [ ] Allow EthereumIngestor and BlockIngestor to be used interchangeably. This may require changes to BlockStream, EthereumAdapter, BlockIngestor and ChainStore. Use EthereumIngestor instead of BlockIngestor when --ethereum-subgraph is provided.
- [ ] Handle chain reorgs in EthereumIngestor by marking discarded blocks as uncled.
[ ] Querying
- [ ] Add an EthereumResolver to resolve block explorer queries.
- [ ] Make GraphQL query execution capable of executing queries across different subgraphs, switching resolvers where necessary.
- [ ] Load and merge schemas according to @import directives.
- [ ] Update the GraphQL servers to resolve the schema for the built-in ethereum subgraph.
- [ ] Add /subgraphs/ethereum routes to the GraphQL servers.
[ ] Validation and testing
- [ ] Add @import validation to Graph CLI, including tests.
- [ ] Add tests for basic ingestion of all Ethereum data types using a mocked EthereumAdapter.
- [ ] Add tests for querying block explorer data against a store with test data.
[ ] Documentation
- [ ] Document the --ethereum-subgraph CLI flag in the graph-node README.
- [ ] Document the @import feature on https://thegraph.com/docs.
- [ ] Document the /subgraphs/ethereum endpoint and the GraphQL schema on https://thegraph.com/docs.

Apr 30 '19 14:04 Jannis

Nice job putting this together!

My input on some of the open questions:

Should subgraph composition allow subgraph mappings to access instances of the imported entity types in the store? My vote would be no. We should assume that in the long term relationships across subgraph entities will be plentiful, and requiring an Indexer to have all the subgraphs indexing (or available) in order to run the mappings for a single subgraph would be untenable. For the block explorer use case specifically, I believe we already expose much of the data in the mappings anyways.
An alternative to the @import directive would be an import comment syntax a la https://github.com/prisma/graphql-import I like the directive syntax better than the comment syntax option. I presume that I could supply: from: { id: Qmsdf58... } if I wanted to reference a subgraph by ID rather than by name?
Referencing and querying entities across the subgraph boundary could lead to querying both subgraphs at different times/blocks. What do we do here? I think by default all subgraphs should be queried as of the same block. Currently we only support querying as of the "latest" block but we should also support querying as of a block supplied by the user in the query. Eventually, it should be possible to specify different blocks to query as of when traversing entity relationships. Not sure what the right stop gap is for right now, but I don't think it's desirable to have state queried across different blocks w/o this intention being expressed by the user.
When composing subgraphs, how do we make introspection work? We'd have to merge schemas basically but that may result in type conflicts. Could we make the as field in the @import directive supply a namespace for the entire imported subgraph schema, rather than just alias a specific type?
Should we go straight to custom ID fields (e.g. hash and number for blocks) or stick to id for the initial pass? (My recommendation: stick to id initially.) I'm also a fan of making id the block hash initially, and make number a normal attribute (which could return multiple blocks if we have uncles/ forked blocks in our DB).

We could make double underscore reserved in type names, so that we always have it available to prefix imported types with a namespace.

i.e.,

@import(
  from: { name: "ethereum" }
  as: "Eth" # Optional renaming
)

type User @entity {
  account: Eth__Account!
}

Other feedback/questions:

I'm a little bit hesitant to bake Ethereum any deeper into graph-node, with a special CLI flag, endpoint, etc. But I suppose we have a bigger refactor in store anyway, when we switch to the multi-blockchain architecture.
Do we have an API in mind for traversing reverse relationships at query time?

{
  domains {
    owner {
      __from( name: 'ens' ) {
        ownedDomains {
          names
        }
      }
    }
  }
}

What should the behavior be if a user runs graph-node locally and deploys a subgraph locally which references block explorer types? Should we automatically run the block explorer subgraph as well, or simply fail gracefully at query time? We will have to answer this question in the general case when we support subgraph composition as a first-class feature.

Apr 30 '19 22:04 Zerim

@Zerim Thanks for all the comments, I've incorporated them into the @import design and as decisions under the open questions.

About the other feedback/questions:

I think it'll just have to be that way for now.
I haven't thought about reverse relationships yet.
I'd handle that gracefully by just returning null for those references.

May 15 '19 16:05 Jannis

This here causes me headaches:

Should we go straight to custom ID fields (e.g. hash and number for blocks) or stick to id for the initial pass? (My recommendation: stick to id initially.) I'm also a fan of making id the block hash initially, and make number a normal attribute (which could return multiple blocks if we have uncles/ forked blocks in our DB).

To not blow up storage, we can only store entities when they change; for time-travel queries, we need an efficient way to find the latest version of a given entity before some point in time. That's easy if we only store whatever we think the main chain is at any point in time, since blocks then have a total order. In the presence of uncled blocks, there's a bunch of detail to be worked out, and we have to carefully look at the kinds of queries we need to support for uncled blocks and see if there are simpler ways to support time-travel in the presence of uncled blocks.
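
To illustrate the easy case described above (only main-chain blocks, so versions have a total order), here is a minimal Python sketch. VersionedEntity is a hypothetical in-memory structure, not the store design: a version is recorded only when the value changes, and the latest version at or before a block is found with a binary search.

```python
from bisect import bisect_right


class VersionedEntity:
    """Keeps one version per change, ordered by block number."""

    def __init__(self):
        self._blocks = []  # block numbers at which the value changed, ascending
        self._values = []  # value as of the corresponding block

    def set(self, block_number: int, value) -> None:
        if self._blocks and block_number < self._blocks[-1]:
            raise ValueError("versions must be written in block order")
        if self._values and self._values[-1] == value:
            return  # unchanged: store nothing
        if self._blocks and self._blocks[-1] == block_number:
            self._values[-1] = value  # second change within the same block
            return
        self._blocks.append(block_number)
        self._values.append(value)

    def at(self, block_number: int):
        """Return the latest version at or before `block_number`, if any."""
        index = bisect_right(self._blocks, block_number)
        return self._values[index - 1] if index else None

    def version_count(self) -> int:
        return len(self._blocks)


entity = VersionedEntity()
entity.set(10, {"balance": 1})
entity.set(20, {"balance": 5})
entity.set(25, {"balance": 5})  # no change, so no new version is stored
```

This relies on block numbers identifying blocks uniquely, which is exactly what stops holding once uncled blocks are stored alongside main-chain blocks.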

May 17 '19 17:05 lutter

@lutter I think there's two separate questions here:

How to handle querying any subgraph as of a certain block (this is complicated by being able to query as of an uncle block, as you mention).
How to represent uncled blocks as entities in our block explorer data model.

I think we can follow my recommendation for 2, w/o it forcing a specific design on 1.

May 21 '19 17:05 Zerim

@Zerim one thing I don't understand about uncles is that they only need to have valid block headers, which means to me that that is all you can reliably query about uncled blocks. That to me means that they are not full blocks, and we should treat them as additional data attached to a block on the main chain. That wouldn't preclude us from supporting queries by block number that return information about uncles, but it does mean that uncles and blocks on the main chain are treated differently.
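
A data model along those lines might look like the following Python sketch, where uncles are header-only records attached to the main-chain block that includes them. Header, Block and headers_at are hypothetical names chosen for illustration, and the fields shown are only a small subset of a real header.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Header:
    hash: str
    parent_hash: str
    number: int
    miner: str


@dataclass
class Block:
    header: Header
    transactions: list = field(default_factory=list)  # full data, main chain only
    uncles: list = field(default_factory=list)        # headers only


def headers_at(blocks: dict, number: int) -> list:
    """Return the main-chain header for `number`, followed by any uncle headers."""
    result = []
    if number in blocks:
        result.append(blocks[number].header)
    # Uncles are referenced by later main-chain blocks, so scan all of them.
    for block in blocks.values():
        result.extend(u for u in block.uncles if u.number == number)
    return result


main_100 = Header("0xa100", "0xa99", 100, "0xminer1")
uncle_100 = Header("0xb100", "0xa99", 100, "0xminer2")
main_101 = Header("0xa101", "0xa100", 101, "0xminer1")

blocks = {
    100: Block(main_100, transactions=["0xtx1"]),
    101: Block(main_101, uncles=[uncle_100]),
}
```

A query by block number can then still return uncle information, while transactions and receipts exist only for main-chain blocks.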

May 21 '19 17:05 lutter

@lutter That's correct, unless we had seen an uncle block when it was published, or a forked block before it was reorged, we would only have header information. Which is why, for example, that's all you see on etherscan for forked blocks: https://etherscan.io/blocks_forked

There's a question as to whether if we have all the information for a block that is later forked/uncled, we should retroactively remove everything but the header to keep consistent with other uncled blocks we know of.

For very recent blocks, I'm sort of partial to keeping around as much information as possible, and then maybe pruning the remaining data when we're confident that the block will be permanently forked/uncled.

May 21 '19 21:05 Zerim

I'm concerned about the proposed implementation strategy of basically indexing all of the data contained in an archive Ethereum node. That's currently 3 TBs of data when stored as compressed RLP in key-value storage. If we store this as heavily indexed relational data, that will be over 10 TBs. That is a serious operational cost.

I'd suggest instead that this subgraph be implemented by leveraging an archive node rather than duplicating its data, and that it expose only the queries that we can efficiently resolve through the JSON-RPC interface.

Edit: I overstated the storage because most of that probably corresponds to historical contract state, which we won't need. Still, I think the tradeoffs here are worth considering: the storage required will still be an order of magnitude above even the most demanding subgraphs that currently exist.

Sep 09 '19 20:09 leoyvens

@leoyvens I'll come up with an estimate of the storage this would occupy. A rough guess based on 10M blocks with 150 transactions each would involve maybe

10M block entities
300M transaction and transaction receipt entities
2M account entities (probably less)

That doesn't feel too excessive. Having this data available in the local database would mean

better indexing performance (don't need eth_getLogs, assuming all past blocks are already ingested)
block explorer data queries (don't have to go to Ethereum nodes)
efficient query-time composition (don't have to go to Ethereum nodes)

The GraphQL API built into geth will take a while to make it into Parity (if it ever will). It would help with query and query-composition performance, but we can't wait for it. Unless block explorer data requires TBs of data, IMHO the benefits we get from ingesting all this data outweigh the storage cost.
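
As a quick sanity check of the rough entity counts above, the arithmetic can be written out in a few lines of Python. Only the 10M blocks, 150 transactions per block and 300M figures come from the estimate; everything else is derived from them.

```python
BLOCKS = 10_000_000
TRANSACTIONS_PER_BLOCK = 150

# Taking the stated inputs at face value.
transactions = BLOCKS * TRANSACTIONS_PER_BLOCK

# Average transactions per block implied by the 300M entity figure, if it is
# read as 300M transactions (each with one receipt).
STATED_TRANSACTION_ENTITIES = 300_000_000
implied_transactions_per_block = STATED_TRANSACTION_ENTITIES // BLOCKS

print(f"{transactions:,} transactions at {TRANSACTIONS_PER_BLOCK} per block")
print(f"{implied_transactions_per_block} per block implied by the 300M figure")
```

So 150 transactions per block would give 1.5 billion transactions, while the 300M figure corresponds to an average of about 30 per block; which of the two inputs is closer to reality changes the storage estimate by a factor of five.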

Oct 02 '19 15:10 Jannis

I agree that blocks and accounts are something we should just ingest and have great query performance for. However I'd like to make a point that transaction receipts may be a step too far.

Right now we have all transaction receipts loaded in our Graph nodes, storing a total of 330GB. We don't need to do this and I intend to get rid of virtually all of those by doing what is described in this comment, 'Proposed Implementation' section.

This opens the question of whether we should have transaction receipts as part of the subgraph discussed in this issue. First, I'd like to separate the concept of a full block explorer subgraph from a blessed 'Ethereum subgraph' for subgraphs to compose with.

For a full block explorer entities such as Log and Receipt are a requirement, and I believe we should eventually make it possible to build a block explorer subgraph that includes receipts, for the nodes brave enough to run those.

For the Ethereum subgraph to be widely composed with I agree it should be featureful and fast to query, but we also need to keep indexing costs down for there to be a good supply of index nodes and low query prices. I believe the entities Account, Block and Transaction will cover 99% of the use cases, while adding receipts and logs would cover the remaining 1% of use cases for 99% of the cost.

By not including receipts, we could have every index node sync this data by default, allowing us to assume and leverage the data in the internals of graph-node.

Oct 08 '19 19:10 leoyvens

@leoyvens I agree with that, although I expect a few of the transaction receipts fields to be crucial enough so that we have to pull them in (thinking about the gas info for instance, which is split across the tx and the tx receipt).
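
To show why the gas info needs both sides, here is a minimal Python sketch of the fee calculation under the legacy (pre-EIP-1559) fee model: the gas price lives on the transaction, the gas actually used lives on the receipt. The dataclasses and field names are hypothetical and do not reflect the proposed schema.

```python
from dataclasses import dataclass

GWEI = 10**9


@dataclass(frozen=True)
class Transaction:
    hash: str
    gas: int        # gas limit chosen by the sender
    gas_price: int  # wei per unit of gas


@dataclass(frozen=True)
class Receipt:
    transaction_hash: str
    gas_used: int   # gas actually consumed, known only after execution


def fee_paid_wei(tx: Transaction, receipt: Receipt) -> int:
    """Fee paid by the sender: gas used (receipt) times gas price (transaction)."""
    if tx.hash != receipt.transaction_hash:
        raise ValueError("receipt does not belong to this transaction")
    return receipt.gas_used * tx.gas_price


tx = Transaction(hash="0xabc", gas=90_000, gas_price=20 * GWEI)
receipt = Receipt(transaction_hash="0xabc", gas_used=21_000)
fee = fee_paid_wei(tx, receipt)
```

Using the gas limit from the transaction instead of gas_used would overstate the fee here by more than four times, which is why at least this receipt field is hard to drop.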

Oct 08 '19 21:10 Jannis

@leoyvens There is one argument for storing logs though: almost every subgraph today defines entities that correspond to the event types and serialize the events almost 1:1.

If we can allow developers to just reference already existing event entities in the block explorer data, then that would save everyone a ton of time and work.

One aspect that slightly weakens this argument is that subgraphs typically only store a subset of events as entities, not all of them.

Oct 08 '19 22:10 Jannis

@Jannis Having the spent gas is fine, my concern is the logs.

The logs in a block explorer are not decoded, so having all logs is not the same thing as having a subgraph that ingests events with a proper schema.
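
The gap between a raw log and a decoded event can be illustrated with the standard ERC-20 Transfer event. The Python sketch below decodes one by hand; the topic constant is the well-known Keccak-256 hash of Transfer(address,address,uint256), while decode_transfer and the sample log are made up for illustration. A raw log store would only hold the topics and data, and every other event type would need its own ABI to be decoded.

```python
from typing import Optional

# Keccak-256 of "Transfer(address,address,uint256)".
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


def decode_transfer(log: dict) -> Optional[dict]:
    """Decode a raw log as an ERC-20 Transfer, or return None if it is not one."""
    topics = log["topics"]
    if len(topics) != 3 or topics[0] != TRANSFER_TOPIC:
        return None
    return {
        # Indexed address parameters are left-padded to 32 bytes in the topics.
        "from": "0x" + topics[1][-40:],
        "to": "0x" + topics[2][-40:],
        # The non-indexed uint256 value is the 32-byte data field.
        "value": int(log["data"], 16),
    }


raw_log = {
    "topics": [
        TRANSFER_TOPIC,
        "0x" + "0" * 24 + "aa" * 20,
        "0x" + "0" * 24 + "bb" * 20,
    ],
    "data": "0x" + format(1000, "064x"),
}
decoded = decode_transfer(raw_log)
```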

Oct 09 '19 18:10 leoyvens

Revised plan without subgraph composition.

Requirements

Supports at least:
- blocks
- transactions
- transaction receipts
- internal transactions
- logs
- contracts (with balances)
- accounts (with balances).
Comes in the form of a built-in subgraph, served at /subgraphs/ethereum/mainnet and /subgraphs/ethereum/mainnet/graphql (GraphiQL), similar to the /subgraphs subgraph. Supports the same GraphQL API as we use for all subgraphs.
Can be enabled with a --network-subgraphs ethereum/mainnet ... command line flag for Graph Node. Disabled by default.
All indexed Ethereum data can be referenced from other subgraphs.
This importing approach drives subgraph indexing and replaces the block ingestor.

Proposed User Experience

Querying block explorer data

After being started with --network-subgraphs ethereum/mainnet, the Graph Node indexes the entire chain from the genesis block to the latest block, and then follows the chain as new blocks are added.

Users can access the data by going to http://localhost:8000/subgraphs/ethereum/mainnet/graphql (GraphiQL) or by using http://localhost:8000/subgraphs/ethereum/mainnet (or the WS alternative) in their apps. They can send queries like the following to this endpoint to query Ethereum data:

{
  blocks(where: { number_gte: 0, number_lt: 1000 }, orderBy: number) {
    hash
    transactions(orderBy: gas, orderDirection: desc) {
      hash
      receipt { ... }
    }
  }
  accounts(where: { address: "0x..." }) {
    balance
  }
}

Referencing block explorer data

From a user's perspective, whether data comes from one subgraph or another should not matter. Assuming a field owner: Account! refers to the Account entity from the built-in Ethereum subgraph, a query like

{
  domains {
    owner {
      balance
    }
  }
}

should just work™. This includes being able to introspect the Account type and any related types in the GraphQL playground. From a subgraph developer's perspective, the main novelty is subgraph composition. This work is tracked in #1202.

Proposed Implementation

Graph Node

Add a GraphQL schema for built-in ethereum network subgraphs.
Add a --network-subgraphs CLI flag to graph-node.
Add a NetworkIndexer component trait to graph.
Add an EthereumNetworkIndexer component to datasource/ethereum.
- This component can be used as an alternative to the existing BlockIngestor, allowing our ingestion of Ethereum block explorer data to drive subgraph indexing in the same way.
- The indexer detects reorgs and marks blocks that are no longer part of the chain as uncled.
For every network name passed to --network-subgraphs, an EthereumNetworkIndexer is created instead of a BlockIngestor.
Because the EthereumNetworkIndexer will be writing blocks constantly when catching up, change ChainHeadUpdateListener to poll for the latest block in repeated intervals as long as the latest block we have is far enough behind the head of the chain.

Open Questions

Should we go straight to custom ID fields (e.g. hash and number for blocks) or stick to id for the initial pass? (My recommendation: stick to id initially.)
- Decision: id for the initial pass.

Proposed Documentation Updates

Document the --network-subgraphs CLI flag in the graph-node README.
Document the /subgraphs/ethereum/mainnet endpoints and the GraphQL schema on https://thegraph.com/docs.

Proposed Tests / Acceptance Criteria

Test the basic ingestion of all types of data using a mocked EthereumAdapter.
Test querying the Ethereum subgraph endpoint against a store with test data.

Tasks

The plan is to implement this in different phases:

Basic: The EthereumNetworkIndexer runs alongside BlockIngestor and only ingests blocks. The --network-subgraphs flag is supported and subgraphs can be queried via e.g. /subgraphs/ethereum/mainnet. The indexer handles reorgs by marking blocks as no longer being on the chain when they are reorg-ed out of the chain.
Transactions: The EthereumNetworkIndexer also indexes the following data: transactions, transaction receipts and logs.
Accounts: The EthereumNetworkIndexer also indexes the following data: internal transactions, accounts, contracts, balances.
Ingestion: The EthereumNetworkIndexer acts as a drop-in replacement for BlockIngestor. This is activated for any network passed to --network-subgraphs.

The estimates below are in days.

Phase 1 (Basic) [~7 days, target: Nov 27]

[x] Add a GraphQL schema for Ethereum network subgraphs (0.5)
[x] Add --network-subgraphs CLI flag (0.5)
[x] Add NetworkIndexer trait to graph (0.5)
[x] Add basic EthereumNetworkIndexer without reorg support (1)
[x] Enable EthereumNetworkIndexer for all requested networks (0.5)
[x] Add reorg support to EthereumNetworkIndexer (1)
[x] Add tests for indexing Ethereum blocks using a mocked EthereumAdapter (1)
[x] Add tests for reorg handling using a mocked EthereumAdapter (1)
[ ] Document the --network-subgraphs CLI flag in README (0.5)
[ ] Document the --network-subgraphs flag and the network subgraph endpoints in the docs. (0.5)

Phase 2 (Transactions) [~4.5 days, target: Dec 4]

[ ] Add a way to specify network config files (0.5)
- e.g. --network-subgraphs ethereum/kovan:/path/to/genesis.json
[ ] Index transactions for the genesis block (1)
[ ] Index transaction receipts and logs (0.5)
[ ] Add tests for indexing transactions, receipts and logs using a mocked EthereumAdapter (1)
[ ] Add tests for reorg handling involving transactions, receipts and logs using a mocked EthereumAdapter (1)
[ ] Document how to pass in network config files over the CLI (0.5)
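
For illustration, values of the --network-subgraphs flag in the shape shown above (chain/network with an optional :path suffix) could be parsed as in this Python sketch. NetworkSubgraph and parse_network_subgraph are hypothetical names, and the exact flag syntax is only what this plan proposes, not a documented graph-node interface.

```python
from typing import NamedTuple, Optional


class NetworkSubgraph(NamedTuple):
    chain: str
    network: str
    config_path: Optional[str]

    @property
    def route(self) -> str:
        """HTTP route under which the network subgraph would be served."""
        return f"/subgraphs/{self.chain}/{self.network}"


def parse_network_subgraph(value: str) -> NetworkSubgraph:
    """Parse 'chain/network' or 'chain/network:/path/to/config'."""
    name, separator, path = value.partition(":")
    chain, slash, network = name.partition("/")
    if not chain or not slash or not network or "/" in network:
        raise ValueError(f"expected <chain>/<network>, got {name!r}")
    if separator and not path:
        raise ValueError("config file path is empty")
    return NetworkSubgraph(chain, network, path or None)


mainnet = parse_network_subgraph("ethereum/mainnet")
kovan = parse_network_subgraph("ethereum/kovan:/path/to/genesis.json")
```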

Phase 3 (Accounts) [~12.5 days, target: Dec 20]

[ ] Index internal transactions in EthereumNetworkIndexer (3)
[ ] Index accounts and contracts in EthereumNetworkIndexer (without balances) (2)
[ ] Update balances of accounts and contracts based on transactions and internal transactions (3)
[ ] Add tests for ingesting internal transactions using a mocked EthereumAdapter (1)
[ ] Add tests for ingesting accounts and contracts using a mocked EthereumAdapter (1)
[ ] Add tests for updating account/contract balances (including internal transactions) (2)
[ ] Document how to query internal transactions, accounts, contracts, balances (0.5)

Phase 4 (Ingestion) [~5 days, target: Dec 31]

Details tbd.

Nov 15 '19 14:11 Jannis

@Jannis from that plan, it sounds like we're back to having raw logs in the standard Ethereum subgraph, is that the case?

Nov 18 '19 13:11 leoyvens

@Jannis Has this project been tabled/abandoned? Right now I'm working on a project to retrieve all Uniswap transaction history for taxes, and having the ability to automatically query gas fees paid on each transaction would be incredibly valuable.

If so, is there a workaround or alternate path forward to retrieve the gas used on each transaction without having to also make separate calls to Etherscan or another API?

Thanks!

Feb 22 '21 20:02 lucaswalter

@Jannis Has this project been tabled/abandoned.

It has been paused but not abandoned permanently. Adding transaction indexing to the existing codebase wouldn't actually be that hard. The difficult part (more difficult than it looks in the plan) is having accurate account balances, because it likely will require replaying block rewards and internal transactions, which will make things extremely slow.
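
A heavily simplified Python sketch of what replaying a block for balances involves is shown below. It assumes the legacy fee model (the whole fee goes to the miner) and ignores uncle rewards, genesis allocations and self-destructs; Transfer and replay_block are hypothetical names. The expensive part in practice is obtaining the internal transactions, which requires tracing every transaction.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    sender: str
    recipient: str
    value: int    # wei moved from sender to recipient
    fee: int = 0  # wei paid to the miner; 0 for internal transactions


def replay_block(balances, miner: str, block_reward: int, transfers) -> None:
    """Apply one block's reward and value transfers to `balances` in place."""
    balances[miner] += block_reward
    for transfer in transfers:
        balances[transfer.sender] -= transfer.value + transfer.fee
        balances[transfer.recipient] += transfer.value
        balances[miner] += transfer.fee


balances = defaultdict(int, {"alice": 10})
replay_block(
    balances,
    miner="miner",
    block_reward=2,
    transfers=[
        Transfer("alice", "bob", value=3, fee=1),  # regular transaction
        Transfer("bob", "carol", value=1),         # internal transaction
    ],
)
```

Every block from genesis onward has to be processed this way, in order, before any balance is accurate.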

Feb 22 '21 21:02 Jannis

Thanks for the quick reply @Jannis. For our use case (and really all accounting applications), having access to the fees paid across all subgraphs is necessary.

Is there a path forward to retrieve gas fees for transactions on a given subgraph within the graph ecosystem right now?

Feb 22 '21 22:02 lucaswalter

@Jannis Are you able to share what this has been paused in favor of? This seems fundamental to the success of TheGraph. Transactions are where the action happens on chain. TheGraph is being compared as the "Google of Blockchains". Keeping with the analogy, returning only blocks is akin to Google only returning domains instead of actual webpages.

Feb 22 '21 22:02 Cooksauce

@Jannis Wanting to follow up here if there is an ETA or if there is any workaround for retrieving transaction gas fees via The Graph. I did come across this sub-graph which claims to return fees but the transactions don't return anything when querying: https://thegraph.com/explorer/subgraph/sistemico/eth-gas

Thanks!

Mar 08 '21 19:03 lucaswalter

Very disappointing we can't even get a response to a simple question here. Does not instill a lot of trust in the graph ecosystem. We will be pursuing alternatives.

Mar 14 '21 19:03 lucaswalter

Will this allow querying transactions for a specific address, with a given data parameter? Something like this:

{
  transactions(where: { to: "0x...", data: "0x1234abcd" }){
     value
     from
     hash
  }
}

Jun 16 '21 14:06 benjlevesque

@Jannis Has this project been tabled/abandoned.

It has been paused but not abandoned permanently. Adding transaction indexing to the existing codebase wouldn't actually be that hard. The difficult part (more difficult than it looks in the plan) is having accurate account balances, because it likely will require replaying block rewards and internal transactions, which will make things extremely slow.

You don't need to maintain the merkle-patricia tree for the state, since you are consuming the data from a trusted source. So it's just EVM plus a flat state storage. Is it really extremely slow? It will be way faster than syncing a geth node in --syncmode=full.

Aug 25 '21 15:08 paulperegud

When is this going to be live?

Mar 25 '22 07:03 anoushk1234