graph-node File data sources

GIP https://forum.thegraph.com/t/gip-file-data-sources/2721

Status:

[x] Manifest parsing #3743
[x] File Monitor PR #3411
[ ] DB representation of file data sources #3742
[ ] Subgraph runner and runtime
[ ] Causality region isolation
[ ] graph-cli support

Dec 15 '21 17:12 Jannis

File data sources engineering plan

This describes the components required for an implementation of file data sources backed directly by an IPFS client, and therefore non-deterministic, as described in this GIP https://forum.thegraph.com/t/gip-file-data-sources/2721. It is still uncertain whether data from multiple chains would be indexed with a total order of events. The intention of this first pass is to have the simplest possible non-deterministic implementation, so that it is not blocked by the design of the deterministic version.

File monitor

The file monitor receives requests to monitor the availability of files, and once a file becomes available it returns the file's contents through a channel. It should also notify when a file becomes unavailable. Each graph node process will instantiate one file monitor to serve all subgraphs.

For IPFS, the monitor will loop through the queue of files to check and poll them with object.stat, limited by a maximum number of requests in flight. To scale this component to a load of millions of files, multiple queues with different priority levels can be implemented so that new requests are polled more frequently than old requests.
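
The polling loop described above can be sketched roughly as follows. This is only an illustration, not the graph-node implementation: the class FileMonitor, its two queues, and the is_available callback (a stand-in for an IPFS stat request) are all hypothetical names, and availability changes are returned as a list rather than sent through a channel.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


class FileMonitor:
    """Toy file monitor: polls queued paths and reports availability changes."""

    def __init__(self, is_available: Callable[[str], bool], max_in_flight: int = 2):
        self.is_available = is_available    # hypothetical stand-in for a stat request
        self.max_in_flight = max_in_flight  # cap on checks issued per polling round
        self.new: Deque[str] = deque()      # high priority: never polled yet
        self.old: Deque[str] = deque()      # low priority: polled at least once
        self.state: Dict[str, bool] = {}    # last known availability per path

    def monitor(self, path: str) -> None:
        """Register a path; it starts in the high-priority queue."""
        self.new.append(path)

    def poll_round(self) -> List[Tuple[str, bool]]:
        """Check up to max_in_flight paths and return (path, available) changes."""
        batch: List[str] = []
        # New requests are served first; old ones fill any remaining slots.
        while self.new and len(batch) < self.max_in_flight:
            batch.append(self.new.popleft())
        while self.old and len(batch) < self.max_in_flight:
            batch.append(self.old.popleft())

        events: List[Tuple[str, bool]] = []
        for path in batch:
            available = self.is_available(path)
            # Paths are assumed unavailable until a check says otherwise.
            if self.state.get(path, False) != available:
                events.append((path, available))
            self.state[path] = available
            self.old.append(path)  # re-queue at low priority for later rounds
        return events
```

Each call to poll_round checks at most max_in_flight paths, so a large backlog of old files cannot starve newly requested ones. A real monitor would issue the checks concurrently and would probably need more than two priority levels.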

DB representation of file data sources

Dynamic file data sources need to be stored in the DB. I can think of a few options:

1. Reuse dynamic_ethereum_contract_data_source.
2. Introduce a specialized table for file data sources.
3. Introduce a new table to be used for dynamic data sources for all new blockchains, probably with a json field for network-specific fields.

Option 3 seems attractive as it would be a more general solution, but it requires a more careful design of a schema that serves all blockchains.
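
To make the general-table option more concrete, here is a minimal sketch of what such a table could look like. It is an assumption-laden illustration only: the table name dynamic_data_source, its columns, and the example kind strings are hypothetical, and an in-memory SQLite database stands in for the real database.

```python
import json
import sqlite3

# In-memory SQLite as a stand-in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dynamic_data_source (
        id INTEGER PRIMARY KEY,
        kind TEXT NOT NULL,            -- hypothetical discriminator, e.g. 'file/ipfs'
        name TEXT NOT NULL,            -- template name from the manifest
        creation_block INTEGER,        -- block at which the data source was created
        params TEXT NOT NULL           -- JSON blob of network-specific fields
    )
    """
)


def insert_data_source(kind: str, name: str, creation_block: int, params: dict) -> int:
    """Store one dynamic data source and return its row id."""
    cur = conn.execute(
        "INSERT INTO dynamic_data_source (kind, name, creation_block, params) "
        "VALUES (?, ?, ?, ?)",
        (kind, name, creation_block, json.dumps(params)),
    )
    return cur.lastrowid


def load_params(ds_id: int) -> dict:
    """Load and decode the network-specific fields for one data source."""
    row = conn.execute(
        "SELECT params FROM dynamic_data_source WHERE id = ?", (ds_id,)
    ).fetchone()
    return json.loads(row[0])
```

The common columns stay typed and queryable, while anything chain-specific (a contract address for Ethereum, a file path for a file data source) goes into params. The trade-off is that the database cannot enforce constraints on fields inside the JSON blob.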

Data source independency

Data sources and entities are now associated with an access group, which is a numeric id. An entity has the same access group as the data source that created it. Only entities from the same access group are visible to a data source mapping calling store.get. If two access groups attempt to set the same entity, that is a non-deterministic error until we figure out the ideal behaviour.

Non-file data sources all have access group 0. Each file data source gets its own access group. Group ids are represented in the cache and in the DB, to enforce access rules.
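
The access rules above can be illustrated with a small in-memory store. This is a sketch under stated assumptions, not graph-node code: GroupedStore and AccessGroupConflict are hypothetical names, and the conflict is modelled simply as an exception.

```python
from typing import Dict, Optional, Tuple


class AccessGroupConflict(Exception):
    """Raised when two access groups try to set the same entity."""


class GroupedStore:
    """Toy entity store that enforces access-group visibility."""

    def __init__(self) -> None:
        # (entity_type, entity_id) -> (access_group, data)
        self._entities: Dict[Tuple[str, str], Tuple[int, dict]] = {}

    def set(self, group: int, entity_type: str, entity_id: str, data: dict) -> None:
        key = (entity_type, entity_id)
        existing = self._entities.get(key)
        # An entity belongs to the group of the data source that created it.
        if existing is not None and existing[0] != group:
            raise AccessGroupConflict(
                f"{entity_type}/{entity_id} is owned by group {existing[0]}"
            )
        self._entities[key] = (group, data)

    def get(self, group: int, entity_type: str, entity_id: str) -> Optional[dict]:
        existing = self._entities.get((entity_type, entity_id))
        # Entities from other access groups are invisible, as if absent.
        if existing is None or existing[0] != group:
            return None
        return existing[1]
```

With this shape, a file data source in group 1 calling get on an entity written by a chain data source in group 0 sees nothing, and the reverse also holds, so neither side can observe the other's state.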

File data source execution

Because availability history cannot be reliably replayed, the modification of subgraph state by a file data source must not depend on availability history. Therefore a file data source handler should not be able to know how many times it ran in the past due to changes in availability.

A file data source is associated with a file and a handler. The handler is invoked with the file contents as soon as the file is available. Before running the handler, all entities from any previous invocation are removed.

The entities are inserted as of the block in which the data source was created. This has the nice property that if the data source is reverted, so are the entities. But this block may be earlier than the current subgraph pointer. How best to implement this, and what problems it may cause, remains to be seen.
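
A rough sketch of these execution rules follows. Everything in it is hypothetical (FileDataSource, EntityStore, run_handler, revert_to), and it ignores the open question of writing at a block earlier than the subgraph pointer. It only shows that clearing previous output makes re-runs invisible, and that tagging entities with the creation block ties them to reverts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, dict]  # (entity id, entity data)


@dataclass
class FileDataSource:
    path: str                                  # file the data source is tied to
    handler: Callable[[bytes], List[Entity]]   # mapping run on the file contents
    creation_block: int                        # block that created the data source


class EntityStore:
    def __init__(self) -> None:
        # entity id -> (source path, block, data)
        self.rows: Dict[str, Tuple[str, int, dict]] = {}

    def run_handler(self, source: FileDataSource, contents: bytes) -> None:
        # Drop everything written by earlier invocations of this data source,
        # so the result does not depend on how often the handler ran before.
        self.rows = {k: v for k, v in self.rows.items() if v[0] != source.path}
        for entity_id, data in source.handler(contents):
            # Entities are written as of the data source's creation block.
            self.rows[entity_id] = (source.path, source.creation_block, data)

    def revert_to(self, block: int) -> None:
        # Reverting past the creation block removes the entities as well.
        self.rows = {k: v for k, v in self.rows.items() if v[1] <= block}
```

Running the handler twice on the same contents leaves the store in the same state as running it once, which is the property the plan asks for.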

Estimates

Detailed estimate and task breakdown pending, current estimate 4-8 weeks.

Mar 03 '22 15:03 leoyvens

Great stuff @leoyvens!

"Before running the handler, all entities from any previous invocation are removed" -> I am not clear what this means? Why would a handler run more than once? (given content addressable content, the outcome of running a given file + handler should be deterministic?)
I think implicit in this initial proposal is that file data sources will not be part of the POI, to start with?

Mar 07 '22 10:03 azf20

@azf20

Previous invocations would occur if the file goes unavailable and then available again. Though in a non-deterministic implementation we could choose not to handle unavailability at all, which would be simpler. So it's an open question if unavailability should be monitored and handled.
I don't recall if we made a decision about that, but it makes sense to me that they should be excluded from PoI.

Mar 07 '22 11:03 leoyvens

Everything tracked in this issue is done!

Apr 10 '24 12:04 leoyvens
