ethereum-etl-airflow
Streaming smart contract events and topics (ABI processing in python instead of BQ)
Right now we have ~500 events that we parse. Every day ~500MB of log data is generated, and each event's parser scans the full log table, which sums to 250GB processed in BigQuery daily, ~7.5 TB per month, or ~$37 per month.
With 1500 events we'll spend ~$100 per month.
An alternative to parsing logs in BigQuery is to export a JSON file to GCS, download it locally in Airflow, filter all events in the dataset at once, then load the results to BigQuery (free). There is a PoC for how to parse logs in Python here: https://github.com/blockchain-etl/ethereum-etl/commit/6710e6b894789d83ce3c28db79b4ec5bf2ec3845.
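A minimal sketch of the local filtering step described above (the topic hash set and JSON field names are illustrative assumptions, not the PoC's actual code):

```python
import json

# Illustrative set of topic0 hashes (keccak-256 of the event signature)
# for events we have an ABI for; a real implementation would derive this
# set from the ABI repository.
PARSEABLE_TOPICS = {
    # Transfer(address,address,uint256)
    "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef",
}

def filter_logs(lines):
    """Yield only the logs whose first topic matches a known ABI event."""
    for line in lines:
        log = json.loads(line)
        topics = log.get("topics") or []
        if topics and topics[0] in PARSEABLE_TOPICS:
            yield log

# Filter a small batch of exported JSON log lines in one pass.
exported = [
    '{"topics": ["0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"], "data": "0x01"}',
    '{"topics": ["0xdeadbeef"], "data": "0x02"}',
]
matched = list(filter_logs(exported))
```

Because all events are matched in a single pass over the exported file, the cost no longer scales with the number of parsers the way per-event BigQuery queries do.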
Another optimisation idea is to separate parsing into 2 steps:
- `select * from logs where contract_address in (<all_contract_addresses_from_table_definitions>)` into `filtered_logs`
- parse each table from `filtered_logs`
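The first step's query could be generated from the table definitions roughly like this (the function and table names are hypothetical):

```python
def build_filter_query(contract_addresses, source="logs", dest="filtered_logs"):
    """Build the step-1 query that materializes filtered_logs from the
    contract addresses collected out of the table definition files."""
    addresses = ", ".join(f"'{a.lower()}'" for a in sorted(contract_addresses))
    return (
        f"CREATE OR REPLACE TABLE {dest} AS "
        f"SELECT * FROM {source} WHERE address IN ({addresses})"
    )

query = build_filter_query({"0xAbC123", "0xDeF456"})
```

Each per-event parser then reads only `filtered_logs`, so the per-parser scan cost drops from the full log table to just the rows belonging to known contracts.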
We are going to encounter this on other chains that support smart contracts in general, and on multiple networks that support EVM-compatible contracts specifically. Here are a few examples:
I propose that we abstract this pattern above Ethereum.
Related issues, with emphasis on stream processing: https://github.com/blockchain-etl/ethereum-etl/issues/216 https://github.com/blockchain-etl/ethereum-etl/issues/215
Another related issue for indexing: https://github.com/blockchain-etl/ethereum-etl-airflow/issues/28
Here the proposal is to cluster/partition the results in BQ to reduce IO.
I suggest we take this a step further to generally support all contracts for which we have parsing capability (i.e. those for which we have an ABI).
Specifically, at transaction processing time in the stream:
- detect contracts with a matching ABI using a registry (implementation TBD)
- process the contract method call, gathering emitted topics and events
- emit events to a sharded and denormalized set of topics, keyed by:
  - first byte of the contract address (to support streaming contract monitoring), e.g. `contract_events_0x00`
  - first byte of the event topic (to support streaming topic monitoring, such as for templated contracts), e.g. `topic_events_0x00`
- emit events to an `all_events` topic with an additional, possibly composite, key composed of the contract byte and/or event byte
- streaming insert into a new `events` table using `all_events`. This table contains a JSON blob field that is not further optimized for storage, but for which we can leverage BQ native JSON parsing functions to select data of interest with JSONPath, see: How to parse JSON in Standard SQL BigQuery?
- create a new JS file that can be included to do dynamic data marshaling from `events` as needed.
  - The JS file can be autogenerated for all available ABIs using the same technique presented by @askeluv in How to get any Ethereum smart contract into BigQuery.
  - It can be included using the `OPTIONS` clause. More detail on preparing JS for BQ inclusion is presented by @medvedev1088 in How to package a JavaScript library for use in BigQuery.
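The sharding scheme in the steps above can be sketched as follows (topic-name prefixes are taken from the examples above; the helper itself is hypothetical):

```python
def shard_topic_names(contract_address: str, topic0: str) -> tuple:
    """Derive the sharded pub/sub topic names for an event: one shard
    keyed by the first byte of the contract address, another keyed by
    the first byte of the event's topic0 hash."""
    contract_shard = contract_address[2:4].lower()  # first byte after "0x"
    event_shard = topic0[2:4].lower()
    return (f"contract_events_0x{contract_shard}",
            f"topic_events_0x{event_shard}")

names = shard_topic_names(
    "0x00b46c2526e227482e2ebb8f4c69e4674d262e75",
    "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef",
)
```

A single byte yields 256 shards per scheme, which keeps the topic count bounded while letting a consumer subscribe to only the shard containing its contract or event of interest.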
The overall effect of this design:
- make streams available for events and contracts, with a common schema
- introduce additional overhead to pub/sub and BQ access of 1 byte for sharding; the tradeoff is reduced I/O.
- introduce additional overhead of JSON data marshaling by clients of pub/sub and BQ; the tradeoff is reduced cognitive load and fewer data-discovery problems compared to a large clutter of tiny, specialized tables.
Regarding this:
streaming insert into a new events table using all_events. This table contains a JSON blob field that is not further optimized for storage, but for which we can leverage BQ native JSON parsing functions to select data of interest with JSONPath, see: How to parse JSON in Standard SQL BigQuery?
we need to think about how we process historical data for newly added ABIs, so that this data appears in the `events`
table in BQ. E.g. have a separate process to reprocess past partitions whenever a new ABI is added to the ABI repository.
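A minimal sketch of that backfill idea, assuming daily partitions and a hypothetical helper that enumerates the partitions to re-parse when a new ABI lands (the default start date, Ethereum mainnet genesis day, is an assumption):

```python
from datetime import date, timedelta

def partitions_to_reprocess(abi_added_on, history_start=date(2015, 7, 30)):
    """Enumerate the daily partitions to re-parse when a new ABI is added:
    everything from the start of history up to (but not including) the day
    the ABI landed, after which the live stream covers it."""
    day = history_start
    while day < abi_added_on:
        yield day.isoformat()
        day += timedelta(days=1)

backfill = list(partitions_to_reprocess(date(2015, 8, 2)))
```

If the downstream processing is idempotent, re-enqueueing a partition that overlaps the live stream is harmless, which simplifies the handoff between backfill and streaming.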
@sinahab @epheph @jieyilong FYI we are designing a generic solution for ABI event generation in this issue and also blockchain-etl/ethereum-etl#216
Awesome, will take a look. Thanks!
-- Jieyi Long, CTO, Theta Labs, Inc. https://www.ThetaLabs.org
Disclosure: I'm very new to this project... so pardon me if I missed or misunderstood things :)
Here are my thoughts, in no particular order, on the proposal by @allenday above. I focused more heavily on the use case I'm trying to solve for Origin Protocol since that's the one I'm the most familiar with.
- Overall, I like the idea of stream processing for blockchain events.
- The main advantage I see is that it would allow processing events with lower latency than batch processing. This could be advantageous not only for analytics use cases but also for external systems that rely on blockchain events as "triggers" to perform actions. For the Origin Protocol use case, specific events emitted by our smart contract act as triggers for several different centralized external systems (ex: search indexing, notification, analytics, ...). And freshness matters.
- One small drawback: it's not a problem for the Origin Protocol use case (and I suspect for most other use cases), but it's probably worth mentioning that ordering is not guaranteed when using pub/sub. Therefore events may get processed out of order.
- A backfill solution for processing legacy events would definitely be required. A simple one-off batch job (using ethereum-etl?) that enqueues legacy events into the relevant pub-sub topics would be sufficient I think. For Origin Protocol's use case, the data processing is idempotent. So it does not matter if the same event would get enqueued more than once (by the backfill and by the live system).
- I don't fully understand the need for having both an all_events queue and a sharded queue. From a consumer perspective, even if the sharded queue has topics sharded using the 1st byte of the contract address + first byte of the event topic, the consumer will still need to apply extra filtering on the messages it receives to determine if it should process them.
- A registry of ABIs makes sense. Origin Protocol would be happy to include the ABI for its marketplace smart contract (it's under MIT license). Perhaps to start with, it could simply be a new GitHub repository under the blockchain-etl project? Then developers/companies interested in the blockchain-etl project could add their ABIs to it.
- Regarding the events table: I am a BigQuery newbie... I'd be curious about the performance (e.g. latency of queries) of using JavaScript in BigQuery to demarshal the data stored as JSON. Though I guess if that's a concern, people could always derive their own BigQuery tables with a structured schema. And I do agree that this is a minor concern compared to the advantage of having a single table that includes all the events, as opposed to the hundreds of smaller specialized tables populated by the parsing stage of the ethereum-etl-airflow project. My argument being that it's more cumbersome to write queries against lots of small tables. For example, the Origin Protocol contract emits 16 different events and we have 2 versions of our contracts, which means 16x2 different tables to join when analyzing all the events.