RFC: Rindexer Structure Overhaul

Open jaggad opened this issue 6 months ago • 1 comments

This ticket is more of a temporary placeholder / request-for-comments whilst the scope is fleshed out. The goal is to support the higher level of detail required by some more sophisticated use-cases whilst keeping the simplicity of the black-box that is rindexer.

Super high-level early-stage thoughts here.

Proposal of additional useful info

Examples of additional transaction details that can be useful for different applications:

To accurately derive native-balance state, we need 'gas price' and 'gas used'.
Applications tracking gas statistics across a chain, for things like gas price recommendations, need this information.
Applications tracking detailed wallet information, such as nonces for invalidation, or more "pro user" metadata.
Indexers that need to sort transactions intra-block for some use-case may require the tx index, not just the log index.
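
As a rough illustration of the native-balance point above, here is a minimal sketch of the amount debited from a sender for one mined transaction. The function name and the use of u128 (rather than a 256-bit integer type) are assumptions made to keep the example dependency-free. gas_used is taken as an input here; in standard JSON-RPC it is reported on the transaction receipt.

```rust
/// Wei debited from the sender of a mined transaction: the transferred value
/// plus the fee paid (gas used multiplied by the price paid per unit of gas).
/// Hypothetical helper; u128 is an assumption to avoid a 256-bit integer crate.
/// Returns None on arithmetic overflow instead of wrapping.
fn sender_debit_wei(value: u128, gas_used: u64, effective_gas_price: u128) -> Option<u128> {
    let fee = effective_gas_price.checked_mul(u128::from(gas_used))?;
    value.checked_add(fee)
}
```

For example, sending 1 ether with 21,000 gas used at 30 gwei debits 1 ether plus a fee of 630,000,000,000,000 wei.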

Note: even more information is available, and we could perhaps make it all transparently available to Rust-code users, but for the no-code Postgres-enabled use-case we should be more reserved and make it opt-in.

struct AdditionalTxDetails {
    nonce: String,
    gas: String,
    max_fee_per_gas: String,
    max_priority_fee_per_gas: String,
    value: String,
    gas_price: String,
    transaction_index: String,
    block_timestamp: String,
    transaction_hash: String // To correlate with the logs themselves
}

How we could get this cheaply for live-indexing

Right now we poll cached_provider.get_latest_block, which calls self.provider.get_block(BlockId::Number(BlockNumberOrTag::Latest)) under the hood. If we call .get_block().full() instead, we will receive all of this information essentially for free: it is the same number of CU per call and effectively the same network time.

This means near-zero latency impact on live-indexing for tx timestamp, gas, nonce and more. Assuming a polling period shorter than the block time, we would feasibly never even have to do a "fill in request" for a missed block. For correctness, though, it would be important to query any blocks between Last Seen and Latest to get their metadata as well.

+ fn map_block_data(last_seen_block_number: u64, latest_block: Block) -> HashMap<TxHash, AdditionalTxDetails> {
+   if last_seen_block_number + 1 == latest_block.block_number {
+       // Next block, no need to fill in request  
+       // ... map over txs in blocks and convert to HashMap
+   } else {
+      //  ... fetch any missed blocks (shouldn't happen much) and then map into the HashMap
+   } 
+ }

let latest_block = cached_provider.get_latest_block().await;
+ let additional_details: HashMap<TxHash, AdditionalTxDetails> = map_block_data(last_seen_block_number, latest_block);

if let Err(e) = tx.send(Ok(FetchLogsResult {
      logs,
      from_block,
      to_block,
+    additional_details
})) {
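
To make the fill-in logic above more concrete, here is a minimal, dependency-free sketch. The Block, Tx and TxDetails types are simplified stand-ins (assumptions, not rindexer or provider types), and missed_blocks and index_block are hypothetical helper names. It shows how the range of missed blocks could be computed and how one full block could be folded into the per-transaction map.

```rust
use std::collections::HashMap;
use std::ops::RangeInclusive;

// Simplified stand-ins for the provider's full block and transaction types
// (hypothetical; the real types carry many more fields).
struct Tx {
    hash: String,
    nonce: u64,
    gas_price: u128,
    transaction_index: u64,
}

struct Block {
    number: u64,
    timestamp: u64,
    transactions: Vec<Tx>,
}

#[derive(Debug, PartialEq)]
struct TxDetails {
    nonce: u64,
    gas_price: u128,
    transaction_index: u64,
    block_number: u64,
    block_timestamp: u64,
}

/// Blocks strictly between the last seen block and the latest block, i.e. the
/// ones that still need a fill-in request. Returns None when the latest block
/// is the direct successor of the last seen block, or nothing new has arrived.
fn missed_blocks(last_seen: u64, latest: u64) -> Option<RangeInclusive<u64>> {
    if latest > last_seen.saturating_add(1) {
        Some(last_seen + 1..=latest - 1)
    } else {
        None
    }
}

/// Adds every transaction of `block` to `out`, keyed by transaction hash.
fn index_block(block: &Block, out: &mut HashMap<String, TxDetails>) {
    for tx in &block.transactions {
        out.insert(
            tx.hash.clone(),
            TxDetails {
                nonce: tx.nonce,
                gas_price: tx.gas_price,
                transaction_index: tx.transaction_index,
                block_number: block.number,
                block_timestamp: block.timestamp,
            },
        );
    }
}
```

In this shape, the poller would call index_block for the latest block, and additionally for each block fetched from the missed_blocks range when that range is not None.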

How we can get this for backfill?

Backfills are typically rarer events and a once-off cost. However, there is some room here to get this for free. For example:

In the case where native-transfers are enabled we must call every single block for all enabled chains anyway, so we already would have this information at no extra cost anywhere native-transfers are enabled (in large part because native-transfers must incur a much larger backfill speed-cost compared with log indexing).

That also means we could take advantage of it and attach all these extra details.

Imagining an overly-simplified config like:

native_transfers: true # index native token transfers. Already much-much slower backfills for native-transfers specifically.
# and/or
with_additional_details: true # index block timestamps, transaction gas prices, gas used, nonces, and more. At the cost of much slower backfills for every event.

So in these cases we have a process like:

Change debug_traceBlockByNumber to eth_getBlockByNumber [full], since we get everything we need there, and more cheaply (20 CU vs 40 CU). There is no benefit to using the debug method from what I can tell, and eth_getBlockByNumber is more widely and consistently supported and has all the extra timestamp information.
Scan the logsBloom from the eth_getBlockByNumber response manually and call getLogs for this block if the bloom indicates a possible match.
Abstract the fetch_logs_stream so that if this "native-transfer" indexing (name can be made more generic) is enabled, we stream logs from each block handled in the native-transfer block-by-block indexing rather than from the eth_getLogs process. This means a slightly more abstracted distinction/modularization needs to be made between:

The source of event logs
The processor/consumer of event logs

This distinction does exist, but might need to be neatened up for this proposal to work.
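
For the logsBloom scan mentioned in the steps above, a minimal sketch of the membership check could look like the following. It assumes the caller has already computed the keccak256 hash of the address or topic being looked for (the standard library has no keccak, so a hashing crate would be needed for that part). The function names are hypothetical. The bit layout follows Ethereum's 2048-bit log bloom: three bit positions taken from the first three byte pairs of the hash.

```rust
/// Bit positions (0..=2047) selected by the first three big-endian byte pairs
/// of a keccak256 hash, each reduced to its low 11 bits.
fn bloom_bits(item_hash: &[u8; 32]) -> [usize; 3] {
    let mut bits = [0usize; 3];
    for (i, bit) in bits.iter_mut().enumerate() {
        let pair = u16::from_be_bytes([item_hash[2 * i], item_hash[2 * i + 1]]);
        *bit = usize::from(pair & 0x07ff);
    }
    bits
}

/// Returns true if the block's logs bloom may contain the hashed item.
/// False means the item is definitely absent, so getLogs can be skipped.
fn bloom_may_contain(bloom: &[u8; 256], item_hash: &[u8; 32]) -> bool {
    bloom_bits(item_hash)
        .iter()
        // Bit 0 is the least significant bit of the last byte of the bloom.
        .all(|&bit| (bloom[255 - bit / 8] & (1u8 << (bit % 8))) != 0)
}

/// Sets the three bloom bits for a hashed item (handy for building test input).
fn bloom_insert(bloom: &mut [u8; 256], item_hash: &[u8; 32]) {
    for bit in bloom_bits(item_hash) {
        bloom[255 - bit / 8] |= 1u8 << (bit % 8);
    }
}
```

A true result only means "possibly present", so the getLogs call for that block is still what decides whether there are matching logs.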

The other case is where "native-transfers" are not enabled. In those cases it would potentially be feasible to combine the existing eth_getLogs-style fast-indexing, with its much more efficient bloom filter log skipping, with opt-in additional_details, which would do a multicall of eth_getBlockByNumber for every matching block in the set of logs we get.
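
A small sketch of how the set of blocks to fetch could be derived from a page of logs. Reading "multicall" as a batched JSON-RPC request is an assumption here, and block_batches and batch_size are hypothetical names. The idea is to deduplicate block numbers first, so each matching block is fetched once no matter how many logs it contains.

```rust
use std::collections::BTreeSet;

/// Distinct block numbers referenced by a set of logs, in ascending order,
/// split into groups of at most `batch_size` (one group per batched request).
/// A `batch_size` of 0 is treated as 1.
fn block_batches(log_block_numbers: &[u64], batch_size: usize) -> Vec<Vec<u64>> {
    // BTreeSet removes duplicates and yields the numbers in sorted order.
    let unique: BTreeSet<u64> = log_block_numbers.iter().copied().collect();
    let ordered: Vec<u64> = unique.into_iter().collect();
    ordered
        .chunks(batch_size.max(1))
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

The number of extra requests then scales with the number of distinct matching blocks rather than with the number of logs.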

More thought required here. I just feel like we're starting to get into more advanced use-cases with debug/trace indexing and block-by-block indexing with full tx details. Something in the rindexer indexing process specifically, I think, needs to be adjusted so these can be switched out optimally.
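
One possible shape for that switch, sketched with hypothetical names and a synchronous interface for brevity (the real fetch_logs_stream is a stream, and none of these types are rindexer's): the consumer depends only on a small trait, and the eth_getLogs path and the block-by-block path are two interchangeable implementations.

```rust
/// Simplified log record (hypothetical stand-in).
#[derive(Debug, Clone, PartialEq)]
struct Log {
    block_number: u64,
    log_index: u64,
}

/// Anything that can produce the logs for an inclusive block range.
trait LogSource {
    fn logs_in_range(&self, from_block: u64, to_block: u64) -> Vec<Log>;
}

/// Stand-in for the eth_getLogs path: a flat list of already filtered logs.
struct FilteredLogSource {
    logs: Vec<Log>,
}

/// Stand-in for the block-by-block path: logs grouped by block number.
struct BlockScanSource {
    blocks: Vec<(u64, Vec<Log>)>,
}

impl LogSource for FilteredLogSource {
    fn logs_in_range(&self, from_block: u64, to_block: u64) -> Vec<Log> {
        self.logs
            .iter()
            .filter(|log| (from_block..=to_block).contains(&log.block_number))
            .cloned()
            .collect()
    }
}

impl LogSource for BlockScanSource {
    fn logs_in_range(&self, from_block: u64, to_block: u64) -> Vec<Log> {
        self.blocks
            .iter()
            .filter(|(number, _)| (from_block..=to_block).contains(number))
            .flat_map(|(_, logs)| logs.iter().cloned())
            .collect()
    }
}

/// The consumer sees only the trait, so either source can be plugged in.
/// Returns (block_number, log_index) pairs in ascending order.
fn consume(source: &dyn LogSource, from_block: u64, to_block: u64) -> Vec<(u64, u64)> {
    let mut keys: Vec<(u64, u64)> = source
        .logs_in_range(from_block, to_block)
        .iter()
        .map(|log| (log.block_number, log.log_index))
        .collect();
    keys.sort();
    keys
}
```

With this split, choosing between the two strategies becomes a matter of which LogSource is constructed from the config, while the processing side stays unchanged.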

Jun 11 '25 08:06 jaggad

down for this - for sure defo lower in the pile of focus for us now that said we should spec this out for sure as some really good thoughts above

Jun 13 '25 08:06 joshstevens19