elixir-omg Multiple Childchain state storage

NET-62

As a child chain operator, I want to have redundancy in how my child chain's state is stored so that I can perform upgrade without any service interruptions.

Risk

requiring a lot of engineering effort to be spent in rearchitecting the state storage

Value

allowing for 0 downtime upgrade, hence better service availability. (Currently upgrading takes about 2-5 minutes of downtime)
reducing risk of failure and loss of transactions
able to keep production services running across cloud provider's zone

Jul 15 '19 14:07 Pongch

Work needed to Scale Child Chain

Block Queue OMG.ChildChain.BlockQueue (how we form a block before submission)
Fees OMG.ChildChain.FeeServer (how much fees we're charging currently loads via file, unsalable)
OMG.State.Core (centralized repo of the current state)

On the Child Chain's part there's a statefull procedure ~validation~ that happens when a transaction is submitted (http post("/transaction.submit", Controller.Transaction, :submit)).

@spec submit(transaction :: binary) ::
          {:ok, %{txhash: Transaction.tx_hash(), blknum: pos_integer, txindex: non_neg_integer}}
          | {:error, submit_error()}
  def submit(transaction) do
    with {:ok, recovered_tx} <- Transaction.Recovered.recover_from(transaction),
         {:ok, fees} <- FeeServer.transaction_fees(),
         fees = Fees.for_tx(recovered_tx, fees),
         {:ok, {tx_hash, blknum, tx_index}} <- State.exec(recovered_tx, fees) do
      {:ok, %{txhash: tx_hash, blknum: blknum, txindex: tx_index}}
    end
    |> result_with_logging()
  end

State.exec/2 is the nitty gritty.

@doc """
  Checks (stateful validity) and executes a spend transaction. Assuming stateless validity!
  """
  def handle_call({:exec, tx, fees}, _from, state) do
    case Core.exec(state, tx, fees) do
      {:ok, tx_result, new_state} ->
        {:reply, {:ok, tx_result}, new_state}

      {tx_result, new_state} ->
        {:reply, tx_result, new_state}
    end
  end

goes into

@spec can_apply_spend(state :: Core.t(), tx :: Transaction.Recovered.t(), fees :: Fees.fee_t()) ::
          true | {{:error, exec_error()}, Core.t()}
  def can_apply_spend(state, %Transaction.Recovered{} = tx, fees) do
    outputs = Transaction.get_outputs(tx)

    with :ok <- validate_block_size(state),
         {:ok, input_amounts_by_currency} <- correct_inputs?(state, tx),
         output_amounts_by_currency = get_amounts_by_currency(outputs),
         :ok <- amounts_add_up?(input_amounts_by_currency, output_amounts_by_currency),
         :ok <- transaction_covers_fee?(input_amounts_by_currency, output_amounts_by_currency, fees) do
      true
    else
      {:error, _reason} = error -> {error, state}
    end
  end

Validating block size -if the number of transactions in block exceeds limit then {:error, :too_many_transactions_in_block} Needs to be locked while a stateful validation is in progress.

Checking correction of input positions -if the input is from the future block then {:error, :input_utxo_ahead_of_state} -if the input does not exists then {:error, :utxo_not_found} -if the owner of input does not match with spender then {:error, :unauthorized_spent} UTXOs need to come from the up to date repo

Checking if the amounts from the provided inputs adds up. - if not then {:error, :amounts_do_not_add_up} If in child chain server tx submission pipeline): see if the transaction pays the correct fee. - if not then {:error, :fees_not_covered}

Findings:

Synchronized UTXO access

OMG State – Child Chain UTXO access

As it currently stands the functional core state logic is being shared between both modes when you’re running a Watcher or when you’re running a Child Chain. For us to move forward, having multiple Child Chain servers, we need to move all UTXOs into a centralized repository where only Child Chains can access and serve them to Watcher.

The change for the above is rather significant and moving forward, it might be better for us to split this into two distinct processes that share some functionality via a shared module repository (as we do now with apps/omg).

Utxo’s are moved away from process state for exiting utxos (State.exit_utxos) a. Questions: Is there an upper limit for determining a change on UTXO? That upper limit is our moving performance. b. Lock Approach https://hexdocs.pm/ecto/Ecto.Multi.html#content https://www.postgresql.org/docs/9.4/explicit-locking.html

def exit_utxos(exiting_utxos, %Core{utxos: utxos} = state) do
    _ = if exiting_utxos != [], do: Logger.info("Recognized exits #{inspect(exiting_utxos)}")

    {valid, _invalid} = validities = Enum.split_with(exiting_utxos, &utxo_exists?(&1, state))

    db_updates = valid |> Enum.map(fn utxo_pos -> {:delete, :utxo, Utxo.Position.to_db_key(utxo_pos)} end)

    new_state = %{state | utxos: Map.drop(utxos, valid)}

    {:ok, {db_updates, validities}, new_state}
  end

Problems with lock approach: If each Child Chain server has it’s own Ethereum client, the events detecting exiting UTXOs will duplicate.

Deliver once with MQ layer with ack.
Store Ethereum height for which the events were processed already. Check DB -> discard or process. Option 2 might be preferable.

Which operations fiddle with UTXO's and need to be worked on:

Core.exec/2
Core.exit_utxos/1
Core.deposit/2 Core.utxo_exists?/2 ExitProcessor.check_validity()

Jul 25 '19 17:07 InoMurko

Hey, wanted to drop some quick notes on this. Let's talk perhaps to get into the details:

Fees OMG.ChildChain.FeeServer (how much fees we're charging currently loads via file, unsalable)

I think this is totally puntable. Let's a/ split it to a separate story b/ solve it in a minimal/dumb way at first.

Block Queue OMG.ChildChain.BlockQueue (how we form a block before submission)

What are your thoughts on this? Gut feel tells me, it should be pretty straightforward, after the OMG.State bit is resolved.

OMG.State.Core (centralized repo of the current state)

A couple of points on this one

It might slightly touch the Abstract Layer Design bit, so let's keep in sync
User story says "without any service interruptions" so I'm assuming we're going full-steam, and we do want to persist the mempool whenever we want to switch the "master" chlid chain. With this assumption, maybe, there's another option to consider - let's ensure the persistence (or consistency) of the mempool without touching the store:

Here's the idea. Let's have the incoming transactions go through a MQ, that would ensure they never get lost and get played to the children chains in the same order, in case sth happens. You could then still be able to manage the UTXO set locally. Whenever the "master" chain drops a new block into the DB, the "slave" ones will recreate UTXO set form there, by playing the txs. If "master" goes down, the new "master" will play txs form the MQ.

Every child chain should then be able to handle its Ethereum-driven changes to the UTXO set independently!

Jul 30 '19 17:07 pdobacz

Did you have a fanout exchange model in mind? I was thinking about the MQ path as well, there's a few things to consider:

You would need to build a MQ publishing layer that picks everything sent to /transaction.submit and puts it on a exchange.
every Child Chain node then has a queue on the MQ and picks all the messages from there
there might be differences between Child Chains processing speed/faults,latency hence this is not a strongly consistent solution

the "slave" ones will recreate UTXO set form there, by playing the txs Could you elaborate this one?

Jul 30 '19 17:07 InoMurko

Did you have a fanout exchange model in mind?

Yeah, that fits, but I'm not super fluent with these models, so sorry about hand-waviness.

the "slave" ones will recreate UTXO set form there, by playing the txs Could you elaborate this one?

So my thinking is that a slave child chain should "prefer" to take the transactions from a block (either Watcher-style via block.get or by reading from the shared data store), rather than play them from the MQ. Then the txs in the MQ are only actively applied by the master chch. The slave chchs would discard the pending txs from the MQ, if they see a new block being published by master. The pending txs in the MQ are only really used by slaves when master goes out.

Does this make sense? My goal is here to the streamline the flow of txs, so that always the block formed by master and present in the DB is "more important"

An additional note on the Ethereum events processing bit: probably the slaves would need to keep an eye on how the master processes those events, for consistent view. But this would be rather simple to have using our current RootChainCoordinator!

I'm thinking about the UTXO set being in a shared repo with locks, it's possible, but a bit scary. Keeping that local is an advantage we're enjoying now without even thinking about it.

Jul 31 '19 08:07 pdobacz

I like @pdobacz's idea for simplicity. I can see a few drawbacks (or only misunderstanding on my side)

How we can provision another instance of ch-ch? We'd need to pick one of the alive slaves, freeze it (make sure it won't consume its queue) for a some time, clone its lvldb files and its queue, start a new instance and unfreeze the first one... Not sure how complex it is from devops side.
How to assign new master Who picks and how to pick new master if current is down? We would need some endpoint to call /youre_master_now to tell slave to start acting as master. We'd need to redirect rpc traffic to the chosen one... (anything else?)

Jul 31 '19 10:07 pnowosie

To expand on 2., @pnowosie, which is bothering me as well. The problem you stated and the solution to it is missing a very important detail. A computer that is down cannot call /youre_master_now.

A mechanism we could facilitate is to use https://en.wikipedia.org/wiki/Raft_(computer_science) (an alternative to Paxos) that would allow us to maintain a consistent view and detect failures in the cluster. My initial thinking tells me that this would allow us to keep the same storage model as now, but facilitate RAFT as a distributed consensus mechanism (replicating the state of utxos in the cluster if you wish).

Riak Core is also one we could research, though it has weak consistency issues/features.

Jul 31 '19 11:07 InoMurko

@pnowosie

How we can provision another instance of ch-ch? We'd need to pick one of the alive slaves, freeze it (make sure it won't consume its queue) for a some time, clone its lvldb files and its queue, start a new instance and unfreeze the first one...

Hm, this is not how I pictured this. All slaves would be in sync, at most 1 block behind master. Except for this 1 block, they'd be exact replicas + they'd have the transactions in their queues identical with what master is processing.

To expand on 2., @pnowosie, which is bothering me as well.

Yeah, me too. So either some consensus mechanism to figure out downage or just the master broadcasts "I am alive"s to slaves. When they don't hear it, they take over (they have some hierarchy). This is just my hand-waving and wishful thinking though.

:thinking: Hm, isn't there a standard/tools for such slave-master interactions?

Jul 31 '19 13:07 pdobacz

🤔 Hm, isn't there a standard/tools for such slave-master interactions?

Yes, that's RAFT :)

Correction: RAFT is a bit more. But RAFT maintains a master-slave topology by voting and election and maintenance, which is what you need in a distributed system.

Jul 31 '19 13:07 InoMurko

Store Ethereum height for which the events were processed already.

Would this be fine in case of re-org? Would it be better we get the eth tx of such event and check the txId?

Jul 31 '19 14:07 boolafish

A few clarifications that arose in private discussions that I'll just post here: This is strictly a child chain mechanism, because neither RAFT or Paxos are decentralised distributed algorithms, they affectively don't account for BAD actors. Raft is not BFT. The way RAFT works is that in an odd number of nodes in a cluster, it elects a leader and maintains a leader until that leader "works" or gets "voted out". The cluster maintains a state that participants agreed upon. The state is backed locally, does not need replication because RAFT achieves it by consensus. RAFT is a Master-slave replication algorithm, if a slave get's a request it "forwards it to the master" (if that request is a write) reads can be served from slaves.

Now if we don't want to go down the "RAFT" road, electing in a distributed system that creates a master/slave topology are called "Leader Election Algorithms", there are many we can choose from. One of the simplest we get from using the beam is the principle of distributed applications. Read more in the lovely erlang docs, here: http://erlang.org/doc/design_principles/distributed_applications.html

Jul 31 '19 14:07 InoMurko

After discussion we discovered a few critical requirements that proposed solution has to meet.

System cannot lose transactions, once accepted tx has to be included in future block(s)
All accepted txs have to be propagated in the same order to all nodes (block content)
One master node decides when to cut block and publish block hash to the root chain

From the above we can conclude that critical components of the system are:

shared, locked DB with formed blocks (only!), lock is held by master,
AMQP (fanout exchange model) that ensures message ordering for all txMQ queues.

slaves catch up via blocks in DB, txMQ ensures txs absent in formed blocks are available to the slaves if master goes down. This solution also results in consistent UTXO set distributed trough nodes, however its not critical, because slave in inconsistent UTXO state can recreate from scratch using blocksDB.

Genuine UTXO set is a result of appling all txs from blocksDB and is formed by master node. Master is responsible for forming next block from txMQ, publishing block to root chain (and ensures it gets mined), informing slave nodes about formed block (e.g. blknum, tx_count_included, so they can check they're consistent).

There can be only one master at a specified point of time, no 2+ nodes can write to txDB at the same time or send new blocks to root chain.

This is it that comes to my mind after the recap, please fill free to correct and extend.

Aug 02 '19 07:08 pnowosie

I would correct your critical requirements: One master node decides when to cut block and publish block hash to the root chain It is true that only one node has to cut a block and publish it, but that does not mean it needs to be the master node. Firstly, because publishing a block needs to be an atomic operation, and if forming a block means only addition of transactions (we're moving the mempool to the DB, correct?), then it does not matter who sends off the block, as long as it happens successfully only once. What we forgot at this point is that a block needs to be published. eventually! This means that there needs to be a mechanism in place that will ensure this actually happens in a reasonable timeframe. We could actually spend some time to figure out how soon and how much time we have to delay the operation in case of unplanned node behavior.

Next thing: AMQP (fanout exchange model) that ensures message ordering for all txMQ queues. The problematic part here is that in case master fails, AMQP is not able to give you guarantees that any other slave has the same picture as master previously had. This statement txMQ ensures txs absent in formed blocks are available to the slaves if master goes down will not hold, because AMQP alone is not ensuring consistency between nodes. Sure they can eventually get a TX (or not) but that does not mean it'll be in the same state as master was when it broke.

This is only a critique of your comment. I'll expand on a proposal as well.

Aug 02 '19 19:08 InoMurko

AMQP is not able to give you guarantees that any other slave has the same picture as master previously had ... AMQP alone is not ensuring consistency between nodes. Sure they can eventually get a TX (or not) but that does not mean it'll be in the same state as master was when it broke.

I think the proposition doesn't rely on that AMQP ensures consistency with master - it rather is this:

slaves catch up via blocks in DB

So the primary source of txs for any non-master is the DB. txMQ is really only ever read after master went down.

Aug 05 '19 08:08 pdobacz

It is true that only one node has to cut a block and publish it, but that does not mean it needs to be the master node.

This is by design, it is not an requirement.

In this proposition single point of truth is DB. If you wish you can allow every node prepare blocks then one which first commits it to the DB (or even on root chain?) wins. It's possible but makes no sense.

Instead we want master node have clear responsibilities and all other slaves are there to sync and be ready to take over master role. I think it's less important if slaves syncs from AMQP or waits for new block appear in DB, or some are not consistent with master. More important is no node is able to publish an unknown block hash to root chain.

Aug 05 '19 08:08 pnowosie

@pnowosie please address two other points: what role does MQ have? What happens when master fails with trasanctions in the mempool before the block was formed and published?

Aug 05 '19 08:08 InoMurko

However I think I can refine the above points to these:

System cannot lose transactions, once accepted tx has to be included in future block(s) (#774)
No node is ever able to publish an unknown block hash to root chain.

Aug 05 '19 08:08 pnowosie

what role does MQ have?

It serves as mempool for non-master node in case master will crash. But when master is able to commit a new block to the DB it becomes official new state for the entire cluster.

If no block is committed to the DB, and there is a suspicion that master died - other assigned master node catches with all blocks in DB, then applies txs from AMQP and forms new block.

(It is easy to say so, but we really don't want that the assumed down old-master recovers and start to serve its role further ;) )

Aug 05 '19 09:08 pnowosie

what happens with transactions not yet included in a block? those that currently live in

defp add_pending_tx(%Core{pending_txs: pending_txs, tx_index: tx_index} = state, %Transaction.Recovered{} = new_tx) do
    %Core{
      state
      | tx_index: tx_index + 1,
        pending_txs: [new_tx | pending_txs]
    }
  end

Aug 05 '19 09:08 InoMurko

what happens with transactions not yet included in a block?

Good point, it somehow slipped out of the picture. To ensure (1) we need to put received txs in fanout exchange so all nodes will received it. It's not perfectly clear to me how we're attached to the transaction's acceptance.

We have 2 options

Incoming tx comes to the master node, it applies tx on the state and if it's valid publishes it to fanout exchange and responds to the client (it is immediately known tx was accepted)
Tx first comes to the fanout exchange, then master is consuming its queue, applies tx on state, once valid tx will end up in new block. Here sender does not immediately know whether it was accepted, she knows it was received and tx's acceptance results from inclusion in the block.

Aug 05 '19 10:08 pnowosie

what happens with transactions not yet included in a block?

Wait, I think there might be a misunderstanding lurking here.

slave would not allow the txs from txMQ to go through to submit at all, unless they are becoming master at this instant. If a slave continues as slave it only reads txs from DB block. Once slave sees a new block from master it cleanses its txMQ from those txs.

Aug 05 '19 12:08 pdobacz

Slave consumers would leave TXs on the MQ broker?

Aug 05 '19 12:08 InoMurko

Slave consumers would leave TXs on the MQ broker?

yeah, or anywhere in the pipeline, where it would make sense, but before hitting the slave app (i.e. the OMG.State bit, where the pending_txs resides).

Aug 05 '19 12:08 pdobacz

If that's the case I don't see any benefit of having another two layers of (fairly complicated for our use case) infrastructure.

Nginx <-> (our consumer/producer) <->(Rabbit MQ) <-> (master/slaves child chains)

RabbitMQ cannot be used (or it's at least fairly unusal) as a reliable storage for high TPS.

Aug 05 '19 13:08 InoMurko

We have 2 options

    1. Incoming tx comes to the master node, it applies tx on the state and if it's valid publishes it to fanout exchange and responds to the client (it is immediately known tx was accepted)

    2. Tx first comes to the fanout exchange, then master is consuming its queue, applies tx on state, once valid tx will end up in new block. Here sender does not immediately know whether it was accepted, she knows it was received and tx's acceptance results from inclusion in the block.

... does that mean we have two MQ layers? If not, how does routing work? What happens if TX that is sent to fanout after master processed it cannot be applied on a slave node?
Would mean chaning the API? Is this something we can do? This does not address loosing transactions while they live on master node, correct?

I would go with either of these three approaches:
simply putting UTXOs in shared PG instance, together with append only log of transactions and an eventual block forming mechanism of these (atomic transaction-> take last 1000 transactions, form a block, publish, if succeeds, delete the transactions if not, release the lock). PG locks are being passed between nodes. It's fairly certain that highly concurrent API hits would resolve to timeouts - to avoid this one would need a retry mechanism.
_You don't need master/slave mechanism._

With master/slave
All nodes accept transactions but always forwards them to the master node, that writes in a shared PG instance. Master node does everything until a new master node gets elected. 
_This one might be the most sensible_. If the master nodes fails over, a new master get's elected and can continue from the shared PG.

_With RAFT and no shared PG instance._ Little to no infrastructure changes (comparing with others) but a fairly big chunk of reading and coding.

Aug 05 '19 13:08 InoMurko

Just one thing I want to mention: It'd be good if the approach taken here took into account the protocol extensions that might require merklization of the state (that is - the UTXO set or related data):

checkpointing (see here for background)
account exits (see here)
PoS constructions

Merklization in this context means effectively maintaining a merkle root of the data-structure, allowing users to prove the existence of specific state.

Not sure if this is a big deal or not, but it would be good to keep at the back of the head, so just throwing this into the air here.

Oct 24 '19 15:10 pdobacz

https://github.com/omisego/elixir-omg/pull/1066 is a small experiment I did, related to this topic here.

Oct 24 '19 15:10 pdobacz

elixir-omg elixir-omg copied to clipboard

Multiple Childchain state storage

elixir-omg
elixir-omg copied to clipboard