cardano-db-sync

`consumed_by_tx_id` on `ma_tx_out`

Open saghen opened this issue 1 year ago • 6 comments

As far as I'm aware, it's currently not possible to get all unspent multi asset transaction outputs for a specific multi asset without doing one of:

  • Scanning all rows of ma_tx_out for a given multi asset joined with tx_out filtered on consumed_by_tx_id
  • Scanning all consumed_by_tx_id IS NULL on tx_out joined with ma_tx_out filtered for a given multi asset
  • Creating a materialized view

The first approach ends up being far more performant, but still leads to slow queries for multi assets with many transaction outputs. Here's an example for the /assets/:asset/addresses query from Blockfrost:

SELECT
  txo.address AS "address",
  SUM(quantity) AS "quantity"
FROM ma_tx_out mto
  JOIN multi_asset ma ON (mto.ident = ma.id)
  JOIN tx_out txo ON (txo.id = mto.tx_out_id)
  JOIN tx ON (tx.id = txo.tx_id)
WHERE txo.consumed_by_tx_id IS NULL
  AND (encode(ma.policy, 'hex') || encode(ma.name, 'hex')) = $1
  -- don't count utxos that are part of transaction that failed script validation at stage 2
  AND tx.valid_contract = 'true'
GROUP BY txo.address
ORDER BY MIN(tx.id) ASC
LIMIT 100;

Would it make sense to include a consumed_by_tx_id field on ma_tx_out alongside the existing field on tx_out for this case?
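
For illustration, here is roughly what the query above could look like with such a field (a sketch only; ma_tx_out.consumed_by_tx_id does not exist in the current schema):

```sql
-- Sketch: assumes a hypothetical consumed_by_tx_id column on ma_tx_out,
-- mirroring the existing column on tx_out.
SELECT
  txo.address AS "address",
  SUM(mto.quantity) AS "quantity"
FROM ma_tx_out mto
  JOIN multi_asset ma ON (mto.ident = ma.id)
  JOIN tx_out txo ON (txo.id = mto.tx_out_id)
  JOIN tx ON (tx.id = txo.tx_id)
WHERE mto.consumed_by_tx_id IS NULL  -- filter on ma_tx_out before the wide joins
  AND (encode(ma.policy, 'hex') || encode(ma.name, 'hex')) = $1
  AND tx.valid_contract = 'true'
GROUP BY txo.address
ORDER BY MIN(tx.id) ASC
LIMIT 100;
```

A partial index such as `ON ma_tx_out (ident) WHERE consumed_by_tx_id IS NULL` could then serve the unspent-per-asset lookup directly, instead of scanning all of an asset's historical outputs.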

saghen avatar Nov 29 '24 21:11 saghen

Would it help to have an optional tx_out.ma_tx_out field where all multi assets are stored? It could be stored as json, using the jsonb Postgres type, to allow other, more fine-grained queries. I'm not quite sure how feasible or easy this would be to implement.
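
For a rough sense of what that could look like (purely hypothetical; neither the multi_assets column nor its layout exists in the schema), a jsonb variant might be queried like:

```sql
-- Hypothetical: assumes tx_out.multi_assets is jsonb shaped like
--   [{"policy": "<hex>", "name": "<hex>", "quantity": "123"}, ...]
SELECT txo.address,
       SUM((ma_elem->>'quantity')::numeric) AS quantity
FROM tx_out txo
CROSS JOIN LATERAL jsonb_array_elements(txo.multi_assets) AS ma_elem
WHERE txo.consumed_by_tx_id IS NULL
  AND ma_elem->>'policy' = $1
GROUP BY txo.address;
```

A GIN index on such a column would support containment tests (@>), but per-policy filtering plus aggregation over jsonb is unlikely to match a dedicated table with a btree index.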

kderme avatar Nov 30 '24 09:11 kderme

That would avoid a JOIN on the majority of our tx_out queries as well, since we almost always get the multi assets, so that sounds great!

saghen avatar Nov 30 '24 16:11 saghen

Would it help to have an optional tx_out.ma_tx_out field where all multi assets are stored? It could be stored as json, using the jsonb Postgres type, to allow other, more fine-grained queries. I'm not quite sure how feasible or easy this would be to implement.

As per my understanding, that would likely not work well for querying. ma_tx_out is already over 120 GB and growing rapidly.

Adding it to tx_out would negatively impact tx_out performance itself (essentially requiring drastically higher tuning parameters and, in turn, more resources, especially IOPS).

Not to mention, the indexing you'd need to filter tx_out by assets in this case would add tremendous cost for the query planner.

A better alternative would be adding a similar consumed (or spent) field to ma_tx_out, as part of the tx_out consumed_by_tx_id config preset. Having said that, looking a few years down the line, projects should really start splitting a full-history instance (useful for explorers and projects that show historical tx details, including input addresses) from utxo-only instances, which are far more performant for this kind of querying. One can use bootstrap to initially sync db-sync without tx_out/ma_tx_out and then, once synced, switch to pruning utxos. A combination of the above would really help users' query performance.

CC: @sgillespie - since you're working on this, please consider the feasibility of a spent field instead of the approach above.

rdlrt avatar Dec 07 '24 01:12 rdlrt

Adding a new updatable field to ma_tx_out can be very slow, since this table is huge and updating is already quite slow for tx_out.

The db-sync schema is quite normalised and requires joins. Denormalising it by merging tables can help for specific queries that fetch all the MAs from a tx or output. It's true that queries based on MA policy won't work as well, or won't work at all (we could try jsonb or Postgres arrays, but it would be worse than the current solution). So it really depends on which queries are frequent. If users have both queries based on policy and queries based on tx/output, I guess the current schema would be ideal; or we could even allow both an MA table and an MA field in tx_out, at the cost of more disk.

One can use bootstrap to initially sync db-sync without tx_out/ma_tx_out and then, once synced, switch to pruning utxos.

We want to use the bootstrap idea more, even for live stake, live voting power etc. The consumed option is a compromise: users still get the full history, but it's also relatively fast to query the current state.

By the way, prompted by this issue and other older ones, I've considered the possibility of porting db-sync to MongoDB, which provides better support for denormalised data.

kderme avatar Dec 09 '24 17:12 kderme

So it really depends on which queries are frequent. If users have both queries based on policy and queries based on tx/output, I guess the current schema would be ideal; or we could even allow both an MA table and an MA field in tx_out, at the cost of more disk.

From Koios queries (ordered by frequency over the past 20 days):

  • policy => addresses (along with their balance)
  • asset => utxos
  • asset => addresses (along with their balance)
  • asset => info (current balance and meta info)
  • asset => txs

The change above will likely not make sense for many of these, especially if ma_tx_out stays and we're 'adding' more data to tx_out. The current layout at least keeps tx_out away from the vast spam of ma_tx_out entries - hence, I mentioned it's a bad idea to change that (or to add an option, complicating things further; we hardly have enough test data coverage across the various options as it is).

We want to use the bootstrap idea more, even for live stake, live voting power etc. The consumed option is a compromise: users still get the full history, but it's also relatively fast to query the current state.

Agreed (sorry, my next statement will be off-topic for this issue). I'd say that's the preset that makes the most sense (the missing part is starting db-sync with the bootstrap option and, once on tip, automatically stopping ledger snapshots and starting to prune). It would also potentially be useful to make the pruning frequency configurable, since deletes can take a while: some may prefer more frequent deletes, while others may prefer waiting for one prune per epoch.

rdlrt avatar Dec 18 '24 07:12 rdlrt

For anyone interested in trying this out, you can use either pg_ivm or Postgres triggers to create an enriched ma_tx_out table. Notably, I believe my implementation depends on db-sync inserting the ma_tx_out row after the tx_out row. If that's not acceptable for you and you find a solution, please share it!

ma_tx_out_enriched transaction
BEGIN;

CREATE TABLE ma_tx_out_enriched (
  id bigint PRIMARY KEY,
  tx_out_id bigint,
  quantity word64type,
  ident bigint,

  tx_id bigint,
  consumed_by_tx_id bigint,
  inline_datum_id bigint
);

-- Create trigger functions
-- Function to handle ma_tx_out changes
CREATE OR REPLACE FUNCTION refresh_ma_tx_out_enriched_ma_tx_out()
RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    INSERT INTO ma_tx_out_enriched (id, tx_out_id, quantity, ident, tx_id, consumed_by_tx_id, inline_datum_id)
    SELECT 
      NEW.id,
      NEW.tx_out_id,
      NEW.quantity,
      NEW.ident,
      tx_out.tx_id,
      tx_out.consumed_by_tx_id,
      tx_out.inline_datum_id
    FROM tx_out
    WHERE tx_out.id = NEW.tx_out_id;
    
  ELSIF TG_OP = 'UPDATE' THEN
    UPDATE ma_tx_out_enriched
    SET
      tx_out_id = NEW.tx_out_id,
      quantity = NEW.quantity,
      ident = NEW.ident,
      -- refresh the denormalised tx_out columns too, in case tx_out_id changed
      tx_id = tx_out.tx_id,
      consumed_by_tx_id = tx_out.consumed_by_tx_id,
      inline_datum_id = tx_out.inline_datum_id
    FROM tx_out
    WHERE ma_tx_out_enriched.id = NEW.id AND tx_out.id = NEW.tx_out_id;
    
  ELSIF TG_OP = 'DELETE' THEN
    DELETE FROM ma_tx_out_enriched WHERE id = OLD.id;
  END IF;
  
  RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

-- Function to handle tx_out changes
CREATE OR REPLACE FUNCTION refresh_ma_tx_out_enriched_tx_out()
RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    -- Update existing ma_tx_out records that reference this new tx_out
    UPDATE ma_tx_out_enriched 
    SET
      tx_id = NEW.tx_id,
      consumed_by_tx_id = NEW.consumed_by_tx_id,
      inline_datum_id = NEW.inline_datum_id
    WHERE tx_out_id = NEW.id;
    
  ELSIF TG_OP = 'UPDATE' THEN
    -- Update consumed_by_tx_id for all related ma_tx_out records
    UPDATE ma_tx_out_enriched 
    SET
      tx_id = NEW.tx_id,
      consumed_by_tx_id = NEW.consumed_by_tx_id,
      inline_datum_id = NEW.inline_datum_id
    WHERE tx_out_id = NEW.id;
    
  ELSIF TG_OP = 'DELETE' THEN
    -- Remove records that reference the deleted tx_out
    DELETE FROM ma_tx_out_enriched WHERE tx_out_id = OLD.id;
  END IF;
  
  RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

-- Create triggers
CREATE TRIGGER trg_ma_tx_out_enriched_ma_tx_out
  AFTER INSERT OR UPDATE OR DELETE ON ma_tx_out
  FOR EACH ROW
  EXECUTE FUNCTION refresh_ma_tx_out_enriched_ma_tx_out();

CREATE TRIGGER trg_ma_tx_out_enriched_tx_out
  AFTER INSERT OR UPDATE OR DELETE ON tx_out
  FOR EACH ROW
  EXECUTE FUNCTION refresh_ma_tx_out_enriched_tx_out();

-- Populate with existing data
INSERT INTO ma_tx_out_enriched (id, tx_out_id, quantity, ident, tx_id, consumed_by_tx_id, inline_datum_id)
SELECT
  ma_tx_out.id,
  ma_tx_out.tx_out_id,
  ma_tx_out.quantity,
  ma_tx_out.ident,
  tx_out.tx_id,
  tx_out.consumed_by_tx_id,
  tx_out.inline_datum_id
FROM ma_tx_out
  JOIN tx_out ON tx_out.id = ma_tx_out.tx_out_id;

-- Add indexes
CREATE INDEX ma_tx_out_enriched_tx_out_id ON ma_tx_out_enriched(tx_out_id);
CREATE INDEX ma_tx_out_enriched_ident_consumed_by_tx_id_tx_out_id ON ma_tx_out_enriched(ident, consumed_by_tx_id, tx_out_id);

COMMIT;

The included indices are designed to support querying current UTxOs, as well as UTxOs at a specific point in time. For the latter case, you'll want to make two queries (consumed_by_tx_id > your_tx_id unioned with consumed_by_tx_id IS NULL). Postgres seems to fall back to filtering on the index, rather than using an index condition, if you try to do consumed_by_tx_id > your_tx_id OR consumed_by_tx_id IS NULL in a single query. I've been getting the max tx_id for a specific time via:

SELECT MAX(tx.id) as "end_tx!"
FROM tx
WHERE tx.block_id = (
    SELECT id
    FROM block
    WHERE time < YOUR_DATE AND tx_count > 0 AND block_no IS NOT NULL
    ORDER BY time DESC
    LIMIT 1
)
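
Putting the two halves together, the point-in-time lookup against ma_tx_out_enriched can then be written as a union (a sketch; :asset_ident and :end_tx are placeholders for the asset's ident and the max tx id found above):

```sql
-- Outputs holding the asset that existed at :end_tx and were unspent then:
-- either spent strictly after the cutoff, or never spent at all.
SELECT tx_out_id, quantity
FROM ma_tx_out_enriched
WHERE ident = :asset_ident
  AND tx_id <= :end_tx
  AND consumed_by_tx_id > :end_tx
UNION ALL
SELECT tx_out_id, quantity
FROM ma_tx_out_enriched
WHERE ident = :asset_ident
  AND tx_id <= :end_tx
  AND consumed_by_tx_id IS NULL;
```

Each branch pairs an equality on ident with a range or NULL test on consumed_by_tx_id, so both can use the (ident, consumed_by_tx_id, tx_out_id) index as an index condition rather than a post-scan filter.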

Edit: You may also include the address in the ma_tx_out_enriched table, but you may want to use the { "tx_out": { "use_address_table": true } } option to avoid ballooning your DB size.

saghen avatar Jul 07 '25 20:07 saghen