
transparently refresh query store on pruning error

S11001001 opened this issue 3 years ago • 4 comments

In the DB update case (i.e. internally stored offsets), if an offset lookup fails with FAILED_PRECONDITION (because the ledger has been pruned past it), then under read-committed isolation:

  1. Begin
  2. Fetch all offsets for the template ID, including unowned
  3. Let P = all saved offset parties for template ID
  4. If request parties ⊇ P, skip all subsequent party-set checks
  5. Let pruned-P = the transitive closure of
    1. request parties ∪
    2. the (signatories ∪ observers) ∩ P of contracts visible to pruned-P
  6. Likewise, pruned-P ⊇ P simplifies the following checks.
  7. Delete all contracts of tpid where (pruned-P ∩ P ⊇ (signatories ∪ observers) ∩ P) (all contracts of tpid, if (4) or (6))
    1. This condition appears redundant, when we could just check whether pruned-P ∩ (signatories ∪ observers) is non-empty. Its purpose is to deal with two race conditions:
      1. P expanded after (3) (TODO SC not quite)
      2. contracts were added after (5) that would expand pruned-P
  8. Compare-and-delete offsets for (template, pruned-P). Deleting offsets unconditionally is not concurrent-safe; we may have inserted new contracts that have not been deleted!
    1. If compare-and-delete fails for any pruned-P, commit and restart from (1) with pruned-P as the new request parties.
      1. TODO SC this is not quite good enough
  9. Commit
  10. Ensure that there are no DB offsets for (tpid, request parties); if there are, restart from (1)
    1. TODO SC this is livelock-y, doesn't "monotonically advance time" like all our other concurrent update loops; might be able to drop it. A proper (8i) check should preclude it
  11. The observation of (tpid, request parties) should now be pruning-clean; re-fill from ACS+txns and proceed as normal
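The closure in steps 4–5 is a fixed point: pruned-P appears on both sides of its own definition, so it must be computed iteratively. A minimal sketch in Python, using plain sets and dicts as a hypothetical model of the contract table (this is not the actual query-store code, which is Scala over SQL):

```python
def pruned_party_closure(request_parties, contracts, saved_parties):
    """Fixed-point computation of pruned-P (steps 4-5 above):
    start from the request parties and repeatedly add any saved
    offset parties (saved_parties = P) that are stakeholders of a
    contract already visible to the current set."""
    pruned = set(request_parties)
    changed = True
    while changed:
        changed = False
        for c in contracts:
            stakeholders = c["signatories"] | c["observers"]
            if pruned & stakeholders:  # contract visible to pruned-P
                new = (stakeholders & saved_parties) - pruned
                if new:
                    pruned |= new
                    changed = True
    return pruned

# Hypothetical example: Alice's pruning pulls in Bob (co-stakeholder
# on c1), and Bob's visibility pulls in Carol (co-stakeholder on c2).
contracts = [
    {"signatories": {"Alice"}, "observers": {"Bob"}},
    {"signatories": {"Bob"}, "observers": {"Carol"}},
]
closure = pruned_party_closure({"Alice"}, contracts, {"Alice", "Bob", "Carol"})
```

This chaining is exactly why approach (1) below needs "all the extra calculation": a single party's stale offset can transitively invalidate cached slices of many other parties.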

I believe this may complicate some concurrent-update scenarios for those updates; I am still considering them.
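The compare-and-delete in step 8 amounts to optimistic concurrency on the offset rows: delete only if the row still holds the value read in step 2, and treat a zero-row result as a signal to retry. A sqlite3 sketch with a hypothetical `offsets` schema (the real query store's tables differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE offsets (tpid TEXT, party TEXT, last_offset TEXT,"
    " PRIMARY KEY (tpid, party))")
conn.execute("INSERT INTO offsets VALUES ('Iou', 'Alice', '0005')")

def compare_and_delete(conn, tpid, party, expected_offset):
    """Step 8: delete the offset row only if it still holds the value
    we read earlier; a concurrent updater may have advanced it in the
    meantime, in which case the DELETE matches no row."""
    with conn:  # one transaction, committed on exit
        cur = conn.execute(
            "DELETE FROM offsets"
            " WHERE tpid = ? AND party = ? AND last_offset = ?",
            (tpid, party, expected_offset))
    return cur.rowcount == 1  # False => restart from step 1
```

A plain unconditional DELETE here would silently discard offsets written by a concurrent update loop, leaving freshly inserted contracts in the table with no offset to account for them, which is the race the step warns about.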

S11001001 avatar Apr 25 '22 15:04 S11001001

Will assign this to @S11001001 for now, because he already has an ongoing PR for this (ignoring the fact that it has been open for more than two weeks already).

realvictorprm avatar May 06 '22 11:05 realvictorprm

I want to step back a bit and clarify our performance requirements. We have three basic approaches with three levels of complexity:

  1. Prune the bare minimum slices of the ACS. This is what #13529 moves towards and what is described in this issue description.
  2. Prune the whole template ID. Whereas party visibility slices overlap (hence all the extra calculation described for (1)), template ID subsets of the contract table never overlap.
  3. Prune the whole contract table. Obviously the simplest, but with the same implications that resetting the query store entirely would have if the user did it manually.
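Approach (2) avoids all of the party-set computation above: because template-ID slices of the contract table never overlap, pruning can drop the whole slice and its offsets in one transaction and let the next query re-fill it from the ACS. A sqlite3 sketch with a hypothetical two-table schema, not the query store's actual DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contracts (contract_id TEXT PRIMARY KEY, tpid TEXT, payload TEXT);
CREATE TABLE offsets (tpid TEXT, party TEXT, last_offset TEXT,
                      PRIMARY KEY (tpid, party));
INSERT INTO contracts VALUES ('c1', 'Iou', '{}'), ('c2', 'Iou', '{}'),
                             ('c3', 'Account', '{}');
INSERT INTO offsets VALUES ('Iou', 'Alice', '0005'), ('Iou', 'Bob', '0007'),
                           ('Account', 'Alice', '0005');
""")

def prune_template(conn, tpid):
    """Approach 2: on a pruning error for tpid, drop the entire
    template-ID slice and every offset tracking it, atomically, so
    the next query sees no cached state and re-fills from the ACS."""
    with conn:
        conn.execute("DELETE FROM contracts WHERE tpid = ?", (tpid,))
        conn.execute("DELETE FROM offsets WHERE tpid = ?", (tpid,))

prune_template(conn, "Iou")
```

The cost is over-pruning: Bob's perfectly valid Iou cache is discarded along with Alice's stale one, which is why larger ACSes and more frequent pruning push toward approach (1).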

Keep in mind that query store pruning can happen at unexpected times. It is perfectly possible for a prune to happen on the ledger, for many query-store updates and queries then to succeed, filling more of the contracts table, and only some time later for a request to run into the invalidated offset, triggering the query store pruning.

Each of the following factors suggests we should have more complexity in our pruning approach, i.e. an earlier item in the above list:

  • more frequent pruning (@cocreature mentions that we expect 1 day to 6 months between ledger prunes, if used at all)
  • more parties with different query-endpoint access patterns
  • less overlap between parties' contract visibility
  • larger ACSes

S11001001 avatar May 18 '22 19:05 S11001001

@S11001001 If this is still blocked, what is it blocked by?

ray-roestenburg-da avatar Sep 12 '22 08:09 ray-roestenburg-da

@ray-roestenburg-da Discussing and determining which of the three approaches in the preceding comment has the right complexity/concurrency tradeoffs.

S11001001 avatar Sep 12 '22 15:09 S11001001

I implemented approach 2 in https://github.com/digital-asset/daml/pull/17167

raphael-speyer-da avatar Nov 06 '23 00:11 raphael-speyer-da