erigon icon indicating copy to clipboard operation
erigon copied to clipboard

seg: set stale files not visible until recalcVisibleFiles done

Open stevemilk opened this issue 1 year ago • 6 comments

stevemilk avatar Oct 21 '24 09:10 stevemilk

FYI: seems i trying something similar: https://github.com/erigontech/erigon/pull/12380

AskAlexSharov avatar Oct 21 '24 10:10 AskAlexSharov

FYI: seems i trying something similar: #12380

@AskAlexSharov Got it. one thing special in this PR is i try to keep closeAndRemoveFiles thread safe without lock, not sure if it's within your next step 's plan, so explain a bit:

I add an indicator stale in visibleSegment struct. After recalcVisibleSegments, unused old visible segments will be marked as stale. closeAndRemoveFiles happens only when refcount==0 && stale && src.canDelete. This can thread safe without lock because new rotx will be unable to read stale visible Segments.

stevemilk avatar Oct 21 '24 14:10 stevemilk

I don’t understand how it can be called from multiple goroutines.

In my head 2parts 1: // recalcVisibleFiles do skip CanDelete==true // means new RoTx can’t see it

2: newValue := refcnt.AtomicDecrement() If newValue == 0 { closeAndDel() }

Atomics guarantee that all goroutines (and there can’t be new readers because point 1).

AskAlexSharov avatar Oct 22 '24 03:10 AskAlexSharov

what I'm thinking here is exactly point1, canDelete==true seems not enough.

assuming two consecutive readers read segX during merge:

  1. merge step1: integrateMergedDirtyFiles marks segX as canDelete = true
  2. rotx_1 read, refCnt=1
  3. rotx_1 closing, refCnt=0 and canDelete = true
  4. rotx_2 read, refCnt=1, because recalcVisibleFiles has not been called yet, segX is still visible
  5. rotx_1 closeAndRemove segX
  6. rotx_2 access nil segX
  7. merge step2: recalcVisibleFiles, segX is removed from visible list

stevemilk avatar Oct 22 '24 07:10 stevemilk

rotx_1 read, refCnt=1 and rotx_2 read, refCnt=1 - "read atomics" it's kind-of anti-pattern. Need do "compare_and_swap" things on them: we do leftReaders := idx.readAheadRefcnt.Add(-1)

merge itself is also reader Aggregator.mergeLoopStep does aggTx := a.BeginFilesRo() (i'm not sure if RoSnapshots follow this practice - i need re-check)

You right - we can make sure that we set canDelete.Store(true) after recalcVisibleFiles:

a.integrateMergedDirtyFiles(outs, in) // can don't set `canDelete`
a.recalcVisibleFiles(a.DirtyFilesEndTxNumMinimax()) 
a.cleanAfterMerge(in) // set `canDelete`

(it happening inside a.BeginFilesRo() section)

AskAlexSharov avatar Oct 22 '24 09:10 AskAlexSharov

thanks, you're correct about merge, i will work on point3 later

stevemilk avatar Oct 22 '24 16:10 stevemilk

Auite a lot of changes landed in: https://github.com/erigontech/erigon/pull/12339 need somehow merge main

AskAlexSharov avatar Oct 30 '24 02:10 AskAlexSharov

delSeg.canDelete.Store(true)
if delSeg.refcount.Load() == 0 {
    delSeg.closeAndRemoveFiles()
}

we have above code to determine closing files or not. But in fact refcount cannot guard files If the code is called by multiple goroutines because atomic value can only keep itself atomic. Two ways to promise this never happen: 1. add a mutex 2. refcount never be 0 here(start a rotx).

merge and prune both call this code. merge is also a reader so refcount.Load()!=0, but prune is not a reader and does not use a mutex.

so the first modification in this PR is creating a view before prune, also this means we don't need if delSeg.refcount.Load() == 0.

The second modification is to implement this:

You right - we can make sure that we set canDelete.Store(true) after recalcVisibleFiles:

a.integrateMergedDirtyFiles(outs, in) // can don't set `canDelete`
a.recalcVisibleFiles(a.DirtyFilesEndTxNumMinimax()) 
a.cleanAfterMerge(in) // set `canDelete`

(it happening inside a.BeginFilesRo() section)

we still need mark file as canDelete during merge or prune to skip recalc, so i add another flag stale to indicate that files are not visible after recalc and can be remove.

@AskAlexSharov I would appreciate any corrections

stevemilk avatar Nov 28 '24 09:11 stevemilk

delSeg.canDelete.Store(true) // called by Last reader - at it’s Close/Rollback if delSeg.refcount.Load() == 0 { delSeg.closeAndRemoveFiles() }

AskAlexSharov avatar Nov 30 '24 04:11 AskAlexSharov

delSeg.canDelete.Store(true) if delSeg.refcount.Load() == 0 { delSeg.closeAndRemoveFiles() } This code can’t protect from concurrency - you are roght. But we don’t have this code.

we have: refCnt := src.refcount.Add(-1) if refCnt == 0 && src.canDelete.Load() { and this code Can protect from concurrency: all readers will see different refCnt

https://github.com/erigontech/erigon/blob/a0889a7c552ead7513bb30aab16feaa63342f4e8/erigon-lib/state/history.go#L1118

AskAlexSharov avatar Nov 30 '24 04:11 AskAlexSharov

i mean here, during the Prune: https://github.com/erigontech/erigon/blob/baf0a609131bd09f9a5f138be025c5ccd2ca1485/turbo/snapshotsync/snapshots.go#L1281

stevemilk avatar Nov 30 '24 16:11 stevemilk

i didn't know we have it. yes if sn.refcount.Load() == 0 { is a logical race.

  • "last reader" must close files. isLastReader := refcount.Add(-1) == 0. merge - is also a reader.
  • only Merge does canDelete.Store(true) - for simplicity.

Seems if out.refcount.Load() == 0 { also exists in /erigon-lib/state/files_item.go. so weird. but it works just because it happening inside BeginFilesRo :-) seem i need re-visit this place.

AskAlexSharov avatar Dec 03 '24 04:12 AskAlexSharov

one check seems halt, try to re-trigger

stevemilk avatar Dec 05 '24 16:12 stevemilk