
Depth Package - storage incentives

Open istae opened this issue 2 years ago • 11 comments

reference https://hackmd.io/yzrmd3GhTDCyDnEkwzrjeg?view

Depth Package (needs a new name)

A new package that monitors the localstore and the storage radius reported by the batchstore, and controls changes to the puller's syncing depth.

func init() {
    initializeDepthFromStatestore()

    waitNodeWarmup()

    if storage_depth == 0 {
        storage_depth = kademlia.connectionDepth()
    }

    go manage()
}

func manage() {
    for {
        if reserve.Size() / reserve.Capacity() < 0.5 {
            storage_depth--
            kademlia.setDepth(storage_depth)
        }
        sleep(5 minutes)
    }
}

// reported by batchstore
func SetStorageRadius(storage_radius) {
    if oldStorageRadius >= storage_depth && storage_radius < storage_depth {
        storage_depth = storage_radius
        kademlia.setDepth(storage_depth)
    }
    if storage_radius > storage_depth {
        storage_depth = storage_radius
        kademlia.setDepth(storage_depth)
    }
    oldStorageRadius = storage_radius
}

NOTES:

  • kademlia has to be aware of the storage depth so it can tweak the neighborhood radius and maintain full connectivity
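
A minimal sketch of that interaction, assuming a hypothetical setDepth hook on the kademlia side; the type and method names below are illustrative and not the actual bee API:

package main

import "fmt"

// kademliaDriver is an illustrative stand-in for the real kademlia component.
// The point of the sketch: when the depth package lowers or raises the storage
// depth, kademlia must re-evaluate which proximity-order bins form the
// neighborhood and therefore require full connectivity.
type kademliaDriver struct {
    depth uint8 // current neighborhood depth
}

// setDepth is the hypothetical hook the depth package would call.
func (k *kademliaDriver) setDepth(d uint8) {
    k.depth = d
    k.rebuildNeighborhood()
}

// rebuildNeighborhood marks every bin at or above the depth as part of the
// neighborhood, where the node is expected to connect to all known peers.
func (k *kademliaDriver) rebuildNeighborhood() {
    for bin := uint8(0); bin < 16; bin++ {
        fmt.Printf("bin %d: full connectivity = %v\n", bin, bin >= k.depth)
    }
}

func main() {
    k := &kademliaDriver{depth: 8}
    k.setDepth(6) // storage depth dropped below the connection depth
}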

istae avatar Jun 20 '22 08:06 istae

Notes from 23/6

  • improve manage loop by monitoring pullsync
    • the depth may be decreased if, after historical syncing for a particular bin, the reserve size is still below 50% of capacity

istae avatar Jun 27 '22 08:06 istae

func manage() {

    secondsInHour := 60 * 60

    adaptationPeriod := false
    var timeAtStart time.Time

    reserveNotHalfFull := func() bool {
        return reserve.Size() / reserve.Capacity() < 0.5
    }

    for {
        sleep(5 minutes)

        if reserveNotHalfFull() && !adaptationPeriod {
            adaptationPeriod = true
            timeAtStart = time.Now()
        }

        if !reserveNotHalfFull() {
            adaptationPeriod = false
        }

        // given the current pull rate (chunks per second) and the time left
        // (one hour minus the seconds since the adaptation period started),
        // is the current rate enough to fill half of the reserve on top of
        // what is already there?
        if adaptationPeriod && pullsync.Rate() * (secondsInHour - time.Since(timeAtStart).Seconds()) < (reserve.Capacity()/2 - reserve.Size()) {
            storage_depth--
            kademlia.setDepth(storage_depth)
        }
    }
}
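
A quick worked example of that condition with made-up numbers (the capacity, current size, rate, and elapsed time below are purely illustrative):

package main

import "fmt"

func main() {
    const (
        reserveCapacity = 4_194_304 // chunks; illustrative, not the real default
        reserveSize     = 1_500_000 // chunks currently in the reserve
        pullRate        = 120.0     // observed pullsync rate, chunks per second
        secondsInHour   = 60 * 60
        elapsed         = 20 * 60 // seconds since the adaptation period started
    )

    // Chunks still missing to reach half of the reserve capacity.
    missing := float64(reserveCapacity)/2 - float64(reserveSize)

    // Chunks we can expect to pull in the rest of the hour at the current rate.
    expected := pullRate * float64(secondsInHour-elapsed)

    fmt.Printf("missing %.0f chunks, expecting %.0f more this hour\n", missing, expected)
    if expected < missing {
        fmt.Println("rate too low: decrease storage_depth by one")
    } else {
        fmt.Println("rate sufficient: keep the current storage_depth")
    }
}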

istae avatar Jun 27 '22 12:06 istae

Is there a possible oscillation here? I have one batch that just unreserved 3,579,137 chunks in my reserve. Since the reserve capacity is just under 5,000,000 chunks, that means that a single storage_depth increment dropped my reserve by more than 50%. This code will then decrease the depth, allowing that batch to re-fill my reserve until it eventually hits capacity, which will evict those same 3.5 million chunks. Rinse and repeat?

These are some additional trace logs in my node specifically tracking the reserve's use of the pin counter:

time="2022-06-23T08:43:49+02:00" level=debug msg="batchstore: Unreserve callback batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc5971065c1df storage radius
0"

time="2022-06-23T08:43:49+02:00" level=debug msg="batchstore: Unreserve callback batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc5971065c1df storage radius
1"
time="2022-06-23T08:44:54+02:00" level=trace msg="pinTrace:evictReserve: Unreserved 1711479 chunks in batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc597106
5c1df"

time="2022-06-23T18:56:55+02:00" level=debug msg="batchstore: Unreserve callback batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc5971065c1df storage radius
 2"
time="2022-06-23T19:00:00+02:00" level=trace msg="pinTrace:evictReserve: Unreserved 2461269 chunks in batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc59710
65c1df"

time="2022-06-24T21:07:22+02:00" level=debug msg="batchstore: Unreserve callback batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc5971065c1df storage radius
 3"
time="2022-06-24T21:12:27+02:00" level=trace msg="pinTrace:evictReserve: Unreserved 3579137 chunks in batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc59710
65c1df"

Batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc5971065c1df is a really huge dataset, namely:

"stamps": [ { "batchID": "0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc5971065c1df", "utilization": 3884, "usable": true, "label": "20210916-OSM-Map", "depth": 30, "amount": "441429710", "bucketDepth": 16, "blockNumber": 18125458, "immutableFlag": false, "exists": true, "batchTTL": 520364702 } ]

And I'm routinely fetching large portions of that stamped dataset into my bee node.
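
For a sense of scale: a postage batch of depth 30 can stamp at most 2^30 chunks, and at 4 KB per chunk that is on the order of 4 TB of data. A small sketch of that arithmetic, using the depth and bucketDepth from the /stamps response above (the per-bucket numbers are theoretical maxima, not actual usage):

package main

import "fmt"

func main() {
    const (
        batchDepth  = 30   // "depth" from the /stamps response
        bucketDepth = 16   // "bucketDepth" from the /stamps response
        chunkSize   = 4096 // bytes per chunk
    )

    maxChunks := uint64(1) << batchDepth  // 2^30 stampable chunks
    buckets := uint64(1) << bucketDepth   // 2^16 collision buckets
    slotsPerBucket := maxChunks / buckets // 2^(30-16) stamps per bucket

    fmt.Printf("max chunks: %d (~%.1f TB)\n", maxChunks, float64(maxChunks)*chunkSize/1e12)
    fmt.Printf("buckets: %d, slots per bucket: %d\n", buckets, slotsPerBucket)
}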

ldeffenb avatar Jun 27 '22 12:06 ldeffenb

[Screenshots: StorageRadius-1, StorageRadius-2]

And headed to yet another Unreserve eviction when the Storage Radius goes to 4, hopefully later today or tomorrow.

ldeffenb avatar Jun 27 '22 12:06 ldeffenb

And a new behavior. Any idea why the Storage Radius incremented twice in quick succession? I'll be providing logs for this later today.

time="2022-06-28T18:18:12+02:00" level=debug msg="batchstore: Unreserve callback batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc5971065c1df storage radius
 4"
time="2022-06-28T18:28:04+02:00" level=trace msg="pinTrace:evictReserve: Unreserved 5445988 chunks in batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc59710
65c1df"
time="2022-06-28T18:28:05+02:00" level=debug msg="batchstore: Unreserve callback batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc5971065c1df storage radius
 5"
time="2022-06-28T18:30:29+02:00" level=trace msg="pinTrace:evictReserve: Unreserved 2719337 chunks in batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc59710
65c1df"

ldeffenb avatar Jun 28 '22 17:06 ldeffenb

[Screenshots: StorageRadius-3, StorageRadius-4]

ldeffenb avatar Jun 28 '22 17:06 ldeffenb

@ldeffenb can you post the responses from /batches and /reservestate? What version of bee are you running as well?

istae avatar Jun 28 '22 18:06 istae

bee version - 1.6.2-684a7384-dirty. The dirty build is due to local patches: additional pin counter logging, pinning and stewardship do not traverse manifests, and increased kademlia connection limits.

/reservestate: { "radius": 11, "storageRadius": 5, "commitment": 7803764736 }

(Never knew about this one! Learn something new every day!) batches.txt

And for extra information, here's /stamps/0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc5971065c1df/buckets from the node that owns the stamp batch: buckets.txt

And /stamps from that same owning node: { "batchID": "0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc5971065c1df", "utilization": 3884, "usable": true, "label": "20210916-OSM-Map", "depth": 30, "amount": "441429710", "bucketDepth": 16, "blockNumber": 18125458, "immutableFlag": false, "exists": true, "batchTTL": 520258952 }

ldeffenb avatar Jun 28 '22 19:06 ldeffenb

Notes from 7/27

  1. reaching 50% utilization within 2 hours can only be done with a pullsync rate of at least 2 MB/s; we should probably go higher with this time range to stay on the safe side, and if we reach 50% utilization it should be noticed earlier anyway

suggestion: increase time window to 5 hours?

  2. checking every 5 minutes might be too often; it can lead a node whose effective observed pull rate is lower than the minimum presumed pull rate to reach storage depth 0 too soon

suggestion:

  • keep the check every 5 minutes only for the purpose of checking whether the reserve is (still?) above 50%,
  • introduce a new interval check (every 1 hour or 30 minutes or so) to actually decrease the storage depth if utilization does not seem on track to reach 50% in the given time, based on the current rate?

  3. related but borderline in scope: radius shrinking incorrectly leads to a matching storage radius shrinking, which is not logically correct

suggestion: have the storage radius track the reserve size instead of tracking the radius

  4. 'fully synced indicator' as a feature

suggestion: after reaching 50% utilization, start a sync-completion period that gives just as much time to complete as the original syncing to 50% would have taken at the rate observed while reaching 50%.

so maybe if we already started the node at 25%, and reaching 50% took 1 hour, start a 'sync-completion period' of 2 hours, after which we can set the 'fully synced indicator' to true, and keep it true until the storage radius decreases (noticed from the reserve size after garbage collection, and/or the 5 minute interval check for the adaptation period); a sketch of that calculation follows below
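
A rough sketch of how such a sync-completion window could be derived from the observed rate, using the 25% to 50% example above (the names and structure are illustrative, not existing bee code):

package main

import (
    "fmt"
    "time"
)

// syncCompletionWindow returns how long to wait after the reserve crosses 50%
// utilization before setting the 'fully synced indicator'. The remaining sync
// gets as much time as the climb to 50% would have taken at the observed rate.
func syncCompletionWindow(startFraction float64, timeToHalf time.Duration) time.Duration {
    synced := 0.5 - startFraction // fraction of the reserve synced so far
    if synced <= 0 {
        return 0
    }
    rate := synced / timeToHalf.Seconds()                    // reserve fraction per second
    return time.Duration(0.5 / rate * float64(time.Second))  // time to sync the other half
}

func main() {
    // Node started at 25% utilization and took 1 hour to reach 50%,
    // so the sync-completion period comes out to 2 hours.
    fmt.Println(syncCompletionWindow(0.25, time.Hour))
}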

metacertain avatar Jul 27 '22 16:07 metacertain

Notes from 7/29

  1. for the criterion for decreasing the storage radius, the estimation based on the remaining time multiplied by the observed rate can be simplified to observing the historical pullsync rate dropping to 0, which can still be checked every 5-10 minutes.

  2. however, even if that rate is measured, some things left to accommodate are:

  • are there any neighbors connected? if not, the storage radius should not decrease
  • do the connected neighbors accept pullsync requests from the node? if for some reason they do not, we could still erroneously decrease the storage radius indefinitely

suggestion:

  • check that the number of neighbors (connections from PO[min(storage depth, connection depth)] onwards) is > 2 as a further condition for decreasing the storage radius (a sketch of this check follows below)
  • in pullsync, a node needs to accept pullsync requests from nodes whose PO is below the current depth (from depth - 3, for example), where current depth means PO[min(storage depth, connection depth)]
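
A minimal sketch of the first condition, checking for enough connected neighbors at or above the effective depth before allowing a decrease (the peer-count source and names are illustrative, not the actual kademlia API):

package main

import "fmt"

func minDepth(a, b uint8) uint8 {
    if a < b {
        return a
    }
    return b
}

// canDecreaseStorageDepth applies the suggested guard: only shrink the storage
// depth if more than two peers are connected at or above the effective depth,
// i.e. min(storage depth, connection depth).
func canDecreaseStorageDepth(storageDepth, connectionDepth uint8, peersAtOrAbove func(bin uint8) int) bool {
    effective := minDepth(storageDepth, connectionDepth)
    return peersAtOrAbove(effective) > 2
}

func main() {
    // Illustrative connected-peer counts per proximity order.
    counts := map[uint8]int{4: 7, 5: 3, 6: 1}
    peers := func(bin uint8) int {
        total := 0
        for po, n := range counts {
            if po >= bin {
                total += n
            }
        }
        return total
    }
    fmt.Println(canDecreaseStorageDepth(6, 5, peers)) // effective depth 5 -> 4 peers -> true
}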

metacertain avatar Aug 01 '22 11:08 metacertain

Observation on 8/2

When the reserve reaches the desired size after decrementing the storage radius, the syncing rate may be high enough that the reserve becomes full, and the resulting large eviction can drop the size back below the half mark. The storage radius may then begin to oscillate; to prevent this, we should increment the storage radius by one.
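
One way to read that as a rule, sketched with illustrative names (a hedged interpretation of the observation, not the implemented logic): when the drop below half was caused by a capacity-triggered eviction, back off by incrementing the radius instead of decrementing it.

package main

import "fmt"

// nextStorageRadius sketches the anti-oscillation rule from the observation:
// if the reserve just hit capacity and a large eviction pushed it back below
// 50%, do not decrease the radius (which would re-sync the evicted chunks);
// increment it by one instead.
func nextStorageRadius(radius uint8, belowHalf, evictedAfterFull bool) uint8 {
    switch {
    case belowHalf && evictedAfterFull:
        return radius + 1 // back off to avoid re-filling and re-evicting
    case belowHalf && radius > 0:
        return radius - 1 // genuinely under-utilized: pull from a wider area
    default:
        return radius
    }
}

func main() {
    fmt.Println(nextStorageRadius(5, true, true))   // 6: eviction-driven drop
    fmt.Println(nextStorageRadius(5, true, false))  // 4: ordinary under-utilization
    fmt.Println(nextStorageRadius(5, false, false)) // 5: no change
}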

istae avatar Aug 02 '22 08:08 istae

Release Testing

Storage Radius and Connection Depth Dashboard

A way to confirm that the monitor is working as intended is to keep track of the storage radius and connection depth values. The depth reported by kademlia should be the minimum of the storage radius and the connection depth.
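
The invariant the dashboard checks boils down to a one-line relation; a tiny illustrative helper (not actual bee code):

package main

import "fmt"

// reportedDepth expresses the expected relation: the depth kademlia reports
// should equal the minimum of the storage radius and the connection depth.
func reportedDepth(storageRadius, connectionDepth uint8) uint8 {
    if storageRadius < connectionDepth {
        return storageRadius
    }
    return connectionDepth
}

func main() {
    fmt.Println(reportedDepth(6, 9))  // 6: storage radius is the limiting value
    fmt.Println(reportedDepth(10, 7)) // 7: connection depth is the limiting value
}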

Full Nodes Testing

  • [x] storage radius for nuked nodes eventually comes to a meaningful radius
  • [x] on restarts, the storage radius continues at the same value

Light nodes Testing

  • [x] kademlia depth is solely based on connection depth

istae avatar Aug 18 '22 09:08 istae