
localstore write-in-place inconsistency bug

Open acud opened this issue 3 years ago • 0 comments

There's a bug in the localstore, related to the postage stamp indexing scheme, that results in an HTTP 500 when trying to upload a file through the API.

The error would usually look like:

level=debug msg="bytes upload: split write all: put upload: same slot remove: get value: leveldb: not found" traceID=....

This relates to the code in pkg/localstore/mode_put.go:

func (db *DB) putUpload(
	batch *leveldb.Batch,
	loc *releaseLocations,
	binIDs map[uint8]uint64,
	item shed.Item,
) (exists bool, gcSizeChange int64, err error) {

	previous, err := db.postageIndexIndex.Get(item)
	if err != nil {
		if !errors.Is(err, leveldb.ErrNotFound) {
			return false, 0, fmt.Errorf("postage index get: %w", err)
		}
	} else {
		if item.Immutable {
			return false, 0, ErrOverwrite
		}
		// if a chunk is found with the same postage stamp index,
		// replace it with the new one only if timestamp is later
		if !later(previous, item) {
			return false, 0, nil
		}
		_, err = db.setRemove(batch, previous, true)
		if err != nil {
			return false, 0, fmt.Errorf("same slot remove: %w", err) // <----- Error occurs here
		}

		previousIdx, err := db.retrievalDataIndex.Get(previous)
		if err != nil {
			return false, 0, fmt.Errorf("could not fetch previous item: %w", err)
		}

		l, err := sharky.LocationFromBinary(previousIdx.Location)
		if err != nil {
			return false, 0, err
		}

		loc.add(l)
	}
...

The postage indexing scheme hands out slots within a postage batch. More precisely, each collision bucket in the batch contains a set of specific indexes. When a chunk is stamped by the Stamper, the number of collisions seen so far within its bucket is handed out as the postage batch index and serialized as part of the postage stamp on the chunk. The localstore is aware of this index so that storage space a user has bought can later be reused to store other content (hence the timestamp check when storing), making postage stamps multi-use rather than a one-off event.
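To make the slot assignment concrete, here is a minimal, self-contained Go sketch of the idea described above. The names (toyStamper, stamp) and the bucket count are illustrative assumptions, not Bee's actual types:

package main

import "fmt"

const buckets = 16 // illustrative; the real scheme uses far more collision buckets per batch

// toyStamper tracks, per collision bucket, how many chunks have been
// stamped so far; that count becomes the chunk's index within the bucket.
type toyStamper struct {
	counts [buckets]uint32
}

// stamp derives the bucket from the chunk address and hands out the next
// free index in that bucket: the "number of collisions seen so far".
func (s *toyStamper) stamp(addr []byte) (bucket, index uint32) {
	bucket = uint32(addr[0]) % buckets // real code derives the bucket from the address prefix
	index = s.counts[bucket]
	s.counts[bucket]++
	return bucket, index
}

func main() {
	s := &toyStamper{}
	// 0x2a and 0x3a land in the same bucket here, so the colliding
	// address receives the next index within that bucket.
	for _, a := range [][]byte{{0x2a}, {0x3a}, {0x2a}} {
		b, i := s.stamp(a)
		fmt.Printf("addr=%x bucket=%d index=%d\n", a, b, i)
	}
}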

The bug at hand manifests in the following manner (at least, to the best of my understanding):

  • buy a postage stamp x and use it normally
  • db nuke the node (prior to #3011); the stamp issuers would get deleted
  • the forgetful node would then resync the chain data and reinstate the stamp issuer, but with all collision buckets zeroed out
  • the node would then pull-sync content from the neighborhood, and some chunks that were previously stamped with postage stamp x would find their way back into the node
  • reuse the same postage stamp x and upload content
  • at some point we would stamp a chunk with the same collision bucket index as before; however, it is not the same chunk, and it now has a different address, which triggers the problematic flow shown above. For some reason the localstore cannot find the old chunk that needs to be deleted and replaced (see the sketch after this list)
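A toy reconstruction of the resulting inconsistency, under the assumption that the postage index entry for the slot survives while the old chunk's other entries are gone. Plain maps stand in for the leveldb-backed shed indexes, and all names here are illustrative, not Bee's actual code:

package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("leveldb: not found")

// slot is the (batch, bucket, index) triple carried on the stamp.
type slot struct {
	batch  string
	bucket uint32
	index  uint32
}

type toyStore struct {
	postageIndex map[slot]string   // slot -> chunk address (cf. postageIndexIndex)
	retrieval    map[string][]byte // address -> payload (cf. retrievalDataIndex)
}

// sameSlotRemove mirrors the shape of the removal path: before deleting the
// previous chunk it must look up that chunk's other index entries.
func (s *toyStore) sameSlotRemove(sl slot) error {
	addr, ok := s.postageIndex[sl]
	if !ok {
		return nil
	}
	if _, ok := s.retrieval[addr]; !ok {
		// the dangling reference: the slot still points at a chunk whose
		// entries are gone, e.g. collected without the slot entry being cleaned
		return fmt.Errorf("same slot remove: get value: %w", errNotFound)
	}
	delete(s.retrieval, addr)
	delete(s.postageIndex, sl)
	return nil
}

func main() {
	s := &toyStore{
		postageIndex: map[slot]string{{batch: "x", bucket: 3, index: 0}: "old-chunk-address"},
		retrieval:    map[string][]byte{}, // old chunk's data entry is already gone
	}
	// uploading a *different* chunk stamped with the same slot hits the error:
	fmt.Println(s.sameSlotRemove(slot{batch: "x", bucket: 3, index: 0}))
}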

It is not clear whether this is due to:

  • GC already collecting the old chunk but simply not removing the index entry
  • Some serialization problem with the fields that decorate the chunk inside the localstore index definitions
  • A possible problem with the postage stamp data coming in over the syncing protocol
  • A slice-sharing bug (illustrated in the sketch after this list)
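For the last hypothesis, here is a minimal illustration of the bug class in plain Go, unrelated to any specific Bee code path: reusing a byte buffer while storing references to it makes two index entries alias the same backing array, so one record is silently corrupted and later lookups can surface as not-found errors.

package main

import "fmt"

func main() {
	entries := map[string][]byte{}
	buf := make([]byte, 4)

	copy(buf, []byte{0, 0, 0, 1})
	entries["chunk-a"] = buf // stores a reference to buf's backing array, not a copy

	copy(buf, []byte{0, 0, 0, 2}) // intended for chunk-b, but also rewrites chunk-a's entry
	entries["chunk-b"] = buf

	// both print [0 0 0 2]; if the value encodes an address or location,
	// dereferencing chunk-a's entry now points at the wrong record
	fmt.Println(entries["chunk-a"], entries["chunk-b"])
}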

The change in #3011 should improve the situation, but the problem still needs to be traced and eventually resolved (or, alternatively, the "feature" abandoned for now).

acud · Jun 23 '22 07:06