erigon icon indicating copy to clipboard operation
erigon copied to clipboard

Dirty Segment re-open is not internally thread safe

Open mh0lt opened this issue 1 year ago • 1 comments

These code sections are not thread safe, becuase testing the refcount does not protect the variables in the critical sections from multiple access. This is becuase nothing stops the refcount from incrementing after it is accesses, but before the code section is completed:

 func (s *DirtySegment) Reopen(dir string) (err error) {
	if s.refcount.Load() == 0 {
		s.closeSeg()
		s.Decompressor, err = seg.NewDecompressor(filepath.Join(dir, s.FileName()))
		if err != nil {
			return fmt.Errorf("%w, fileName: %s", err, s.FileName())
		}
	}
	return nil
}
func (s *DirtySegment) ReopenIdxIfNeed(dir string, optimistic bool) (err error) {
	if len(s.Type().IdxFileNames(s.version, s.from, s.to)) == 0 {
		return nil
	}

	if s.refcount.Load() == 0 {
		err = s.reopenIdx(dir)

		if err != nil {
			if !errors.Is(err, os.ErrNotExist) {
				if optimistic {
					log.Warn("[snapshots] open index", "err", err)
				} else {
					return err
				}
			}
		}
	}

	return nil
}

At the moment this is protected by a hot fix which locks the dirty segments during tx creation e.g.

This is not ideal and is likely to be error prone because:

  • It will not be obvious to callers of BeginRo that they need to do this
  • The lock is overkill as it locks the whole segment when a local lock will suffice and is likely to cause performance issues due to lock contention.
func (s *RoSnapshots) View() *View {
	s.visibleSegmentsLock.RLock()
	defer s.visibleSegmentsLock.RUnlock()

	var sgs btree.Map[snaptype.Enum, *RoTx]
	s.segments.Scan(func(segtype snaptype.Enum, value *Segments) bool {
		// BeginRo increments refcount - which is contended
		s.dirtySegmentsLock.RLock()
		defer s.dirtySegmentsLock.RUnlock()
		sgs.Set(segtype, value.BeginRo())
		return true
	})
	return &View{s: s, VisibleSegments: sgs, baseSegType: coresnaptype.Transactions} // Transactions is the last segment to be processed, so it's the most reliable.
}

A possible solution to this is to use an RWMutex with try lock instead, the protected code would then look like this:

 func (s *DirtySegment) Reopen(dir string) (err error) {
	if s.lock.TryLock() {
                err = func() error {
                    defer s.lock.Unlock()
		    s.closeSeg()
		    s.Decompressor, err = seg.NewDecompressor(filepath.Join(dir, s.FileName()))
                    return err
                   }()
		   if err != nil {
			return fmt.Errorf("%w, fileName: %s", err, s.FileName())
		  }
	}
	return nil
}

mh0lt avatar Oct 14 '24 10:10 mh0lt

Step towards: https://github.com/erigontech/erigon/pull/12380 ( i will merge it after alpha5 )

In my head:

  • .View() must be lock-free: means it will do only s.visible.Load() (where visible is atomic.Pointer to a slice. this slice is pre-calculated in background by recalcVisibleFiles())
  • .View() will not touch dirty field - so it will not need dirtyMutex.lock()

Seems we are couple PR's from there.

AskAlexSharov avatar Oct 22 '24 06:10 AskAlexSharov

seems fixed. feel free to re-open if not.

AskAlexSharov avatar Nov 15 '24 02:11 AskAlexSharov