metrictank
metrictank copied to clipboard
dont hold write lock for entire duration of index delete calls
When an index Delete is called, the entire Delete operation is performed while a write lock is held. While the lock is held, no metrics can be ingested and no other index searches can run.
To avoid problems , the Delete function https://github.com/grafana/metrictank/blob/master/idx/memory/memory.go#L1492 should instead
- use a read lock and perform a find() to get all matching nodes, then release the read lock.
- for each matching node, acquire a read lock, then traverse the tree to find all matching leaf nodes, then release the read lock.
- for each matching leaf node, acquire a write lock, then delete the leaf node, which will cascade down the tree and clean up any branches that have no children.
eg something like
var deletedDefs []idx.Archive
pre := time.Now()
defer func() {
if len(deletedDefs) == 0 {
return
}
if m.findCache == nil {
return
}
// asynchronously invalidate any findCache entries
// that match any of the deleted series.
if len(deletedDefs) > findCacheInvalidateQueueSize {
m.findCache.Purge(orgId)
} else {
for _, d := range deletedDefs {
m.findCache.InvalidateFor(orgId, d.NameWithTags())
}
}
}()
m.RLock()
tree, ok := m.tree[orgId]
if !ok {
m.RUnlock()
return nil, nil
}
found, err := find(tree, pattern)
m.RUnlock()
if err != nil {
return nil, err
}
leafNodes := make([]*Node, 0)
for _, node := range found {
m.RLock()
// GetLeaves would be some recursive function like https://github.com/grafana/metrictank/blob/master/idx/memory/memory.go#L1537-L1552
leafNodes = append(leafNodes, node.GetLeaves())
m.RUnlock()
}
for _, node := range leafNodes {
m.Lock() // should probably use the same lock rate limiting that is used in Prune
deleted := m.delete(orgId, node, true, false)
deletedDefs = append(deletedDefs, deleted...)
m.Unlock()
}
statMetricsActive.DecUint32(uint32(len(deletedDefs)))
statDeleteDuration.Value(time.Since(pre))
return deletedDefs, nil
}
https://github.com/grafana/metrictank/pull/1897 might fix it, if not @robert-milan has another branch that should and will do a PR for