blog 动手写分布式缓存 - GeeCache第四天一致性哈希(hash)

https://geektutu.com/post/geecache-day4.html

7天用 Go语言/golang 从零实现分布式缓存 GeeCache 教程(7 days implement golang distributed cache from scratch tutorial)，动手写分布式缓存，参照 groupcache 的实现。本文介绍了一致性哈希(consistent hashing)的原理、实现以及相关测试用例，一致性哈希为什么能避免缓存雪崩，虚拟节点为什么能解决数据倾斜的问题。

Feb 16 '20 17:02 geektutu

妙啊

Mar 12 '20 10:03 xiaoxfan

开始感受到基础不够扎实对我造成的伤害了

Apr 16 '20 08:04 FWangZil

return m.hashMap[m.keys[idx%len(m.keys)]] 这个idx不是一定小于len(m.keys)吗那么idx%len(m.keys)不就等于idx吗

Apr 16 '20 15:04 walkmiao

正文中写了，idx 可能等于 len(m.keys)，key 比 m.keys 都大的时候。

Apr 18 '20 01:04 geektutu

@walkmiao return m.hashMap[m.keys[idx%len(m.keys)]] 这个idx不是一定小于len(m.keys)吗那么idx%len(m.keys)不就等于idx吗

sort.Search(...)函数点进去看一下就知道了，Search returns the first true index. If there is no such index, Search returns n.

Apr 18 '20 06:04 kkBill

好的谢谢 lch 邮箱：[email protected] 签名由网易邮箱大师定制在2020年04月18日 14:08，Jinyong Mao 写道： @walkmiao return m.hashMap[m.keys[idx%len(m.keys)]] 这个idx不是一定小于len(m.keys)吗那么idx%len(m.keys)不就等于idx吗 sort.Search(...)函数点进去看一下就知道了，Search returns the first true index. If there is no such index, Search returns n. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

Apr 19 '20 00:04 walkmiao

通俗易懂！hashring这里要不要考虑加上mutex ？

May 05 '20 03:05 ghost

是否应该加上删除节点的操作，这样当节点退出的时候，可以把缓存分摊给其他节点

May 08 '20 11:05 zhaozong

对，我也这么想。但是那样就更复杂了，估计要第八天

May 08 '20 17:05 ghost

@zhaozong 是否应该加上删除节点的操作，这样当节点退出的时候，可以把缓存分摊给其他节点

删除只需要删除掉节点对应的虚拟节点和映射关系，至于均摊给其他节点，那是删除之后自然会发生的

// Remove use to remove a key and its virtual keys on the ring and map
func (m *Map) Remove(key string) {
	for i := 0; i < m.replicas; i++ {
		hash := int(m.hash([]byte(strconv.Itoa(i) + key)))
		idx := sort.SearchInts(m.keys, hash)
		m.keys = append(m.keys[:idx], m.keys[idx+1:]...)
		delete(m.hashmap, hash)
	}
}

Nov 05 '20 11:11 man-fish

我觉得hash函数的选择和哈希倾斜也会有很大的关系吧，如果hash函数选择有问题，就算增加再多的虚拟节点，节点也有可能分布在哈希环的某一小段，不知道一般业界是选择什么样的哈希函数的，有没有相关参考资料

Dec 08 '20 07:12 shiluoye

@shiluoye 一般来说，哈希函数考虑两个点：一个是碰撞率，一个是性能。比如 CRC、MD5、SHA1。

对于缓存来说，hash 之后再根据节点数量取模，因此 hash 函数的碰撞率影响并不大，而是模的大小，也就是节点的数量比较关键，这也是引入虚拟节点的原因，但是缓存对性能比较敏感。

而对于需要完整性校验的场合，碰撞率比较关键，而性能就比较次要了。一般使用 256位的 SHA1 算法，MD5 已经不再推荐了。CRC 即循环冗余校验，编码简单，性能高，但安全性就很差了。作为缓存的 hash 算法还是很合适的。

Dec 08 '20 09:12 geektutu

sort.Search 好像不能保证返回第一个大于hash的key值

Dec 10 '20 11:12 wanboyan

@wanboyan 举个例子？

Dec 10 '20 11:12 geektutu

有点看不懂了

Dec 15 '20 11:12 hzylyq

如果在32位的系统上， uint32强转成int，运行是否会异常？这里强转成int是为了使用sort.Ints函数吗？

Dec 16 '20 13:12 FinaLone

@FinaLone 这里是没关系的，即使转换成负数，并不影响后面的逻辑。使用 int 的确是方便使用各种函数。包括 sort.Ints sort.Search 等。

Dec 16 '20 15:12 geektutu

这里当节点数量变化后（新增节点/删除节点），只是简单的通过回调函数去加载一次数据吗，不考虑迁移变化节点的缓存吗，是否会同时造成缓存雪崩

Jan 29 '21 02:01 boyl

子曰：温故而知新

Mar 08 '21 01:03 wilgx0

@boyl 这里当节点数量变化后（新增节点/删除节点），只是简单的通过回调函数去加载一次数据吗，不考虑迁移变化节点的缓存吗，是否会同时造成缓存雪崩

雪崩是全部失效，用分布式一致性算法后只有部分失效

Apr 19 '21 14:04 callmePicacho

	hash := int(m.hash([]byte(key)))
	// Binary search for appropriate replica.
	idx := sort.Search(len(m.keys), func(i int) bool {
		return m.keys[i] >= hash
	})

	return m.hashMap[m.keys[idx%len(m.keys)]]

对于这段代码，sort.Search是在[0,len(m.keys))的范围二分查找最小满足的索引, 也就是idx的结果要么为-1，要么为[0,len(m.keys))之间. 那后面return语句中, idx没有必要除余len(m.keys)了吧

May 22 '21 13:05 whitebluepants

@whitebluepants
	hash := int(m.hash([]byte(key)))
	// Binary search for appropriate replica.
	idx := sort.Search(len(m.keys), func(i int) bool {
		return m.keys[i] >= hash
	})

	return m.hashMap[m.keys[idx%len(m.keys)]]
对于这段代码，sort.Search是在[0,len(m.keys))的范围二分查找最小满足的索引, 也就是idx的结果要么为-1，要么为[0,len(m.keys))之间. 那后面return语句中, idx没有必要除余len(m.keys)了吧

sort.Search是在[0,len(m.keys)]的范围

nums := []int{1,2,3,4,5,6}
	target := 7
	idx := sort.Search(len(nums), func(i int) bool {
		return nums[i] >= target
	})

	log.Printf("Len(nums) : %d, idx : %d\n", len(nums), idx)
	// 2021/05/25 18:45:03 Len(nums) : 6, idx : 6

May 25 '21 10:05 z-learner

改成了基于写时复制的并发读写的形式，添加了 Remove 操作（来自评论区）

type Hash func(data []byte)uint32

type Map struct {
	sync.Mutex
	// 哈希函数
	hash Hash
	// 虚拟节点倍数
	replicas int
	
	// 原子地存取 keys 和 hashMap
	values atomic.Value // values
}

type values struct {
	// 哈希环
	keys []int
	// 虚拟节点与真实节点的映射
	hashMap map[int]string
}

func NewMap(replicas int, hashFunc Hash) *Map {
	m := &Map{
		replicas: replicas,
		hash: hashFunc,
	}
	m.values.Store(&values{
		hashMap: make(map[int]string),
	})
	if m.hash == nil {
		m.hash = crc32.ChecksumIEEE
	}
	return m
}

// 添加节点
func (m *Map) Add(keys ...string) {
	m.Lock()
	defer m.Unlock()
	newValues := m.loadValues()
	for _, key := range keys {
		// 对每个 key(节点) 创建 m.replicas 个虚拟节点
		for i := 0; i < m.replicas; i++ {
			hash := int(m.hash([]byte(strconv.Itoa(i) + key)))
			newValues.keys = append(newValues.keys, hash)
			newValues.hashMap[hash] = key
		}
	}
	sort.Ints(newValues.keys)
	m.values.Store(newValues)
}

func (m *Map) Get(key string) string {
    values := m.loadValues()
	if len(values.keys) == 0 {
		return ""
	}
	hash := int(m.hash([]byte(key)))
	idx := sort.Search(len(values.keys), func(i int) bool {
		return values.keys[i] >= hash
	})
	// 如果 idx == len(m.keys)，说明应选择 m.keys[0]，
	// 因为 m.keys 是一个环状结构，用取余数的方式来处理这种情况
	return values.hashMap[values.keys[idx % len(values.keys)]]
}

func (m *Map) Remove(key string) {
	m.Lock()
	defer m.Unlock()
	newValues := m.loadValues()

	for i := 0; i < m.replicas; i++ {
		hash := int(m.hash([]byte(strconv.Itoa(i) + key)))
		idx := sort.SearchInts(newValues.keys, hash)
		if newValues.keys[idx] != hash {
			return
		}
		newValues.keys = append(newValues.keys[:idx], newValues.keys[idx+1:]...)
		delete(newValues.hashMap, hash)
	}

	m.values.Store(newValues)
}

func (m *Map) loadValues() *values {
    return m.values.Load().(*values)
}

func (m *Map) copyValues() *values {
	oldValues := m.loadValues()
	newValues := &values{
		keys:    make([]int, len(oldValues.keys)),
		hashMap: make(map[int]string),
	}
	copy(newValues.keys, oldValues.keys)
	for k, v := range oldValues.hashMap {
		newValues.hashMap[k] = v
	}
	return newValues
}

Jun 04 '21 15:06 yuanyb

开始感受到基础不够扎实对我造成的伤害了

Jul 26 '21 09:07 spiritbird

idx := sort.Search(len(m.keys), func(i int) bool {
return m.keys[i] >= （hash % m.keys[len(m.keys)-1]) // 这里要取模，否则大于虚拟节点最大值的key都会跑到索引为0的虚拟节点了
})

Aug 15 '21 15:08 shuifa

@oushuifa

idx := sort.Search(len(m.keys), func(i int) bool {
return m.keys[i] >= （hash % m.keys[len(m.keys)-1]) // 这里要取模，否则大于虚拟节点最大值的key都会跑到索引为0的虚拟节点了
})

你的担忧是多余的：

CRC32 计算 HASH 值并不依赖输入值大小，来决定返回值大小；所以我们可以将其 HASH 值视为分布均匀的
大于最大值时顺时针查找节点索引到 0 号节点，符合一致性哈希查找原则
况且你这样如果 m.keys[len(m.keys)-1] 正好是一个较小哈希值，此时索引到前半部分的几率大大增加，而后半部分几乎不会被索引到；反而破坏了原本的均衡性。

想解决均衡性问题，正确的姿势就是加大虚拟节点数值