使用Bitmap来存储用户标签

Open Caaalabash opened this issue 2 years ago • 0 comments

使用Bitmap来存储用户标签

1.背景：

假设现在有 m 个用户，n 个标签，如何设计标签系统，以便于更快的提供如下接口：

一个标签下的所有用户

多个标签下的所有用户

一个用户所有标签

多个用户的所有标签

容易想到，用户一张表、标签一张表，额外维护一张关联关系表如下：

tag	uid	result
vip	1	1
mobile	1	0
vip	2	0
mobile	2	1
vip	3	1
vip	4	1

缺陷：

存储：因为每个行记录同时标明了用户、标签、和结果，所以其中的重复数据非常的多，对数据库存储是个极大地浪费
速度：处理接口2、接口4的SQL语句类似于select uid from table where result = 1 and tag in (....)，在数据量大时对性能影响非常大

2. 位图 Bitmap

顾名思义，位图显然既用到了位运算，又用到了哈希表的思想，它能有效的解决两类问题：

存储大量值可以用布尔值标识的数据 => 省空间
部分有用到交、并、差等集合运算的数据 => 位运算效率高

2.1 哈希表思想的体现：

上方表格中，如果要展示所有用户的vip状态，其结构为：

{
  1: true,
  2: false,
  3: true,
  4: true,
  1000001: true
}

而用位图，表示为：

           第1000001位
01011......1

可以发现一个用户只需要1bit的存储空间，假设 100 万个用户，仅需1000000bit / 8 / 1024 = 122.0703125 KB存储空间，非常赚！

这里会引出两个问题：

MySQL自增整数id没问题，如果是Mongo这类ObjectId怎么处理？或者说数据库分库分表了怎么处理？

这是一个工程问题，不由 BitMap 来解决。核心就是把 id 映射成一个 BitMap 中的一个 bit。

如果 id 过于稀疏会不会浪费空间？假设只有两个用户，一个id为1，一个id为4000000，那存储的数据大概就类似: 01000000000…0000000001，中间浪费的空间如何处理？

使用压缩位图RoaringBitmap or 调整 id 映射关系，具体搜索 bitmap 压缩算法

2.2 位运算的场景：

假设已经了解了位运算的概念，那么可以开始如下运算：

按位与&：0100110 & 0101001 = 0100000

// 所有VIP用户（id为1、4、5）
0100110
// 所有男性用户（id为1、3、6）
0101001

// 计算所有男性VIP用户：（id为1）
0100000

按位异或^：0111111 ^ 0101001 = 0010110

// 所有用户（id为1、2、3、4、5、6）
0111111
// 所有男性用户（id为1、3、6）
0101001

// 计算所有非男性用户（id为2、4、5）
0010110

3. NodeJS + Redis 实例

这里也有一个问题：为什么用Redis？

Redis不是必须的，只是原生提供了可以对字符串进行位操作的命令，方便罢了

背景：一张标签表（id、name），一张模版表（id）实现上面的4个接口

初始化启动应用时，什么也不做，我的redis里面空空的，现在什么关联关系都没有
查询一个模版，需要查询它的标签（即3：一个用户所有标签），调用 bitMapConnect.getAllTagByUser([templateId])
查询指定标签的所有模版（即2：多个标签下的所有用户），调用 bitMapConnect.getAllUserByTag(tagIdList)
更新模版的标签，比对新旧标签差异，对新增项调用bitMapConnect.update(id, tagId, true)，对删除项调用bitMapConnect.update(id, tagId, false)

BitMapConnect类简单实现如下

class BitMapConnect {
  constructor(redis, tagModel, userModel, options = {}) {
    this.redis = redis
    this.tagPrefix = options.tagPrefix || 'tag'
    this.userPrefix = options.userPrefix || 'user'
    this.namespace = options.namespace || 'ns'
    this.sid = 0
    this.tagModel = tagModel
    this.userModel = userModel
  }
  async query(list, findUser) {
    const model = findUser ? this.userModel : this.tagModel
    const prefix = `${this.namespace}:${findUser ? this.tagPrefix : this.userPrefix}`

    // 位运算结果存储key，使用后删除
    const destKey = `${this.namespace}:temp:${this.sid++}`
    await this.redis.bitOp("AND", destKey, list.map(id => `${prefix}:${id}`))
    const charList = await this.redis.get(destKey).then(res => (res || "").split(""))
    await this.redis.del(destKey)

    const result = []

    for (let i = 0; i < charList.length; i++) {
      // 转为二进制字符串，不满8位需要补齐偏移量
      const char = charList[0].charCodeAt(0).toString(2)
      const offset = 8 - char.length
      for (let j = 0; j < char.length; j++) {
        if (char[j] === "1") {
          result.push(i * 8 + j + offset)
        }
      }
    }

    if (!result.length) return []
    return model.findAll({ where: { id: { [Op.in]: result } }, raw: true })
  }
  update(userId, tagId, isAdd) {
    const tagKey = `${this.namespace}:${this.tagPrefix}:${tagId}`
    const userKey = `${this.namespace}:${this.userPrefix}:${userId}`
    const val = isAdd ? 1 : 0

    return [this.redis.setBit(tagKey, userId, val), this.redis.setBit(userKey, tagId, val)]
  }
  // 通过tagList，找到所有user
  getAllUserByTag(tagList) {
    return this.query(tagList, true)
  }
  // 通过userList，找到所有tag
  getAllTagByUser(userList) {
    return this.query(userList, false)
  }
}

假设100万用户，1000个标签，内存占用计算为：

100万用户 * 1000标签 * 1bit / 8bit / 1024 / 1024 = 119.20928955078125 MB

redis命令耗时（单位微秒，不过样本较少）：

cmdstat_del:calls=39,usec=315,usec_per_call=8.08
cmdstat_get:calls=90,usec=774,usec_per_call=8.60

cmdstat_getbit:calls=9,usec=56,usec_per_call=6.22
cmdstat_bitop:calls=77,usec=1167,usec_per_call=15.16
cmdstat_setbit:calls=102,usec=363,usec_per_call=3.56

Feb 21 '23 08:02 Caaalabash

blog-frontend blog-frontend copied to clipboard