roaring-rs icon indicating copy to clipboard operation
roaring-rs copied to clipboard

Feature Request: `RoaringBitmap::from_lsb0_bytes`

Open lemolatoon opened this issue 1 year ago • 9 comments

The Feature Explanation

  • Adding new creation function for RoaringBitmap
 pub fn from_lsb0_bytes(offset: u32, bytes: &[u8]) -> RoaringBitmap

Function Behavior

  • Interpret bytes as little endian bytes bitmap, and construct RoaringBitmap
  • offset can be used to offset the passing bitmap's index
  • If offset is not aligned to # of bits of Container's Store::Bitmap (# of bits of Box<[u64; 1024]>), this function panics
use roaring::RoaringBitmap;

let bytes = [0b00000101, 0b00000010, 0b00000000, 0b10000000];
//             ^^^^^^^^    ^^^^^^^^    ^^^^^^^^    ^^^^^^^^
//             76543210          98
let rb = RoaringBitmap::from_lsb0_bytes(0, &bytes);
assert!(rb.contains(0));
assert!(!rb.contains(1));
assert!(rb.contains(2));
assert!(rb.contains(9));
assert!(rb.contains(31));

let rb = RoaringBitmap::from_lsb0_bytes(8, &bytes);
assert!(rb.contains(8));
assert!(!rb.contains(9));
assert!(rb.contains(10));
assert!(rb.contains(17));
assert!(rb.contains(39));

Motivation

Sometimes bitmap is calculated by SIMD instructions. The result of SIMD instruction is likely to be already bitmask, not the series of bitmap indicies.

Under current implementation, when you intend use RoaringBitmap with bitmask produced by SIMD instruction, you have to use RoaringBitmap::sorted_iter or just insert one by one.

To solve this problem, I implemented RoaringBitmap::from_bitmap_bytes, which can be used to construct directly from bitmask.

Example of Production of Bitmask by SIMD instructions

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=40ddd13554c171be31fe53893401d40f

use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
unsafe fn compare_u8_avx2(a: &[u8], b: &[u8]) -> u32 {
    assert!(
        a.len() == 32 && b.len() == 32,
        "Inputs must have a length of 32."
    );

    // Load the data into 256-bit AVX2 registers
    let a_vec = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
    let b_vec = _mm256_loadu_si256(b.as_ptr() as *const __m256i);

    // Perform comparison (a == b)
    let cmp_result = _mm256_cmpeq_epi8(a_vec, b_vec);

    // Extract the comparison result as a bitmask
    let mask = _mm256_movemask_epi8(cmp_result);

    mask as u32
}

fn main() {
    let a: [u8; 32] = [
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
        26, 27, 28, 29, 30, 31, 32,
    ];
    let b: [u8; 32] = [
        1, 0, 3, 4, 5, 6, 7, 8, 0, 10, 11, 12, 13, 0, 15, 16, 0, 18, 19, 20, 21, 0, 23, 24, 0, 26,
        27, 0, 29, 0, 31, 32,
    ];

    let mask = unsafe {
        compare_u8_avx2(&a, &b)
    };
    println!("Bitmask: {:#034b}", mask);
    // Bitmask: 0b11010110110111101101111011111101
    print!("Bitmask (little endian u8): ");
    for b in mask.to_le_bytes() {
        print!("{:08b} ", b);
    }
    println!();
    // Bitmask (little endian u8): 11111101 11011110 11011110 11010110 
    
    let n = 2;
    println!("Bitmask at {n}: {}", mask & (1 << n) != 0);
    // Bitmask at 2: true
}

Benchmark Result

On my laptop (Apple M3 MacBook Air Sonoma14.3 Memory 16 GB), in most cases from_lsb0_bytes is much faster than from_sorted_iter.

Part of Results

creation/from_bitmap_bytes/census-income_srt                                                                             
                        time:   [984.25 µs 987.00 µs 990.37 µs]
                        thrpt:  [6.1521 Gelem/s 6.1731 Gelem/s 6.1904 Gelem/s]
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
creation/from_sorted_iter/census-income_srt                                                                            
                        time:   [23.383 ms 23.397 ms 23.413 ms]
                        thrpt:  [260.24 Melem/s 260.41 Melem/s 260.57 Melem/s]

lemolatoon avatar Aug 21 '24 00:08 lemolatoon