
How to do a parametrized & expensive setup before benchmarking?

Open · mamcx opened this issue 2 years ago · 1 comment

I'm trying to benchmark some database code and found that setting up the database is pretty complicated.

I need to create a new database on disk to run the benchmark, but I can't see how to do that once per benchmark without doing it inside the benchmarked function.

The problem is that the setup sections take no parameter telling me which run it is, so I can't use a pool or something similar to split the DB creation.

If I set up the DB outside the bench, it gets "copied" and all the benches run against the same DB on disk.

```rust
    #[derive(Debug)]
    struct Data {
        a: i32,
        b: u64,
        c: String,
    }

    impl Data {
        pub fn new(a: i32) -> Self {
            let b = (a + 13153) as u64;
            Self { a, b, c: b.to_string() }
        }
    }

    #[derive(Copy, Clone)]
    enum Runs {
        Tiny = 100,
    }

    impl Runs {
        pub fn range(self) -> Range<u16> {
            let x = self as u16;
            0..x
        }

        pub fn data(self) -> impl Iterator<Item = Data> {
            let x = self as u16;
            (0..x).map(|x| Data::new(x as i32))
        }
    }

    mod bench_sqlite {
        use super::*;
        use rusqlite::{Connection, Transaction};

        fn build_db() -> ResultTest<Connection> {
            let tmp_dir = TempDir::new("sqlite_test")?;
            let db = Connection::open(tmp_dir.path().join("test.db"))?;
            db.execute_batch(
                "PRAGMA journal_mode = WAL;
                PRAGMA synchronous = normal;",
            )?;

            db.execute_batch(
                "CREATE TABLE data (
                a INTEGER PRIMARY KEY,
                b BIGINT NOT NULL,
                c TEXT);",
            )?;

            Ok(db)
        }

        pub(crate) fn insert_tx_per_row(run: Runs) -> ResultTest<()> {
            let db = build_db()?; // <-- HOW TO AVOID THIS?
            for row in run.data() {
                db.execute(
                    &format!("INSERT INTO data VALUES({} ,{}, '{}');", row.a, row.b, row.c),
                    (),
                )?;
            }
            Ok(())
        }
    }

    fn bench_insert_tx_per_row(c: &mut Criterion) {
        let mut group = c.benchmark_group("insert row");
        let run = Runs::Tiny;
        group.throughput(Throughput::Elements(run as u64));

        group.bench_function(BenchmarkId::new(SQLITE, 1), |b| {
            b.iter(|| bench_sqlite::insert_tx_per_row(run))
        });
        group.bench_function(BenchmarkId::new(PG, 1), |b| {
            b.iter(|| bench_pg::insert_tx_per_row(run))
        });

        group.finish();
    }

    criterion_group!(benches, bench_insert_tx_per_row);
    criterion_main!(benches);
```
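
The thread does not contain an answer, but one possible shape (a sketch only, building on the snippet above) is to do the expensive setup before `bench_function` and borrow it from inside the timed closure, so the database is created once per benchmark rather than once per iteration. It assumes `build_db` is made reachable from the bench function and that a hypothetical `insert_rows(&Connection, Runs)` helper is split out of `insert_tx_per_row` so it works on an existing connection:

```rust
use criterion::{BenchmarkId, Criterion, Throughput};

fn bench_insert_tx_per_row(c: &mut Criterion) {
    let mut group = c.benchmark_group("insert row");
    let run = Runs::Tiny;
    group.throughput(Throughput::Elements(run as u64));

    // Build the database once, before the benchmark is registered, instead
    // of inside the code that `b.iter` times.
    let db = bench_sqlite::build_db().expect("sqlite setup failed");

    group.bench_function(BenchmarkId::new("sqlite", 1), |b| {
        // Only the inserts are timed; the same connection is reused.
        b.iter(|| bench_sqlite::insert_rows(&db, run).unwrap());
    });

    group.finish();
}
```

The trade-off is that every iteration inserts into the same database file, so the table keeps growing over the run; if that skews the numbers, `iter_batched` with a per-batch setup (for example copying a pre-built template database) is the usual alternative.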

mamcx · Nov 28 '22

I have a similar problem: I'm testing the performance of an algorithm with different input sizes, so I have a for loop creating exponentially larger input data. But I don't always run the whole benchmark; quite frequently I run it for just one specific size.

So I have something like this:

```rust
let mut g = c.benchmark_group("some expensive to setup benchmark");
for size in &[2 * MB, 4 * MB, 8 * MB, 16 * MB, ...] {
    // point (1)
    g.bench_with_input(BenchmarkId::from_parameter(size), size, |b, &_| {
        // point (2)
        b.iter_batched(
            move || {
                // point (3)
            },
            move |_input| perform_test(),
            BatchSize::SmallInput,
        );
    });
}
```

Each of the commented points has its own problem in terms of generating test data:

  1. If I generate the test data at point (1), then data for all sizes is generated even if I run the benchmark for only one specific size.
  2. At point (2), the test data is generated multiple times for each benchmark.
  3. At point (3), the test data is generated per iteration.

All of these options are slow when the test data is expensive to generate.

The solution I came up with is to use LazyCell at point (1) and then reference it at points (2) and (3) (wherever I need the data).
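
A minimal sketch of that idea (my illustration, not code from the comment): the input is wrapped in a lazily initialized value at point (1), so the expensive generation only happens the first time a benchmark that actually runs touches it, and a filtered run never pays for the sizes it skips. `generate_input` and `perform_test` are hypothetical stand-ins, and `once_cell::sync::Lazy` is used here in place of `LazyCell` (the standard library's `LazyLock`/`LazyCell` would work the same way on recent Rust):

```rust
use criterion::{criterion_group, criterion_main, BatchSize, BenchmarkId, Criterion};
use once_cell::sync::Lazy;

const MB: usize = 1024 * 1024;

// Hypothetical expensive input generation.
fn generate_input(size: usize) -> Vec<u8> {
    vec![0u8; size]
}

// Hypothetical algorithm under test.
fn perform_test(input: &[u8]) -> usize {
    input.iter().map(|&b| b as usize).sum()
}

fn bench(c: &mut Criterion) {
    let mut g = c.benchmark_group("some expensive to setup benchmark");
    for &size in &[2 * MB, 4 * MB, 8 * MB, 16 * MB] {
        // Point (1): wrap the input lazily; nothing is generated yet.
        let input = Lazy::new(move || generate_input(size));
        g.bench_with_input(BenchmarkId::from_parameter(size), &size, |b, _| {
            // Point (2): the first access forces generation; later passes
            // for this size reuse the already generated data.
            let data: &[u8] = &input;
            b.iter_batched(|| data, |d| perform_test(d), BatchSize::SmallInput);
        });
    }
    g.finish();
}

criterion_group!(benches, bench);
criterion_main!(benches);
```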

bazhenov · May 04 '23