
Benchmarks

Open scsmithr opened this issue 2 years ago • 5 comments

Description

Add/refactor benchmarks. We currently have a suite of benchmarks in bench_runner, but these might not have the features we need/want to meet our goals (below).

Most benchmarks should be run in client/server mode to avoid hybrid exec muddying up the numbers, but we should also have something that tells us what the overhead is when running with hybrid exec.

Goals

  • Numbers to showcase performance on GlareDB Cloud (and cost). TPC-H sf=10 and sf=100 are good starting points. If feasible, I'd like to see if we could also get sf=1000.
  • Have a baseline metric to help us understand what distributed exec gets us. Does a 4-node compute engine translate to a 4x perf increase?
  • Understand how performance changes over time. E.g. what's the performance difference between two releases.
    • It might make sense to have Cloud run a suite of benchmarks against a deployment on every merge to main.

scsmithr • Sep 18 '23 14:09

just adding a few perf observations here.

using the https://datasets.clickhouse.com/hits_compatible/hits.parquet dataset.

DataFusion takes only ~74ms while we take ~300ms.

> timeit {datafusion-cli -q -c "select count(*) from './hits.parquet'"}

74ms 955µs 166ns

> timeit {./target/release/glaredb -q "select count(*) from './hits.parquet'"}

296ms 460µs

I can't imagine we have so much overhead over DataFusion that it would warrant such a drastic performance hit.

This one just never finishes on datafusion-cli but takes ~45sec locally. Polars takes only 100-500ms (76-400x faster), though Polars is likely only perceived as faster because of some implicit limits.

> timeit {./target/release/glaredb -q "select * from parquet_scan('./hits.parquet');" }
45sec 133ms 464µs
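
For anyone wanting to reproduce the count(*) comparison above without nushell's timeit, a small Python script that shells out to both CLIs works too (the flags are the same as in the commands above; like timeit, this measures end-to-end wall time including process startup):

  # Time the same count(*) query through both CLIs. Paths assume the repo root
  # with ./hits.parquet present and a release build of glaredb.
  import subprocess
  import time

  QUERY = "select count(*) from './hits.parquet'"

  COMMANDS = {
      "datafusion-cli": ["datafusion-cli", "-q", "-c", QUERY],
      "glaredb": ["./target/release/glaredb", "-q", QUERY],
  }

  for name, cmd in COMMANDS.items():
      start = time.perf_counter()
      subprocess.run(cmd, check=True, capture_output=True)
      elapsed_ms = (time.perf_counter() - start) * 1000
      print(f"{name}: {elapsed_ms:.1f} ms")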

universalmind303 • Sep 19 '23 18:09

Here are some of my thoughts:

  1. We should run benchmarks using a "local" instance (we can use the Python bindings here; a rough sketch follows this list). We can optionally connect that same instance to cloud compute to calculate/analyze the cost of hybrid execution. The reasoning: this quickly eliminates any Postgres and network overhead from the measurements and gives us metrics that are comparable to, say, DuckDB.
    • We can run local benchmarks against each commit (on main) in CI itself (using a dedicated machine). For each release we can have the benchmarks run against a cloud instance (hybrid and server) for real-world numbers.
  2. We can modify the number of runtime threads (tokio) to get a feel for what distributed execution might get us. It's not an accurate measure of how big a performance boost we'll get, but it does give us some insight. For example, we can run the benchmarks with 1 vs. 2 threads to analyze this. Still thinking about this, though.
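
To make (1) a bit more concrete, here is a minimal sketch of a local run through the Python bindings. The connect()/sql()/to_arrow() call shapes are assumptions based on a DuckDB-style interface, so exact method names may differ:

  # Minimal local-runner sketch. Assumption: the glaredb Python bindings expose
  # connect() and a sql() method returning a result we can materialize; the
  # to_arrow() call used to force execution is part of that assumption.
  import time

  import glaredb

  QUERIES = {
      "q_count": "select count(*) from './hits.parquet'",
      # ... the TPC-H queries would go here ...
  }

  con = glaredb.connect()  # in-process local instance, no pgsrv or network in the path
  for name, query in QUERIES.items():
      start = time.perf_counter()
      con.sql(query).to_arrow()  # materialize so the full query actually runs
      print(f"{name}: {(time.perf_counter() - start) * 1000:.1f} ms")

Since nothing crosses the Postgres protocol or the network on this path, the numbers should be directly comparable to an in-process DuckDB run, which is the point of (1).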

vrongmeal • Sep 22 '23 14:09

Benchmarking Framework

Goals

  • TPC-H benchmarks: We want to run the TPC-H queries (and more) against a given dataset (scale factors, etc).
  • Different runners: We want to be able to run the benchmarks against the following (a rough runner sketch follows this list):
    1. Local instance
    2. Postgres server (pgsrv)
    3. RPC server (rpcsrv)
  • Checkpoints: We want to be able to checkpoint results (manually) so we can compare runs and track changes in performance.
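
A rough sketch of what the runner abstraction could look like in Python. The connection details are placeholders; the local backend assumes the bindings' connect()/sql() interface, the pgsrv backend uses a stock Postgres client (pgsrv speaks the Postgres protocol), and the rpcsrv backend is left as a stub:

  # Runner abstraction sketch: one interface, multiple backends. Connection
  # details (DSNs, method names) are placeholders, not final values.
  import time
  from typing import Protocol

  class Runner(Protocol):
      def run(self, query: str) -> float:
          """Execute a query and return elapsed seconds."""
          ...

  class LocalRunner:
      def __init__(self):
          import glaredb  # assumption: bindings expose connect()/sql()
          self.con = glaredb.connect()

      def run(self, query: str) -> float:
          start = time.perf_counter()
          self.con.sql(query).to_arrow()  # materialize the result
          return time.perf_counter() - start

  class PgsrvRunner:
      def __init__(self, dsn: str):
          import psycopg2  # any Postgres client should work against pgsrv
          self.con = psycopg2.connect(dsn)

      def run(self, query: str) -> float:
          start = time.perf_counter()
          with self.con.cursor() as cur:
              cur.execute(query)
              cur.fetchall()
          return time.perf_counter() - start

  # An RpcsrvRunner would wrap whatever client rpcsrv ends up exposing.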

Plan

  • We’ll have a script to download pre-generated datasets (much simpler to already have the required data available in GCS buckets).
  • We need different runner implementations.
    • The one with local instance can be run on each PR commit.
    • On each merge in main we can run the benchmarks against the cloud server.
  • We can directly use the runner to save/update benchmark data locally or on cloud (thanks to COPY TO):
    • Need a defined schema for storing results
    • We can use GlareDB itself to analyse/diff benchmarks from previous checkpoints (see the sketch below)
    • Two checkpoints shall always exist: previous and current.
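
A rough sketch of that checkpoint flow. The results schema, GCS paths, and the exact COPY TO and execute()/sql() call shapes below are illustrative assumptions, not a final design:

  # Checkpoint sketch. Assumes a `results` table (query_name, elapsed_ms,
  # commit_sha) has already been populated by the runner for this run, and
  # that GCS credentials are configured; bucket paths are placeholders.
  import glaredb

  con = glaredb.connect()

  # Persist the current checkpoint.
  con.execute(
      "copy (select * from results) to 'gs://bench-bucket/checkpoints/current.parquet'"
  )

  # Diff against the previous checkpoint using GlareDB itself.
  con.sql("""
      select cur.query_name,
             prev.elapsed_ms as previous_ms,
             cur.elapsed_ms  as current_ms,
             cur.elapsed_ms - prev.elapsed_ms as delta_ms
      from parquet_scan('gs://bench-bucket/checkpoints/current.parquet') cur
      join parquet_scan('gs://bench-bucket/checkpoints/previous.parquet') prev
        on prev.query_name = cur.query_name
  """).show()

Keeping just a previous and a current checkpoint keeps the diff query trivial.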

vrongmeal • Sep 26 '23 11:09

Moving from 0.6.0 https://github.com/GlareDB/glaredb/milestone/25

greyscaled • Oct 25 '23 13:10

Current state: benchmarks fail in CI despite running locally (though they are flaky).

For running at a larger scale we'd need a custom runner. @universalmind303, can you take a look this week and see if there's something we can do to increase reliability / decrease the failure rate? [Timebox: make sure there's nothing obvious before increasing the runner size.]

tychoish • Oct 30 '23 14:10