pg_analytics
pg_analytics copied to clipboard
DuckDB-powered data lake analytics from Postgres
Important Notice
pg_analytics has been succeeded by pg_lakehouse. Whereas pg_analytics is designed to query over data stored within Postgres, pg_lakehouse queries over external data lakes like S3 and table formats like Iceberg, storing no data inside Postgres. Because we are strong believers of separation of storage and compute and have seen greater demand for pg_lakehouse, we're choosing to focus our development efforts on pg_lakehouse for the next few months.
As a result, pg_analytics has been moved out of the main ParadeDB repo. It is possible that development on pg_analytics will resume in the future.
Overview
pg_analytics is an extension that accelerates analytical query processing inside Postgres. The performance of analytical queries that leverage pg_analytics is comparable to the performance of dedicated OLAP databases — without the need to extract, transform, and load (ETL) the data from your Postgres instance into another system. The purpose of pg_analytics is to be a drop-in solution for fast analytics in Postgres with zero ETL.
The primary dependencies are:
- [x] Apache Arrow for column-oriented memory format
- [x] Apache DataFusion for vectorized query execution with SIMD
- [x] Apache Parquet for persistence
- [x] Delta Lake as a storage framework with ACID properties
- [x] pgrx, the framework for creating Postgres extensions in Rust
How It Works
These libraries are the building blocks of many modern analytical databases and enable column-oriented storage, efficient data compression, and vectorized query execution within Postgres. Please see our blog post for a deep dive into how it works.
Benchmarks
With pg_analytics installed, ParadeDB is the fastest Postgres-based analytical database and outperforms many specialized OLAP systems. On Clickbench, ParadeDB is 94x faster than regular Postgres, 8x faster than Elasticsearch, and almost ties Clickhouse.
For an apples-to-apples comparison, these benchmarks were run on a c6a.4xlarge with 500GB storage. None of the databases were tuned. The (Parquet, single) Clickhouse variant was selected because it most closely matches ParadeDB's Parquet storage.
You can view the ParadeDB ClickBench results against other Postgres-compatible OLAP databases here.
Getting Started
This toy example demonstrates how to get started.
CREATE EXTENSION pg_analytics;
-- Create a parquet table
CREATE TABLE t (a int) USING parquet;
-- pg_analytics supercharges the performance of any
-- Postgres query run on a parquet table
INSERT INTO t VALUES (1), (2), (3);
SELECT COUNT(*) FROM t;
Parquet Tables
You can interact with parquet tables the same way as with normal Postgres tables. However, there are a few exceptions.
Append Only
Because column-oriented storage formats are not designed for fast updates, parquet tables are append only.
This means that parquet tables do not support UPDATE and DELETE clauses. Data that is frequently updated
should be stored in regular Postgres heap tables.
Storage Optimization
The VACUUM FULL <table_name> command is used to optimize a table's storage by bin-packing small Parquet
files into larger files, which can significantly improve query time and compression.
Roadmap
pg_analytics is currently in beta.
Known Limitations
As pg_analytics becomes production-ready, many of these will be resolved.
- [ ]
UPDATEandDELETE - [ ] Some Postgres types, notably
jsonandtimestamptz - [ ] Write-ahead log (WAL) support
- [ ] Collations
- [ ]
INSERT ... ON CONFLICTclauses
Development
Install Rust
To develop the extension, first install Rust v1.76.0 using rustup. We will soon make the extension compatible with newer versions of Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup install 1.76.0
# We recommend setting the default version to 1.76.0 for consistency across your system
rustup default 1.76.0
Note: While it is possible to install Rust via your package manager, we recommend using rustup as we've observed inconcistencies with Homebrew's Rust installation on macOS.
Then, install the PostgreSQL version of your choice using your system package manager. Here we provide the commands for the default PostgreSQL version used by this project:
Install Postgres
# macOS
brew install postgresql@16
# Ubuntu
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
sudo apt-get update && sudo apt-get install -y postgresql-16 postgresql-server-dev-16
If you are using Postgres.app to manage your macOS PostgreSQL, you'll need to add the pg_config binary to your path before continuing:
export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin"
Install pgrx
Then, install and initialize pgrx:
# Note: Replace --pg16 with your version of Postgres, if different (i.e. --pg15, --pg14, etc.)
cargo install --locked cargo-pgrx --version 0.12.0-alpha.1
# macOS arm64
cargo pgrx init --pg16=/opt/homebrew/opt/postgresql@16/bin/pg_config
# macOS amd64
cargo pgrx init --pg16=/usr/local/opt/postgresql@16/bin/pg_config
# Ubuntu
cargo pgrx init --pg16=/usr/lib/postgresql/16/bin/pg_config
If you prefer to use a different version of Postgres, update the --pg flag accordingly.
Note: While it is possible to develop using pgrx's own Postgres installation(s), via cargo pgrx init without specifying a pg_config path, we recommend using your system package manager's Postgres as we've observed inconsistent behaviours when using pgrx's.
Configure Shared Preload Libraries
This extension uses Postgres hooks to intercept Postgres queries. In order to enable these hooks, the extension
must be added to shared_preload_libraries inside postgresql.conf. If you are using Postgres 16, this file can be found under ~/.pgrx/data-16.
# Inside postgresql.conf
shared_preload_libraries = 'pg_analytics'
Run Without Optimized Build
The extension can be developed with or without an optimized build. An optimized build improves query times by 10-20x but also significantly increases build times.
To launch the extension without an optimized build, run
cargo pgrx run
Run With Optimized Build
First, switch to latest Rust Nightly (as of writing, 1.77) via:
rustup update nightly
rustup override set nightly
Then, reinstall pgrx for the new version of Rust:
cargo install --locked cargo-pgrx --version 0.12.0-alpha.1 --force
Finally, run to build in release mode with SIMD:
cargo pgrx run --release
Note that this may take several minutes to execute.
To revert back to the stable version of Rust, run:
rustup override unset
Run Benchmarks
We support two types of benchmarks: ClickBench and TPC-H. ClickBench is ideal for testing analytical queries, while TPC-H is ideal for testing JOINs. To run the benchmarks, cd into the benchmarks/clickbench or benchmarks/tpch directory and run ./benchmark.sh -t <flag>. The -t flag is the version to benchmark, either x.y.z or latest to pull a version from DockerHub, or local to build the Docker image locally.