
The Switzerland of Iceberg queries: neutral, easy entry across S3, R2, MinIO

🌊 Cloudfloe

Query your Apache Iceberg data lake in seconds. No clusters. No ops. Just SQL.

Cloudfloe Screenshot

⚡ The Problem

You have data in Apache Iceberg. You just want to query it correctly.

But here's what you face:

  • Trino/Presto: Heavy clusters, complex setup, operational overhead
  • AWS Athena: Vendor lock-in, slower iteration, costs add up
  • Local DuckDB: Works great solo, painful to share and collaborate
  • Spark: Overkill for exploratory queries, slow startup
  • Direct Parquet reads: Fast but dangerous - bypasses Iceberg metadata and can return deleted rows

You don't need a hammer when you need a magnifying glass.


💡 The Solution: Cloudfloe

Cloudfloe is DuckDB-as-a-service for Apache Iceberg data lakes.

What it does:

  • 🧊 Reads Iceberg correctly - uses the metadata layer and validates snapshots (see the sketch at the end of this section)
  • 🚀 Instant queries on S3, R2, or MinIO - no data movement
  • 🌐 Browser-based SQL editor - no CLI, no local setup
  • 🔓 Zero lock-in - you own the data, we just query it
  • ⚡ Sub-second startup - no cluster spin-up time

What it doesn't do:

  • โŒ Store your data (you keep it where it is)
  • โŒ Require infrastructure changes (just S3 credentials)
  • โŒ Support write operations (read-only by design)
  • โŒ Handle tables with row-level deletes (append-only Iceberg v1/v2 only)

Think of it as: A collaborative, web-based scratchpad for your Iceberg data lake.
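
To make the "reads Iceberg correctly" point concrete, here is a rough sketch of the difference (bucket and table paths are placeholders): a raw Parquet glob reads every data file it finds, including files that a later Iceberg snapshot has deleted or replaced, while iceberg_scan() resolves the table's current snapshot from /metadata first.

-- Unsafe: globbing data files directly ignores Iceberg metadata
SELECT count(*) FROM read_parquet('s3://your-bucket/warehouse/db/table_name/data/**/*.parquet');

-- Safe: resolve the current snapshot from /metadata, read only live files
SELECT count(*) FROM iceberg_scan('s3://your-bucket/warehouse/db/table_name');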


✨ Current Features

  • 🧊 Iceberg Native Reads - via iceberg_scan(); respects Iceberg metadata and snapshots
  • ✅ Table Validation - auto-detects row-level deletes and rejects unsafe tables
  • 🔌 Multi-Cloud Support - AWS S3, Cloudflare R2, MinIO; any S3-compatible storage
  • 🖥️ Web SQL Editor - syntax highlighting, query history, sample queries
  • 📊 Instant Results - DuckDB 1.4.1 engine, no cluster warmup, sub-second queries
  • 🔒 Read-Only by Design - no destructive operations; query, don't mutate
  • 🐳 Docker Ready - one command to run locally, no complex setup

🚀 Quick Start

Prerequisites

  • Docker & Docker Compose
  • S3-compatible storage with Iceberg table (or use our demo data)

1. Start Cloudfloe

git clone https://github.com/gordonmurray/cloudfloe
cd cloudfloe
docker compose up --build

Wait ~30 seconds for initialization, then open http://localhost:3000

2. Connect to Your Iceberg Table

In the UI Connection panel, enter:

AWS S3 Example:

Storage Type: AWS S3
AWS S3 Endpoint: s3.amazonaws.com (or leave blank for default)
Table Path: s3://your-bucket/warehouse/db/table_name
Access Key: AKIAIOSFODNN7EXAMPLE
Secret Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Region: us-east-1
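
MinIO / R2 Example (the endpoint values and credentials below are illustrative defaults, not Cloudfloe requirements; field labels may differ slightly from the AWS form):

Storage Type: MinIO
Endpoint: localhost:9000 (your MinIO host and port)
Table Path: s3://warehouse/db/table_name
Access Key: minioadmin
Secret Key: minioadmin
Region: us-east-1

For Cloudflare R2, the endpoint is <account_id>.r2.cloudflarestorage.com and the access/secret key pair comes from an R2 API token.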

Important Notes:

  • Table Path should point to the Iceberg table root (where the /metadata folder is located)
  • Do NOT include /metadata in the path - Cloudfloe adds it automatically
  • Trailing slashes are automatically removed

Click "Test Connection" โ€” if successful, a sample query will appear in the editor.

3. Run Your First Query

After connection succeeds, a query like this will be auto-loaded:

SELECT * FROM iceberg_scan('s3://your-bucket/warehouse/db/table_name') LIMIT 10;

Just click "Run Query" to see your data!


📖 Querying Iceberg Tables

Basic Query

SELECT * FROM iceberg_scan('s3://bucket/warehouse/db/table_name') LIMIT 100;

With Filters

SELECT user_id, event_type, timestamp
FROM iceberg_scan('s3://bucket/warehouse/events/user_events')
WHERE event_type = 'purchase'
  AND timestamp > '2024-01-01'
ORDER BY timestamp DESC;

Aggregations

SELECT
    date_trunc('day', timestamp) as day,
    COUNT(*) as event_count,
    COUNT(DISTINCT user_id) as unique_users
FROM iceberg_scan('s3://bucket/warehouse/events/user_events')
GROUP BY day
ORDER BY day DESC;
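
Joins

An iceberg_scan() call behaves like any other table expression in DuckDB, so joins, CTEs, and window functions work as usual. The table paths and columns below are illustrative.

SELECT o.order_id, o.amount, c.country
FROM iceberg_scan('s3://bucket/warehouse/sales/orders') o
JOIN iceberg_scan('s3://bucket/warehouse/sales/customers') c
  ON o.customer_id = c.customer_id
WHERE o.amount > 100
ORDER BY o.amount DESC
LIMIT 20;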

Inspect Iceberg Metadata

-- View table snapshots
SELECT * FROM iceberg_snapshots('s3://bucket/warehouse/db/table_name');

-- View table metadata (manifests, partitions, etc)
SELECT * FROM iceberg_metadata('s3://bucket/warehouse/db/table_name');
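
Time Travel

DuckDB's iceberg extension also accepts snapshot-selection parameters on iceberg_scan() (snapshot_from_id and snapshot_from_timestamp); availability can depend on the extension version you are running, so treat this as a sketch. Snapshot IDs come from the iceberg_snapshots() query above.

-- Query the table as of a specific snapshot (the ID is a placeholder)
SELECT count(*) FROM iceberg_scan('s3://bucket/warehouse/db/table_name', snapshot_from_id => 1234567890123456789);

-- Or as of a point in time
SELECT count(*) FROM iceberg_scan('s3://bucket/warehouse/db/table_name', snapshot_from_timestamp => TIMESTAMP '2024-06-01 00:00:00');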

๐Ÿ” Setting Up S3 Access

IAM Policy for AWS S3

Your AWS credentials need these permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}

Minimum required:

  • s3:ListBucket - to list files in the /metadata directory
  • s3:GetObject - to read metadata and data files

Testing Access

Before using Cloudfloe, verify access with AWS CLI:

# Test 1: Can you list the metadata directory?
aws s3 ls s3://your-bucket/warehouse/db/table_name/metadata/

# Test 2: Can you read the version hint?
aws s3 cp s3://your-bucket/warehouse/db/table_name/metadata/version-hint.text -

If these work, Cloudfloe will work too.


๐Ÿ›ก๏ธ Important Limitations

✅ Supported:

  • Iceberg v1 and v2 table formats
  • Append-only tables (no deletes)
  • Parquet data files
  • Time travel queries (via snapshots)
  • Partition pruning (DuckDB handles this)

โŒ Not Yet Supported:

  • Row-level deletes - tables with position or equality deletes will be rejected
  • Write operations - read-only for now
  • REST Catalog - direct S3 path access only
  • Schema evolution - reads the current schema; doesn't handle complex migrations

If your table has deletes, you'll see:

Error: Table contains row-level deletes which are not supported.
This application only supports append-only Iceberg v1/v2 tables.

Solution: Compact your table first:

-- Spark SQL (Iceberg procedure; catalog, db, and table names are placeholders):
CALL my_catalog.system.rewrite_data_files(table => 'db.your_table');
-- Trino:
ALTER TABLE your_table EXECUTE optimize;
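
If you want to check whether a table carries delete files before connecting, iceberg_metadata() exposes a manifest content column; the column name and values below follow DuckDB's documented example output and may differ by extension version.

-- Append-only tables should return zero rows here
SELECT manifest_path, manifest_content
FROM iceberg_metadata('s3://bucket/warehouse/db/table_name')
WHERE manifest_content != 'DATA';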

๐Ÿ—๏ธ Architecture

┌─────────────────┐
│   Frontend      │  ← Nginx + HTML/CSS/JS
│  (Port 3000)    │     CodeMirror SQL Editor
└────────┬────────┘
         │
         ↓ HTTP
┌─────────────────┐
│   Backend       │  ← FastAPI + Python
│  (Port 8000)    │     DuckDB 1.4.1 + Iceberg Extension
└────────┬────────┘
         │
         ↓ S3 API
┌─────────────────┐
│   S3 Storage    │  ← AWS S3 / R2 / MinIO
│                 │     Iceberg table (metadata + data)
└─────────────────┘

Key Components:

  • DuckDB 1.4.1: Query engine with native Iceberg support
  • Iceberg Extension: Reads Iceberg metadata and manifests
  • FastAPI: REST API for query execution and connection testing
  • HTTPFS Extension: S3-compatible storage access
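
In practice, each query runs in a DuckDB session configured roughly like the following. This is a sketch of the pattern, not the actual backend code, and the credential values are placeholders.

INSTALL httpfs; LOAD httpfs;    -- S3-compatible object storage access
INSTALL iceberg; LOAD iceberg;  -- provides iceberg_scan() and related functions

SET s3_region = 'us-east-1';
SET s3_access_key_id = 'AKIA...';
SET s3_secret_access_key = '...';
-- For R2/MinIO, also point at the custom endpoint:
-- SET s3_endpoint = 'localhost:9000';
-- SET s3_url_style = 'path';

SELECT * FROM iceberg_scan('s3://bucket/warehouse/db/table_name') LIMIT 10;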

🧪 Local Development

Run with Docker Compose (Recommended)

docker compose up --build

Run Backend Manually

cd backend
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --reload

Backend runs on http://localhost:8000

Run Frontend Manually

cd frontend
python3 -m http.server 3000

Frontend runs on http://localhost:3000


📊 Query Stats

After running a query, click the "Query Stats" tab to see:

  • Execution Time: How long the query took (milliseconds)
  • Bytes Scanned: Approximate data size processed
  • Rows Returned: Number of rows in the result set

Note: Bytes scanned is a rough estimate based on returned data, not actual S3 bytes read.


๐Ÿค Contributing

Cloudfloe is in active development. Contributions welcome!

💬 Questions?

Open an issue on GitHub for bugs or feature requests