The Switzerland of Iceberg queries: neutral, easy entry across S3, R2, MinIO
Cloudfloe
Query your Apache Iceberg data lake in seconds. No clusters. No ops. Just SQL.

The Problem
You have data in Apache Iceberg. You just want to query it correctly.
But here's what you face:
- Trino/Presto: Heavy clusters, complex setup, operational overhead
- AWS Athena: Vendor lock-in, slower iteration, costs add up
- Local DuckDB: Works great solo, painful to share and collaborate
- Spark: Overkill for exploratory queries, slow startup
- Direct Parquet reads: Fast but dangerous; bypasses Iceberg metadata and can return deleted rows
You don't need a hammer when you need a magnifying glass.
The Solution: Cloudfloe
Cloudfloe is DuckDB-as-a-service for Apache Iceberg data lakes.
What it does:
- Reads Iceberg correctly: uses the metadata layer and validates snapshots
- Instant queries on S3, R2, or MinIO: no data movement
- Browser-based SQL editor: no CLI, no local setup
- Zero lock-in: you own the data, we just query it
- Sub-second startup: no cluster spin-up time
What it doesn't do:
- Store your data (you keep it where it is)
- Require infrastructure changes (just S3 credentials)
- Support write operations (read-only by design)
- Handle tables with row-level deletes (append-only Iceberg v1/v2 only)
Think of it as: A collaborative, web-based scratchpad for your Iceberg data lake.
Current Features
| Feature | Description |
|---|---|
| Iceberg Native | Reads via iceberg_scan(), respecting Iceberg metadata and snapshots |
| Table Validation | Auto-detects row-level deletes and rejects unsafe tables |
| Multi-Cloud Support | AWS S3, Cloudflare R2, MinIO: any S3-compatible storage |
| Web SQL Editor | Syntax highlighting, query history, sample queries |
| Instant Results | DuckDB 1.4.1 engine, no cluster warmup, sub-second queries |
| Read-Only by Design | No destructive operations: query, don't mutate |
| Docker Ready | One command to run locally, no complex setup |
Quick Start
Prerequisites
- Docker & Docker Compose
- S3-compatible storage with an Iceberg table (or use our demo data)
1. Start Cloudfloe
git clone https://github.com/gordonmurray/cloudfloe
cd cloudfloe
docker compose up --build
Wait ~30 seconds for initialization, then open http://localhost:3000
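To confirm both containers are up before opening the browser, you can check that the two published ports respond. The /docs path below is FastAPI's auto-generated API reference and is assumed (not confirmed) to be enabled in this app:
# Frontend (Nginx) on port 3000
curl -sI http://localhost:3000 | head -n 1
# Backend (FastAPI) on port 8000
curl -sI http://localhost:8000/docs | head -n 1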
2. Connect to Your Iceberg Table
In the UI Connection panel, enter:
AWS S3 Example:
Storage Type: AWS S3
AWS S3 Endpoint: s3.amazonaws.com (or leave blank for default)
Table Path: s3://your-bucket/warehouse/db/table_name
Access Key: AKIAIOSFODNN7EXAMPLE
Secret Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Region: us-east-1
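For Cloudflare R2 or MinIO the same fields apply; only the endpoint and region change. The values below are illustrative (exact field labels may differ slightly in the UI), so substitute your own account ID, host, and credentials:
Cloudflare R2 Example:
Storage Type: Cloudflare R2
Endpoint: <account_id>.r2.cloudflarestorage.com
Table Path: s3://your-bucket/warehouse/db/table_name
Access Key / Secret Key: your R2 API token credentials
Region: auto
MinIO Example:
Storage Type: MinIO
Endpoint: localhost:9000 (or your MinIO host)
Table Path: s3://your-bucket/warehouse/db/table_name
Access Key / Secret Key: your MinIO credentials
Region: us-east-1 (MinIO accepts any region value)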
Important Notes:
- Table Path should point to the Iceberg table root (the directory that contains the /metadata folder)
- Do NOT include /metadata in the path; Cloudfloe adds it automatically
- Trailing slashes are removed automatically
Click "Test Connection". If successful, a sample query will appear in the editor.
3. Run Your First Query
After connection succeeds, a query like this will be auto-loaded:
SELECT * FROM iceberg_scan('s3://your-bucket/warehouse/db/table_name') LIMIT 10;
Just click "Run Query" to see your data!
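Prefer the API? The FastAPI backend on port 8000 serves the same queries over HTTP. The route and payload below are hypothetical, shown only to illustrate the shape of a request; check http://localhost:8000/docs for the actual routes and schema:
# NOTE: /api/query is a hypothetical endpoint name, not a documented route
curl -s http://localhost:8000/api/query \
  -H 'Content-Type: application/json' \
  -d '{"sql": "SELECT 1"}'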
Querying Iceberg Tables
Basic Query
SELECT * FROM iceberg_scan('s3://bucket/warehouse/db/table_name') LIMIT 100;
With Filters
SELECT user_id, event_type, timestamp
FROM iceberg_scan('s3://bucket/warehouse/events/user_events')
WHERE event_type = 'purchase'
AND timestamp > '2024-01-01'
ORDER BY timestamp DESC;
Aggregations
SELECT
date_trunc('day', timestamp) as day,
COUNT(*) as event_count,
COUNT(DISTINCT user_id) as unique_users
FROM iceberg_scan('s3://bucket/warehouse/events/user_events')
GROUP BY day
ORDER BY day DESC;
Inspect Iceberg Metadata
-- View table snapshots
SELECT * FROM iceberg_snapshots('s3://bucket/warehouse/db/table_name');
-- View table metadata (manifests, partitions, etc)
SELECT * FROM iceberg_metadata('s3://bucket/warehouse/db/table_name');
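Time Travel
Time travel via snapshots is supported (see Limitations below): take a snapshot_id (or its timestamp) from iceberg_snapshots() and pass it to iceberg_scan() using the snapshot parameters provided by DuckDB's iceberg extension. The ID and timestamp below are placeholders:
-- Query the table as of a specific snapshot
SELECT * FROM iceberg_scan(
  's3://bucket/warehouse/db/table_name',
  snapshot_from_id => 1234567890123456789
) LIMIT 100;
-- Or as of a point in time
SELECT * FROM iceberg_scan(
  's3://bucket/warehouse/db/table_name',
  snapshot_from_timestamp => TIMESTAMP '2024-06-01 00:00:00'
) LIMIT 100;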
Setting Up S3 Access
IAM Policy for AWS S3
Your AWS credentials need these permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-bucket-name",
"arn:aws:s3:::your-bucket-name/*"
]
}
]
}
Minimum required:
- s3:ListBucket: to list files in the /metadata directory
- s3:GetObject: to read metadata and data files
Testing Access
Before using Cloudfloe, verify access with AWS CLI:
# Test 1: Can you list the metadata directory?
aws s3 ls s3://your-bucket/warehouse/db/table_name/metadata/
# Test 2: Can you read the version hint?
aws s3 cp s3://your-bucket/warehouse/db/table_name/metadata/version-hint.text -
If these work, Cloudfloe will work too.
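For R2 or MinIO, the same checks work by pointing the AWS CLI at your custom endpoint (the URLs below are placeholders):
# Cloudflare R2
aws s3 ls --endpoint-url https://<account_id>.r2.cloudflarestorage.com s3://your-bucket/warehouse/db/table_name/metadata/
# MinIO
aws s3 ls --endpoint-url http://localhost:9000 s3://your-bucket/warehouse/db/table_name/metadata/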
Important Limitations
Supported:
- Iceberg v1 and v2 table formats
- Append-only tables (no deletes)
- Parquet data files
- Time travel queries (via snapshots)
- Partition pruning (DuckDB handles this)
Not Yet Supported:
- Row-level deletes: tables with position or equality deletes are rejected
- Write operations: read-only for now
- REST Catalog: direct S3 path access only
- Schema evolution: reads the current schema; complex migrations are not handled
If your table has deletes, you'll see:
Error: Table contains row-level deletes which are not supported.
This application only supports append-only Iceberg v1/v2 tables.
Solution: Compact your table first so the deletes are folded into the data files. For example (adjust catalog and table names for your setup):
-- In Spark:
CALL catalog_name.system.rewrite_data_files('db.table_name');
-- In Trino:
ALTER TABLE db.table_name EXECUTE optimize;
Architecture
+------------------+
|     Frontend     |  <- Nginx + HTML/CSS/JS
|   (Port 3000)    |     CodeMirror SQL Editor
+--------+---------+
         |
         | HTTP
+--------+---------+
|     Backend      |  <- FastAPI + Python
|   (Port 8000)    |     DuckDB 1.4.1 + Iceberg Extension
+--------+---------+
         |
         | S3 API
+--------+---------+
|    S3 Storage    |  <- AWS S3 / R2 / MinIO
|                  |     Iceberg table (metadata + data)
+------------------+
Key Components:
- DuckDB 1.4.1: Query engine with native Iceberg support
- Iceberg Extension: Reads Iceberg metadata and manifests
- FastAPI: REST API for query execution and connection testing
- HTTPFS Extension: S3-compatible storage access
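To make the flow concrete, here is a minimal sketch of how a backend along these lines can wire DuckDB to S3-compatible storage, using the httpfs and iceberg extensions and DuckDB's CREATE SECRET for credentials. It is illustrative only, not Cloudfloe's actual backend code, and the credential values are placeholders:
import duckdb

# Fresh in-memory connection per query session
con = duckdb.connect()

# Extensions for S3 access and Iceberg reads
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("INSTALL iceberg; LOAD iceberg;")

# Register S3-compatible credentials (placeholders);
# for R2/MinIO also set ENDPOINT and URL_STYLE 'path'
con.execute("""
    CREATE SECRET s3_creds (
        TYPE S3,
        KEY_ID 'AKIAIOSFODNN7EXAMPLE',
        SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
        REGION 'us-east-1'
    );
""")

# Read-only Iceberg scan, the same shape of query the web editor submits
rows = con.execute(
    "SELECT * FROM iceberg_scan('s3://your-bucket/warehouse/db/table_name') LIMIT 10"
).fetchall()
print(rows)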
Local Development
Run with Docker Compose (Recommended)
docker compose up --build
Run Backend Manually
cd backend
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --reload
Backend runs on http://localhost:8000
Run Frontend Manually
cd frontend
python3 -m http.server 3000
Frontend runs on http://localhost:3000
Query Stats
After running a query, click the "Query Stats" tab to see:
- Execution Time: How long the query took (milliseconds)
- Bytes Scanned: Approximate data size processed
- Rows Returned: Number of rows in the result set
Note: Bytes scanned is a rough estimate based on returned data, not actual S3 bytes read.
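For engine-side numbers rather than the rough estimate, DuckDB's EXPLAIN ANALYZE can be run from the editor (assuming the backend passes statements through unmodified); it returns per-operator timings for the query plan:
EXPLAIN ANALYZE
SELECT COUNT(*) FROM iceberg_scan('s3://bucket/warehouse/db/table_name');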
Contributing
Cloudfloe is in active development. Contributions welcome!
Questions?
Open an issue on GitHub for bugs or feature requests