
πŸ¦– A SQL-on-everything Query Engine you can execute over multiple databases and file formats. Query your data, where it lives.

Results 266 opteryx issues

as of 0.19.1a956

~~~sql
/*28*/
SELECT
    REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k,
    AVG(length(Referer)) AS l,
    COUNT(*) AS c,
    MIN(Referer)
FROM hits
WHERE Referer <> ''
GROUP BY REGEXP_REPLACE(Referer, '^https?://(?:www\.)?([^/]+)/.*$', '\1')
HAVING COUNT(*)...
~~~

- [ ] Row estimates
- [ ] Correlated filters
- [ ] bloom filter disabled
- [ ] bloom filter on
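One of the items above is a bloom filter. A minimal sketch of one (hypothetical class, not the engine's implementation; sizes and hash scheme are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter sketch: k hash probes over an m-bit array."""

    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item: bytes):
        # Derive k independent probe positions by personalizing blake2b per probe.
        for seed in range(self.k):
            digest = hashlib.blake2b(item, person=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

A filter like this lets a scan skip chunks that definitely do not contain a sought value, at the cost of occasional false positives.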

High Priority 1️⃣

The reader statistics should be updated:

- calls = number of files/blobs/chunks read
- records_in = number of pre-filtered records
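As a sketch (the counter names `calls` and `records_in` come from the issue; the container and method are hypothetical), a reader could accumulate these like:

```python
from dataclasses import dataclass

@dataclass
class ReaderStatistics:
    calls: int = 0        # number of files/blobs/chunks read
    records_in: int = 0   # number of pre-filtered records

    def record_read(self, num_records: int) -> None:
        """Bump the counters once per file/blob/chunk read."""
        self.calls += 1
        self.records_in += num_records
```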

Yes, you can use min/max values, row counts, and metadata from Parquet row groups and Iceberg file-level statistics to estimate the distribution of a column without reading the full dataset....
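As a minimal sketch of that idea, assuming per-row-group `(min, max, num_rows)` statistics are already extracted (Parquet footers and Iceberg manifests carry these), the row count for a `column <= value` predicate can be estimated by interpolating within each row group under a uniform-distribution assumption:

```python
def estimate_matching_rows(row_groups, value):
    """Estimate rows satisfying `column <= value` from (min, max, num_rows) stats only.

    Assumes values are uniformly distributed within each row group's [min, max] range.
    """
    estimate = 0.0
    for lo, hi, rows in row_groups:
        if value >= hi:
            estimate += rows          # whole row group qualifies
        elif value < lo:
            continue                  # whole row group can be pruned
        else:
            # Partial overlap: interpolate under the uniform assumption.
            estimate += rows * (value - lo) / (hi - lo)
    return estimate
```

The uniform assumption is crude but cheap; skewed columns would need histograms or sampling for better estimates.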

~~~python
import numpy as np
import math

def estimate_selectivity(
    lower: float,
    upper: float,
    total_records: int,
    cardinality: int,
    value: float,
    filter_type: str
) -> float:
    """
    Estimates filter selectivity given column...
~~~
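The body of that function is truncated above. A completed sketch of the same signature (my assumptions: a uniform distribution over `[lower, upper]`, and only `"eq"`, `"lt"`, `"gt"` filter types) could look like:

```python
def estimate_selectivity(
    lower: float,
    upper: float,
    total_records: int,
    cardinality: int,
    value: float,
    filter_type: str,
) -> float:
    """Estimate the fraction of rows passing a filter, assuming a uniform distribution."""
    if total_records == 0 or upper <= lower:
        return 0.0
    if filter_type == "eq":
        # Equality: assume each distinct value is equally likely.
        return 1.0 / max(cardinality, 1) if lower <= value <= upper else 0.0
    if filter_type == "lt":
        return min(max((value - lower) / (upper - lower), 0.0), 1.0)
    if filter_type == "gt":
        return min(max((upper - value) / (upper - lower), 0.0), 1.0)
    raise ValueError(f"unknown filter_type: {filter_type}")
```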

There may be a way to evaluate filters in parallel without a full-blown parallel engine.
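One way to sketch that (hypothetical `batches`/`predicate` names): parallelize only the filter step over record batches with a thread pool, leaving the rest of the pipeline single-threaded:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_filter(batches, predicate, max_workers: int = 4):
    """Apply `predicate` to each batch concurrently; batch order is preserved by map()."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda batch: [row for row in batch if predicate(row)], batches))
```

For CPU-bound pure-Python predicates the GIL limits the gain, but filters that drop into native code (NumPy, Arrow compute) release the GIL and can overlap.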

Here’s an optimized Cython implementation using memcmp for byte-wise comparison instead of Python slicing. This should be even faster because it avoids unnecessary slicing operations and compares raw memory directly....
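The Cython version is not shown in full here, but the underlying idea (byte-wise comparison without allocating intermediate slices) can be approximated in pure Python with a zero-copy `memoryview` (a sketch, not the issue's implementation):

```python
def starts_with(data: bytes, prefix: bytes) -> bool:
    """Prefix test without copying: memoryview slicing is zero-copy, and
    comparing a memoryview to bytes does a memcmp-style byte-wise check."""
    if len(prefix) > len(data):
        return False
    return memoryview(data)[: len(prefix)] == prefix
```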

~~~sql
SELECT *
FROM hits
WHERE URL LIKE '%google%'
ORDER BY EventTime
LIMIT 10;
~~~

This query performs a lot slower than in other engines. I'm not sure how they would...

In initial testing, 8k was too small and created too many calls; try larger numbers.
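The trade-off can be sketched with simple arithmetic (illustrative numbers only, not measurements):

```python
import math

def read_calls(total_records: int, chunk_size: int) -> int:
    """Number of read calls needed to fetch all records in fixed-size chunks."""
    return math.ceil(total_records / chunk_size)
```

For a million records, an 8k chunk size means 123 calls, while 64k brings that down to 16; larger chunks trade per-call overhead for memory per read.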

Combine adjacent filter steps into single steps.
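A sketch of such an optimizer pass, under a hypothetical plan representation where each step is a tuple like `("filter", predicate)` or `("scan",)` (not opteryx's actual plan nodes): consecutive filters merge into one conjunctive filter.

```python
def fuse_adjacent_filters(plan):
    """Merge runs of consecutive filter steps into one AND-combined filter."""
    fused = []
    for step in plan:
        if step[0] == "filter" and fused and fused[-1][0] == "filter":
            # Combine with the previous filter: both predicates must hold.
            prev = fused.pop()
            p1, p2 = prev[1], step[1]
            fused.append(("filter", lambda row, p1=p1, p2=p2: p1(row) and p2(row)))
        else:
            fused.append(step)
    return fused
```

Fusing avoids materializing an intermediate result between the two filter steps and evaluates both predicates in a single pass over each row.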