bug - Regression in `.sample()`
What happened?
It appears that ibis.sample is returning the whole dataframe and ignoring the fraction parameter.
I've also tested with a to_parquet('test_data.parquet') call and that saves a full copy of the dataset unsampled.
ipython test:
In [4]: ibis.__version__
Out[4]: '9.5.0'
In [5]: df = con.read_parquet("s3://sdm-threat-mlflow/data_more.parquet")
In [6]: df.count().execute()
Out[6]: 53214
In [7]: df.sample(0.1).count().execute()
Out[7]: 53214
What version of ibis are you using?
9.5.0
What backend(s) are you using, if any?
Duckdb with s3fs filesystem.
Environment is setup with this pre-script (env vars store S3 variables):
import s3fs
import ibis
import numpy as np
import pandas as pd
fs = s3fs.S3FileSystem(anon=False)
con = ibis.duckdb.connect(":memory:")
con.register_filesystem(fs)
print("available variables: ")
print("`fs` - S3FS Object initialized from environmental variables")
print("`con` - Ibis duckdb connection with s3fs initialized")
Relevant log output
Command returns with no errors or log output
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Cant reproduce doing the following. Do you notice anything you did differently?
if __name__ == "__main__":
credentials = config.DevConfig.get_s3_credentials()
fs = fsspec.filesystem(
"s3",
key=credentials.access_key,
secret=credentials.secret_access_key,
)
con = ibis.duckdb.connect()
con.register_filesystem(fs)
df = con.read_parquet("s3://my-bucket/parquet-files/some.parquet")
count = df.count().execute()
sampled_count = df.sample(0.1).count().execute()
print(f"{count=}")
print(f"{sampled_count=}")
con.disconnect()
Outputs:
Note: Sample count varies between executions as ibis includes each row with a probability of 'fraction'. https://ibis-project.org/reference/expression-tables.html#ibis.expr.types.relations.Table.sample
count=1000
sampled_count=115
fyi: also works using s3fs instead of fsspec directly
fs = s3fs.S3FileSystem(anon=False, key=credentials.access_key, secret=credentials.secret_access_key)
I had the fsspec variables set as environmental variables. Also i may have been using a a minio bucket. but nothing else seems different. Also, the file was still accessable so not sure that this has anything to do with the issue.
I'll test again today in a fresh install and see if I can reproduce again.
Can you run the query without Ibis? I.e., using only the duckdb module (and fsspec)?
That way we can rule out Ibis here. Ibis doesn't do anything with creds/auth/fsspec except expose the underlying duckdb API.
Something really strange going on duckdb version was held constant through testing:
- When
Ibis-framework=9.3.0is installed there is no problem sampling with ibis. it just works - When
Ibis-framework=9.5.0is installed.sample(0.1)will return all rows. - Fractional sampling seems to not work in duckdb python for either scenario. It returns all or nothing, and the behavior of flips between all or nothing randomly.
In [31]: import duckdb
...: print(f"duckdb version: {duckdb.__version__}")
...: import fsspec
...: print(f"fsspec version: {fsspec.__version__}")
...: fs = fsspec.filesystem('s3')
...: duckdb.register_filesystem(fs)
...:
...: s3_path = 's3://sdm-threat-mlflow/data_more.parquet'
duckdb version: 1.1.2
fsspec version: 2024.6.1
In [32]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "')")
Out[32]:
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 137 │
└──────────────┘
In [33]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 10%").fetchone()
Out[33]: (0,)
In [34]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 20%").fetchone()
Out[34]: (137,)
In [35]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 30%").fetchone()
Out[35]: (0,)
In [36]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 40%").fetchone()
Out[36]: (0,)
In [37]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 50%").fetchone()
Out[37]: (137,)
In [38]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 60%").fetchone()
Out[38]: (137,)
In [39]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 70%").fetchone()
Out[39]: (0,)
In [40]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 80%").fetchone()
Out[40]: (0,)
In [41]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 90%").fetchone()
Out[41]: (137,)
In [42]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 100%").fetchone()
Out[42]: (137,)
Note that the broken sampling behavior is broken in 9.4.0 as well
This is reproducible with just the duckdb CLI in a public bucket:
D select version();
┌─────────────┐
│ "version"() │
│ varchar │
├─────────────┤
│ v1.1.2 │
└─────────────┘
D select count(*) from read_csv('https://storage.googleapis.com/ibis-tutorial-data/penguins.csv') using sample 10%;
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 344 │
└──────────────┘
D select count(*) from read_csv('https://storage.googleapis.com/ibis-tutorial-data/penguins.csv') using sample 10%;
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 0 │
└──────────────┘
I tried with both an older duckdb and older ibis and am still seeing the all or none behavior.
By default DuckDB uses system sampling, which means samples are drawn for every batch of 2048 rows (for performance reasons). If you want to sample per-row you can use bernoulli sampling instead:
D select count(*) from read_csv('https://storage.googleapis.com/ibis-tutorial-data/penguins.csv') using sample 10% (bernoulli);
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 34 │
└──────────────┘
We should probably make this more clear in the documentation.
@Mytherin Much appreciated!
@rapatel0 I'll investigate this, since the default behavior in Ibis should be bernoulli.
I can't reproduce this on main, and we're compiling in bernoulli to the SQL:
In [1]: from ibis.interactive import *
In [2]: t = ibis.read_csv("https://storage.googleapis.com/ibis-tutorial-data/penguins.csv")
In [3]: ibis.to_sql(t.sample(0.1))
Out[3]:
SELECT
*
FROM "ibis_read_csv_34ezhrclbfezfovsmtehio2hii" AS "t0" TABLESAMPLE bernoulli (10.0 PERCENT)
In [4]: t.sample(0.1).count()
Out[4]:
┌────┐
│ 28 │
└────┘
In [5]: t.sample(0.1).count()
Out[5]:
┌────┐
│ 31 │
└────┘
Closing.