ibis icon indicating copy to clipboard operation
ibis copied to clipboard

bug - Regression in `.sample()`

Open rapatel0 opened this issue 1 year ago • 2 comments

What happened?

It appears that ibis.sample is returning the whole dataframe and ignoring the fraction parameter.

I've also tested with a to_parquet('test_data.parquet') call and that saves a full copy of the dataset unsampled.

ipython test:

In [4]: ibis.__version__
Out[4]: '9.5.0'

In [5]: df = con.read_parquet("s3://sdm-threat-mlflow/data_more.parquet")

In [6]: df.count().execute()
Out[6]: 53214

In [7]: df.sample(0.1).count().execute()
Out[7]: 53214

What version of ibis are you using?

9.5.0

What backend(s) are you using, if any?

Duckdb with s3fs filesystem.

Environment is setup with this pre-script (env vars store S3 variables):

import s3fs
import ibis
import numpy as np
import pandas as pd

fs = s3fs.S3FileSystem(anon=False)

con = ibis.duckdb.connect(":memory:")
con.register_filesystem(fs)

print("available variables: ")
print("`fs` - S3FS Object initialized from environmental variables")
print("`con` - Ibis duckdb connection with s3fs initialized")

Relevant log output

Command returns with no errors or log output

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

rapatel0 avatar Oct 09 '24 20:10 rapatel0

Cant reproduce doing the following. Do you notice anything you did differently?

if __name__ == "__main__":
    credentials = config.DevConfig.get_s3_credentials()
    fs = fsspec.filesystem(
        "s3",
        key=credentials.access_key,
        secret=credentials.secret_access_key,
    )
    con = ibis.duckdb.connect()
    con.register_filesystem(fs)

    df = con.read_parquet("s3://my-bucket/parquet-files/some.parquet")

    count = df.count().execute()
    sampled_count = df.sample(0.1).count().execute()

    print(f"{count=}")
    print(f"{sampled_count=}")
    con.disconnect()

Outputs:

Note: Sample count varies between executions as ibis includes each row with a probability of 'fraction'. https://ibis-project.org/reference/expression-tables.html#ibis.expr.types.relations.Table.sample

count=1000
sampled_count=115

fyi: also works using s3fs instead of fsspec directly

    fs = s3fs.S3FileSystem(anon=False, key=credentials.access_key, secret=credentials.secret_access_key)

akanz1 avatar Oct 13 '24 12:10 akanz1

I had the fsspec variables set as environmental variables. Also i may have been using a a minio bucket. but nothing else seems different. Also, the file was still accessable so not sure that this has anything to do with the issue.

I'll test again today in a fresh install and see if I can reproduce again.

rapatel0 avatar Oct 19 '24 17:10 rapatel0

Can you run the query without Ibis? I.e., using only the duckdb module (and fsspec)?

That way we can rule out Ibis here. Ibis doesn't do anything with creds/auth/fsspec except expose the underlying duckdb API.

cpcloud avatar Oct 21 '24 18:10 cpcloud

Something really strange going on duckdb version was held constant through testing:

  • When Ibis-framework=9.3.0 is installed there is no problem sampling with ibis. it just works
  • When Ibis-framework=9.5.0 is installed .sample(0.1) will return all rows.
  • Fractional sampling seems to not work in duckdb python for either scenario. It returns all or nothing, and the behavior of flips between all or nothing randomly.
In [31]: import duckdb
    ...: print(f"duckdb version: {duckdb.__version__}")
    ...: import fsspec
    ...: print(f"fsspec version: {fsspec.__version__}")
    ...: fs = fsspec.filesystem('s3')
    ...: duckdb.register_filesystem(fs)
    ...: 
    ...: s3_path = 's3://sdm-threat-mlflow/data_more.parquet'
duckdb version: 1.1.2
fsspec version: 2024.6.1

In [32]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "')")
Out[32]: 
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│          137 │
└──────────────┘

In [33]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 10%").fetchone()
Out[33]: (0,)

In [34]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 20%").fetchone()
Out[34]: (137,)

In [35]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 30%").fetchone()
Out[35]: (0,)

In [36]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 40%").fetchone()
Out[36]: (0,)

In [37]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 50%").fetchone()
Out[37]: (137,)

In [38]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 60%").fetchone()
Out[38]: (137,)

In [39]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 70%").fetchone()
Out[39]: (0,)

In [40]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 80%").fetchone()
Out[40]: (0,)

In [41]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 90%").fetchone()
Out[41]: (137,)

In [42]: duckdb.sql("SELECT COUNT(*) FROM read_parquet('" + s3_path + "') USING SAMPLE 100%").fetchone()
Out[42]: (137,)

rapatel0 avatar Oct 29 '24 16:10 rapatel0

Note that the broken sampling behavior is broken in 9.4.0 as well

rapatel0 avatar Oct 29 '24 16:10 rapatel0

This is reproducible with just the duckdb CLI in a public bucket:

D select version();
┌─────────────┐
│ "version"() │
│   varchar   │
├─────────────┤
│ v1.1.2      │
└─────────────┘
D select count(*) from read_csv('https://storage.googleapis.com/ibis-tutorial-data/penguins.csv') using sample 10%;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│          344 │
└──────────────┘
D select count(*) from read_csv('https://storage.googleapis.com/ibis-tutorial-data/penguins.csv') using sample 10%;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│            0 │
└──────────────┘

cpcloud avatar Oct 30 '24 15:10 cpcloud

I tried with both an older duckdb and older ibis and am still seeing the all or none behavior.

cpcloud avatar Oct 30 '24 18:10 cpcloud

By default DuckDB uses system sampling, which means samples are drawn for every batch of 2048 rows (for performance reasons). If you want to sample per-row you can use bernoulli sampling instead:

D select count(*) from read_csv('https://storage.googleapis.com/ibis-tutorial-data/penguins.csv') using sample 10% (bernoulli);
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│           34 │
└──────────────┘

We should probably make this more clear in the documentation.

Mytherin avatar Oct 31 '24 08:10 Mytherin

@Mytherin Much appreciated!

@rapatel0 I'll investigate this, since the default behavior in Ibis should be bernoulli.

cpcloud avatar Nov 01 '24 11:11 cpcloud

I can't reproduce this on main, and we're compiling in bernoulli to the SQL:

In [1]: from ibis.interactive import *

In [2]: t = ibis.read_csv("https://storage.googleapis.com/ibis-tutorial-data/penguins.csv")

In [3]: ibis.to_sql(t.sample(0.1))
Out[3]:
SELECT
  *
FROM "ibis_read_csv_34ezhrclbfezfovsmtehio2hii" AS "t0" TABLESAMPLE bernoulli (10.0 PERCENT)

In [4]: t.sample(0.1).count()
Out[4]:
┌────┐
│ 28 │
└────┘

In [5]: t.sample(0.1).count()
Out[5]:
┌────┐
│ 31 │
└────┘

Closing.

cpcloud avatar Nov 01 '24 11:11 cpcloud