
Row-wise random sampling in LazyFrames

Open OneRaynyDay opened this issue 1 year ago • 12 comments

Row-wise random sampling

Suppose lf is a pl.LazyFrame. One popular transformation on a lazy frame is to select rows with some preset probability, parametrized by a Bernoulli distribution with parameter p. In SQL this is expressed as:

SELECT * FROM table TABLESAMPLE BERNOULLI(10)

which corresponds to a random sample with probability 10%. I would like a similar API at the row level for LazyFrames, and I imagine it looking something like:

# Sample each row with probability 0.1
lf.sample(p=0.1)

Optional arguments might include distribution="bernoulli", but I don't see any other types of TABLESAMPLE distributions out there except for SYSTEM from Trino, which is storage-aware.

OneRaynyDay avatar Jul 07 '22 23:07 OneRaynyDay

Could you clarify what this request would add on top of the existing sample method? https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.sample.html. I see two pointers, but it is unclear what is blocking you:

  1. the existing method exists on DataFrame and Expr, but not directly on LazyFrame
  2. the existing method only supports Bernoulli (but how exactly would other distributions be interpreted, you would specify a row specific mean/variance/whatever parametrization the distribution takes?)

zundertj avatar Jul 08 '22 15:07 zundertj

It would be good to highlight that, on the distributions side, there is an open request to add weights: https://github.com/pola-rs/polars/issues/2661

zundertj avatar Jul 08 '22 15:07 zundertj

Let me see if I can help. Since it may help to demonstrate using data, let's start with this dataset of 100 records.

import polars as pl

nbr_obs = 100
df = pl.DataFrame({
    'row_nr': pl.arange(0, nbr_obs, eager=True),
})
df
shape: (100, 1)
┌────────┐
│ row_nr │
│ ---    │
│ i64    │
╞════════╡
│ 0      │
├╌╌╌╌╌╌╌╌┤
│ 1      │
├╌╌╌╌╌╌╌╌┤
│ 2      │
├╌╌╌╌╌╌╌╌┤
│ 3      │
├╌╌╌╌╌╌╌╌┤
│ ...    │
├╌╌╌╌╌╌╌╌┤
│ 96     │
├╌╌╌╌╌╌╌╌┤
│ 97     │
├╌╌╌╌╌╌╌╌┤
│ 98     │
├╌╌╌╌╌╌╌╌┤
│ 99     │
└────────┘

The Sample Method

The sample method on a DataFrame has two selection parameters: n returns a specified number of rows, and frac returns a specified fraction of rows from the DataFrame. For example, if we want a sample that gives us exactly 70% of rows, we can use the frac keyword.

>>> df.sample(frac=0.7).shape
(70, 1)

Note that each time we run this, we will get exactly 70 rows (70% of 100 rows). We may get different rows in the result set, but we will always get 70 total rows.
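
Similarly (my addition, not part of the original comment): the n parameter returns exactly the requested number of rows.

>>> df.sample(n=10).shape
(10, 1)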

Row-wise Sampling

What I think this issue is requesting is row-wise sampling, which is something slightly different.

In row-wise sampling, we are making a decision, not on the DataFrame as a whole, but on a row-by-row basis. Essentially, for each row, we are flipping a coin and asking "Should this row be in the result set?". There is no consideration for the total number of rows that will be in the result set - only whether this particular row should be selected.

With our dataset above, we could simulate this by first drawing a random number for each row (from a uniform distribution). I am purposely omitting the seed parameter in the default_rng method, to demonstrate something later.

import numpy as np
(
    df
    .with_column(
        pl.Series(
            name="random_nbr",
            values=np.random.default_rng().uniform(0.0, 1.0, df.height),
        )
    )
)
shape: (100, 2)
┌────────┬────────────┐
│ row_nr ┆ random_nbr │
│ ---    ┆ ---        │
│ i64    ┆ f64        │
╞════════╪════════════╡
│ 0      ┆ 0.0422     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1      ┆ 0.100758   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 0.665123   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3      ┆ 0.581239   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...    ┆ ...        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 96     ┆ 0.1509     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 97     ┆ 0.376525   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 98     ┆ 0.845115   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 99     ┆ 0.188381   │
└────────┴────────────┘

Next, we'll choose a p value of 0.7 (70%). In essence, we are saying "each row independently has a 70% chance of being included in the result set", with no consideration as to how many rows have already been selected or how many rows will be selected in total.

We'll simulate that by filtering to keep only rows with random_nbr < 0.70.

p = 0.7
(
    df
    .with_column(
        pl.Series(
            name="random_nbr",
            values=np.random.default_rng().uniform(0.0, 1.0, df.height),
        )
    )
    .filter(pl.col('random_nbr') < p)
)
shape: (75, 2)
┌────────┬────────────┐
│ row_nr ┆ random_nbr │
│ ---    ┆ ---        │
│ i64    ┆ f64        │
╞════════╪════════════╡
│ 0      ┆ 0.674584   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1      ┆ 0.0712     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ 0.372068   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3      ┆ 0.113028   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...    ┆ ...        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 96     ┆ 0.302775   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 97     ┆ 0.652685   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 98     ┆ 0.026801   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 99     ┆ 0.016079   │
└────────┴────────────┘

Note that for this run, I got 75 rows. Each time I run this, the total number of rows selected may change. There is no guarantee of exactly 70 rows.

Row-wise sampling can be extended further, such as choosing not just whether a row is included, but how many times it should be included in the result set. Indeed, in some cases, the p value or distribution for each row may differ. (I won't go any further.)
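
As a quick sketch of the "how many times" variant (my own illustration, not part of the original comment), we can draw a per-row copy count and then repeat row indices accordingly:

import numpy as np

# Draw how many copies of each row to include (on average 0.7 copies per row).
rng = np.random.default_rng()
counts = rng.poisson(lam=0.7, size=df.height)

# Repeat each row index counts[i] times, then select those rows by position.
indices = np.repeat(np.arange(df.height), counts)
sampled_with_replacement = df[indices.tolist()]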

Lazy Row-Wise Sampling

How might we accomplish the above in Lazy mode? Here's one simple workaround.

I'll first define a function that will encapsulate my choice of distribution and p value. The Series s passed in (in this case) is used solely to obtain the number of records in the LazyFrame, and for no other reason. (I'll discuss this further later.)

def sample_it(s: pl.Series) -> pl.Series:
    return pl.Series(
        values=np.random.binomial(1, 0.7, s.len()),
        dtype=pl.Boolean,
    )

We'll put our DataFrame in Lazy mode, and then use the map method. The pl.first method simply returns the first column of the LazyFrame. We are using it only to obtain the length of the LazyFrame. It plays no other role in this example.

(
    df
    .lazy()
    .with_column(pl.first().map(sample_it).alias("_sample"))
    .filter(pl.col("_sample"))
    # .drop("_sample")
    .collect()
)
shape: (64, 2)
┌────────┬─────────┐
│ row_nr ┆ _sample │
│ ---    ┆ ---     │
│ i64    ┆ bool    │
╞════════╪═════════╡
│ 1      ┆ true    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2      ┆ true    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3      ┆ true    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 8      ┆ true    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ ...    ┆ ...     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 96     ┆ true    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 97     ┆ true    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 98     ┆ true    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 99     ┆ true    │
└────────┴─────────┘

Only 64 records on this run.
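
As an aside (my addition, not in the original comment): if a repeatable selection is needed, the helper can seed its generator. A minimal sketch:

def sample_it_seeded(s: pl.Series, seed: int = 0) -> pl.Series:
    # Same idea as sample_it above, but seeded so repeated runs select the same rows.
    rng = np.random.default_rng(seed)
    return pl.Series(
        values=rng.binomial(1, 0.7, s.len()),
        dtype=pl.Boolean,
    )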

Extensions

The above technique can be extended to pass not just the first column to our function, but columns that represent weights and/or parameters, so that the probability distribution determining inclusion differs for each row. Stratified sampling might be one such example, where the p value differs by group. (The sample_it function above would obviously need to change.)
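
For instance, here is a rough sketch of per-group probabilities (my own illustration, not from the comment above; it assumes a hypothetical df_groups frame with a "group" column and a dict of per-group p values):

import numpy as np
import polars as pl

# Hypothetical per-group inclusion probabilities.
p_by_group = {"a": 0.9, "b": 0.1}

def sample_by_group(s: pl.Series) -> pl.Series:
    # s holds each row's group label; draw one Bernoulli trial per row
    # using that row's group probability.
    probs = np.array([p_by_group[g] for g in s])
    draws = np.random.default_rng().uniform(0.0, 1.0, s.len())
    return pl.Series(values=draws < probs, dtype=pl.Boolean)

(
    df_groups  # hypothetical frame with a "group" column
    .lazy()
    .with_column(pl.col("group").map(sample_by_group).alias("_sample"))
    .filter(pl.col("_sample"))
    .drop("_sample")
    .collect()
)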

Was this helpful? If not, I apologize for the intrusion.

cbilot avatar Jul 08 '22 18:07 cbilot

Could you clarify what this request would add on top of the existing sample method? https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.sample.html. I see two pointers, but it is unclear what is blocking you:

  1. the existing method exists on DataFrame and Expr, but not directly on LazyFrame
  2. the existing method only supports Bernoulli (but how exactly would other distributions be interpreted, you would specify a row specific mean/variance/whatever parametrization the distribution takes?)

Hi @zundertj and @cbilot, thanks for the prompt replies! I am specifically talking about 1). I was only looking at the documentation for LazyFrame and not DataFrame, so I wasn't aware there was an in-memory sampling method for this. I would like to do this for LazyFrames since my datasets are too big to fit in memory. Thank you @cbilot for the example for lazy frames (and you pointed out correctly that row-wise sampling only gives me the number of rows in expectation, as opposed to the DataFrame sample function, which gives exactly the specified number of rows), but if I understand correctly, that requires me to specify a boolean mask in pl.Series() that's created in memory, which may once again hit memory problems (100+ billion rows). Is there a way to do this in a truly lazy way?

OneRaynyDay avatar Jul 08 '22 22:07 OneRaynyDay

I had not realized that in Polars, sample with a fraction guarantees that exactly the pre-specified fraction of rows is returned. The Pandas docs are more precise on this point.

Unfortunately, I cannot find a way to achieve this. The closest I have got without constructing the series in advance is this (using df as defined by @cbilot ):

from random import random

# need to wrap, as apply supplies the value of the column whilst random.random does not take one
def random2(x):
    return random()

# this adds an additional column with +/- 70% True values
df.lazy().with_column((pl.first().apply(random2) < 0.7).alias("_sample")).collect()

# but the same expression cannot be used by filter
df.lazy().filter(pl.first().apply(random2) < 0.7).collect()  
pyo3_runtime.PanicException: should be no nth at this point

zundertj avatar Jul 11 '22 21:07 zundertj

I am also interested in this (row-wise sampling on lazy frames).

@zundertj I was wondering why

df.lazy().with_column((pl.first().apply(random2) < 0.7).alias("_sample")).filter(pl.col('_sample')).collect()

would not suffice - I guess I'm not understanding why with_column does/must instantiate the entire _sample column?

Edit: I didn't realize that .apply() must materialize the DataFrame.

indigoviolet avatar Aug 20 '22 07:08 indigoviolet

Please separate the API for uniform sampling and weighted sampling. The former can avoid iterating all rows.

In Spark, if you have 1,000,000 rows in S3 and sample 100 rows, it downloads the entire dataset.

elbaro avatar Jan 05 '23 03:01 elbaro

I am interested in this feature.

akdienes avatar Jan 31 '23 16:01 akdienes

Me too, I was going to implement it with pyarrow, but then thought I'd try DuckDB or Polars. DuckDB has it but it's slow. Going coding I guess.

blais avatar Jun 10 '23 04:06 blais

Please separate the API for uniform sampling and weighted sampling. The former can avoid iterating all rows.

In Spark, if you have 1,000,000 rows in S3 and sample 100 rows, it downloads the entire dataset.

I'm not sure, but

import math

height = lazy_frame.select(pl.count()).collect().row(0)[0]
lazy_frame = lazy_frame.take_every(math.floor(height / sample_size))

still reads the whole dataset... see https://github.com/pola-rs/polars/issues/8664

.fetch() doesn't help much either, but perhaps once the streaming option works, one could append a random filter that passes only a fraction of rows and drops the rest. That could bring some performance improvement.

icfly2 avatar Aug 11 '23 07:08 icfly2

I would like to have this too! For now, I need to materialize dataframes in memory for that.

dazzle-me avatar Apr 09 '24 06:04 dazzle-me

if you are comfortable with using hash functions for randomness, a workaround is something like

lazy_df.with_row_index().filter(pl.col("index").hash(seed) % 10 == 1).drop("index")

to filter to ~10% of rows, etc.; you can change the seed on each invocation if you want to allow overlapping subsamples

obviously this is kind of ugly
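
For an arbitrary fraction p, a sketch along the same lines (my extension of the workaround above, with the same hash-based-randomness caveats; assumes lazy_df as above):

import polars as pl

p = 0.1    # target sampling fraction
seed = 42  # change per invocation for a different (possibly overlapping) sample

sampled = (
    lazy_df
    .with_row_index()
    .filter(pl.col("index").hash(seed) % 1_000_000 < int(p * 1_000_000))
    .drop("index")
)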

akdienes avatar Apr 22 '24 19:04 akdienes