
Implementing the full functionality of the 'sample' function

Open chi2liu opened this issue 5 years ago • 2 comments

The current implementation of the sample function is based on PySpark's sample function. As a result, the parameter n is not supported and frac cannot be empty.
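To illustrate why a fraction-only sampler has no natural place for n, here is a minimal pure-Python sketch of per-row (Bernoulli) sampling, which is roughly how fraction-based sampling without replacement behaves. The helper name `bernoulli_sample` is hypothetical and is not part of koalas or PySpark; the point is only that the size of the result varies from run to run, so an exact row count cannot be requested this way.

```python
import random


def bernoulli_sample(rows, frac, seed=None):
    """Keep each row independently with probability `frac`.

    Hypothetical helper for illustration only; not a koalas or PySpark API.
    """
    if frac is None:
        # Mirrors the limitation described above: frac is required.
        raise ValueError("frac must be specified")
    if not 0.0 <= frac <= 1.0:
        raise ValueError("frac must be between 0 and 1")
    rng = random.Random(seed)
    # Each row is decided on its own, so the result size is random.
    return [row for row in rows if rng.random() < frac]


if __name__ == "__main__":
    rows = list(range(1000))
    sizes = [len(bernoulli_sample(rows, 0.1, seed=s)) for s in range(5)]
    print(sizes)  # close to 100 each time, but generally not exactly 100
```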

According to the pandas source code (https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L5076), the sample function is implemented with methods such as iloc, take and reindex. Koalas already supports these methods, so sample could be implemented by following the current pandas logic.
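As a rough sketch of that pandas-style logic (not the actual pandas or koalas code), the steps are: resolve n from n/frac, choose random row positions, then take the rows at those positions. The function names `resolve_sample_size`, `sample_positions` and `take_rows` below are hypothetical, plain lists stand in for a frame, and only sampling without replacement and without weights is covered.

```python
import random


def resolve_sample_size(length, n=None, frac=None):
    """Turn the n/frac arguments into a concrete row count (hypothetical helper)."""
    if n is not None and frac is not None:
        raise ValueError("Please enter a value for `frac` OR `n`, not both")
    if n is None and frac is None:
        return 1  # assumption: default to one row, as pandas does
    if frac is not None:
        n = round(frac * length)
    if n < 0:
        raise ValueError("n must be non-negative")
    return n


def sample_positions(length, n=None, frac=None, seed=None):
    """Pick distinct random positions in range(length), without replacement."""
    size = resolve_sample_size(length, n=n, frac=frac)
    if size > length:
        raise ValueError("Cannot take a larger sample than the population")
    rng = random.Random(seed)
    return rng.sample(range(length), size)


def take_rows(rows, positions):
    """Positional lookup, standing in for a take/iloc call on a frame."""
    return [rows[i] for i in positions]


if __name__ == "__main__":
    rows = ["a", "b", "c", "d", "e", "f"]
    positions = sample_positions(len(rows), n=3, seed=0)
    print(take_rows(rows, positions))  # exactly three distinct rows
```

Unlike per-row sampling by fraction, this approach returns exactly n rows, but it relies on positional access to the data.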

I will try to implement the sample function for DataFrame and Series based on this idea.

chi2liu avatar Nov 04 '20 15:11 chi2liu

I also noticed that Series.sample doesn't support frac right now (koalas 1.4.0). Is that expected? And do you have a timeline for #1893 being merged?

amueller avatar Nov 24 '20 20:11 amueller

Hi @amueller,

It seems that Series.sample supports the frac parameter now.

  • https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.Series.sample.html

As for #1893, it is currently blocked by a performance concern (https://github.com/databricks/koalas/pull/1893#discussion_r521869698). Could you kindly advise us if you have a good idea?

Thanks!

ueshin avatar Nov 24 '20 22:11 ueshin