Implementing the full functionality of the 'sample' function
The current implementation of the sample function is based on PySpark's sample function. As a result, the parameter n is not supported, and frac is required (it cannot be omitted).
According to the pandas source code (https://github.com/pandas-dev/pandas/blob/master/pandas/core/generic.py#L5076), the implementation of the sample function relies on methods such as iloc, take and reindex. These methods are already supported by Koalas, so the sample function could be implemented in Koalas following the same logic as pandas.
I will try to implement the sample function for both DataFrame and Series based on this idea.
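To make the idea concrete, here is a minimal sketch of how n and frac could be resolved into row positions, roughly following the pandas logic. The names sample_positions, length and seed are hypothetical and exist only for this illustration; they are not part of pandas or Koalas. The sketch also uses Python's random module instead of the NumPy random state that pandas uses, so it will not reproduce pandas' exact draws.

```python
import random


def sample_positions(length, n=None, frac=None, replace=False, seed=None):
    """Return a list of row positions to take (hypothetical helper)."""
    # n and frac are mutually exclusive, as in pandas.
    if n is not None and frac is not None:
        raise ValueError("Please enter a value for `frac` OR `n`, not both")

    if n is None and frac is None:
        # Neither given: default to a single row.
        n = 1
    elif frac is not None:
        if frac < 0:
            raise ValueError("`frac` must not be negative")
        # Convert the fraction into an absolute number of rows.
        n = int(round(frac * length))

    if n < 0:
        raise ValueError("`n` must not be negative")
    if not replace and n > length:
        raise ValueError(
            "Cannot take a larger sample than population when replace=False"
        )

    rng = random.Random(seed)
    if replace:
        # Sampling with replacement: positions may repeat.
        return [rng.randrange(length) for _ in range(n)]
    # Sampling without replacement: positions are unique.
    return rng.sample(range(length), n)


positions = sample_positions(10, frac=0.5, seed=42)
print(len(positions))
```

The resulting positions would then be passed to a positional selector such as take or iloc, assuming it accepts a list of integer positions on the target object.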
I also noticed that Series.sample doesn't support frac right now (koalas 1.4.0). Is that expected? And do you have a timeline for #1893 being merged?
Hi @amueller,
It seems that Series.sample supports the frac parameter now:
- https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.Series.sample.html
As for #1893, it is currently blocked by a performance concern (https://github.com/databricks/koalas/pull/1893#discussion_r521869698). Could you kindly advise us if you have a good idea?
Thanks!