Allow array arguments to `search_sorted`

Open • ecotner opened this issue 2 years ago • 1 comment

Problem description

I was working with some data where I had constructed an eCDF, and I wanted to build a table/dataframe of the distribution function's value at the data points nearest (or reasonably close) to a given set of values (e.g. find the y paired with the closest x). One approach I'm familiar with is to find the insertion indices of that set of values within my sorted data. I found that polars has the Series.search_sorted method, which looked appropriate, except that it only accepts scalar arguments. pandas' Series.searchsorted, on the other hand, accepts array-like types as well as scalars, so I went back to using it. What I ended up doing (in pandas) was:

df["y"].iloc[df["x"].searchsorted(vals)]

In polars, I assumed a similar approach would be

df["y"].take(df["x"].search_sorted(vals))

except when I tried this I get the following error:

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 df["y"].take(df["x"].search_sorted(vals))

File /opt/conda/lib/python3.9/site-packages/polars/internals/series/series.py:1983, in Series.search_sorted(self, element)
   1971 def search_sorted(self, element: int | float) -> int:
   1972     """
   1973     Find indices where elements should be inserted to maintain order.
   1974 
   (...)
   1981 
   1982     """
-> 1983     return pli.select(pli.lit(self).search_sorted(element))[0, 0]

File /opt/conda/lib/python3.9/site-packages/polars/internals/lazy_functions.py:1875, in select(exprs)
   1835 def select(
   1836     exprs: str | pli.Expr | Sequence[str | pli.Expr] | pli.Series,
   1837 ) -> pli.DataFrame:
   1838     """
   1839     Run polars expressions without a context.
   1840 
   (...)
   1873 
   1874     """
-> 1875     return pli.DataFrame([]).select(exprs)

File /opt/conda/lib/python3.9/site-packages/polars/internals/dataframe/frame.py:5414, in DataFrame.select(self, exprs)
   5318 def select(
   5319     self: DF,
   5320     exprs: str
   (...)
   5323     | Sequence[str | pli.Expr | pli.Series | pli.WhenThen | pli.WhenThenThen],
   5324 ) -> DF:
   5325     """
   5326     Select columns from this DataFrame.
   5327 
   (...)
   5411 
   5412     """
   5413     return self._from_pydf(
-> 5414         self.lazy()
   5415         .select(exprs)
   5416         .collect(no_optimization=True, string_cache=False)
   5417         ._df
   5418     )

File /opt/conda/lib/python3.9/site-packages/polars/internals/lazyframe/frame.py:1046, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown, common_subplan_elimination, allow_streaming)
   1035     common_subplan_elimination = False
   1037 ldf = self._ldf.optimization_toggle(
   1038     type_coercion,
   1039     predicate_pushdown,
   (...)
   1044     allow_streaming,
   1045 )
-> 1046 return pli.wrap_df(ldf.collect())

PanicException: dtype List(shape: (12,)
Series: '' [f64]
[
	0.000017
	0.000167
	0.001667
	0.016667
	0.083333
	0.5
	1.0
	5.0
	15.0
	60.0
	120.0
	300.0
]) not implemented

I assume this is because array-type arguments to Series.search_sorted are not implemented (it works fine if vals is a scalar). If this functionality could be added, I think it would make a good addition to the library!

A temporary workaround is to use pandas instead, or to loop over the elements of vals and accumulate the results of polars' Series.search_sorted called with a scalar argument. For a small enough vals the latter doesn't seem too bad (in my case it was only about 10 elements, so it would have been perfectly fine), but for many thousands or millions of elements it could get inefficient. I'm also open to other suggestions. I know use of indexes is considered an anti-pattern in polars, so perhaps there is a preferred alternative I have overlooked.
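The loop workaround described above can be sketched roughly as follows. This is a minimal illustration, not polars code: the standard library's bisect_left stands in for a scalar call to Series.search_sorted, the helper name search_sorted_many and the sample data are hypothetical, and clamping the index for values past the last x is an assumption about the desired behaviour.

```python
from bisect import bisect_left


def search_sorted_many(sorted_xs, vals):
    """Return the left insertion index of each value in vals.

    One scalar binary search per value. With polars, the body of the
    comprehension would instead be a scalar df["x"].search_sorted(v) call.
    """
    return [bisect_left(sorted_xs, v) for v in vals]


# Hypothetical eCDF-like data: xs must already be sorted ascending.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.25, 0.5, 0.75, 1.0]
vals = [0.5, 2.0, 2.5]

idx = search_sorted_many(xs, vals)

# Assumption: clamp to the last row so values beyond max(xs) do not
# index past the end of ys.
nearest_ys = [ys[min(i, len(ys) - 1)] for i in idx]
```

Each lookup costs O(log n), so the loop is O(m log n) for m values. That is fine for about ten values, but the Python-level loop is what becomes slow for very large vals.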

ecotner • Nov 23 '22 00:11

If you have many of them you probably need to do an asof join (join_asof). Have you tried that?

ritchie46 • Nov 25 '22 08:11