Allow array arguments to `search_sorted`
Problem description
I was working with some data where I had constructed an eCDF and wanted to build a table/dataframe showing the value of the distribution function at the data points nearest (or reasonably close) to a given set of values (e.g. find the `y` paired with the closest `x`). One approach I'm familiar with is to find the insertion indices of those values within my data. I found that `polars` has the `Series.search_sorted` method, which looked appropriate, except that it only accepts scalar arguments. The `pandas` version, `Series.searchsorted`, accepts array-like types as well as scalars, so I went back to using it. What I ended up doing (in `pandas`) was:
df["y"].iloc[df["x"].searchsorted(vals)]
In `polars`, I assumed a similar approach would be
df["y"].take(df["x"].search_sorted(vals))
except that when I tried this I got the following error:
---------------------------------------------------------------------------
PanicException Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 df["y"].take(df["x"].search_sorted(vals))
File /opt/conda/lib/python3.9/site-packages/polars/internals/series/series.py:1983, in Series.search_sorted(self, element)
1971 def search_sorted(self, element: int | float) -> int:
1972 """
1973 Find indices where elements should be inserted to maintain order.
1974
(...)
1981
1982 """
-> 1983 return pli.select(pli.lit(self).search_sorted(element))[0, 0]
File /opt/conda/lib/python3.9/site-packages/polars/internals/lazy_functions.py:1875, in select(exprs)
1835 def select(
1836 exprs: str | pli.Expr | Sequence[str | pli.Expr] | pli.Series,
1837 ) -> pli.DataFrame:
1838 """
1839 Run polars expressions without a context.
1840
(...)
1873
1874 """
-> 1875 return pli.DataFrame([]).select(exprs)
File /opt/conda/lib/python3.9/site-packages/polars/internals/dataframe/frame.py:5414, in DataFrame.select(self, exprs)
5318 def select(
5319 self: DF,
5320 exprs: str
(...)
5323 | Sequence[str | pli.Expr | pli.Series | pli.WhenThen | pli.WhenThenThen],
5324 ) -> DF:
5325 """
5326 Select columns from this DataFrame.
5327
(...)
5411
5412 """
5413 return self._from_pydf(
-> 5414 self.lazy()
5415 .select(exprs)
5416 .collect(no_optimization=True, string_cache=False)
5417 ._df
5418 )
File /opt/conda/lib/python3.9/site-packages/polars/internals/lazyframe/frame.py:1046, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown, common_subplan_elimination, allow_streaming)
1035 common_subplan_elimination = False
1037 ldf = self._ldf.optimization_toggle(
1038 type_coercion,
1039 predicate_pushdown,
(...)
1044 allow_streaming,
1045 )
-> 1046 return pli.wrap_df(ldf.collect())
PanicException: dtype List(shape: (12,)
Series: '' [f64]
[
0.000017
0.000167
0.001667
0.016667
0.083333
0.5
1.0
5.0
15.0
60.0
120.0
300.0
]) not implemented
I assume this is because array-type arguments to `Series.search_sorted` are not implemented (it works fine if `vals` is a scalar). If this functionality could be added, I think it would make a good addition to the library!
A temporary workaround is to use `pandas` instead, or to loop over the elements of `vals` and accumulate the results of `polars`' `Series.search_sorted` with a scalar argument. For a small enough `vals` the latter doesn't seem too bad (in my case it was only about 10 elements, so it would have been perfectly fine), but for many thousands or millions of elements it could become inefficient. I'm also open to other suggestions. I know the use of indexes is considered an anti-pattern in `polars`, so perhaps there is a preferred alternative I have overlooked.
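For reference, here is a minimal sketch of the loop workaround in plain Python. It uses the standard library's `bisect_left` as a stand-in for a scalar `search_sorted` call, so it only illustrates the logic and is not the polars API. The helper names (`search_sorted_many`, `lookup_at_insertion_points`) and the sample data are made up for illustration.

```python
from bisect import bisect_left


def search_sorted_many(sorted_xs, vals):
    """Insertion index of each value in sorted_xs, one scalar search per value."""
    return [bisect_left(sorted_xs, v) for v in vals]


def lookup_at_insertion_points(sorted_xs, ys, vals):
    """Pick the y paired with the first x that is >= each value.

    An insertion index equal to len(sorted_xs) means the value lies past
    the last data point, so None is returned instead of indexing out of range.
    """
    n = len(sorted_xs)
    return [ys[i] if i < n else None for i in search_sorted_many(sorted_xs, vals)]


# Hypothetical eCDF-like data: xs must already be sorted ascending.
xs = [0.1, 0.5, 1.0, 2.0, 5.0]
ys = [0.2, 0.4, 0.6, 0.8, 1.0]
vals = [0.05, 0.5, 1.5, 10.0]

indices = search_sorted_many(xs, vals)
picked = lookup_at_insertion_points(xs, ys, vals)
print(indices, picked)
```

Each lookup is a binary search, so the cost is roughly O(m log n) for m values. The Python-level loop is what would make this slow for millions of values.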
If you have many of them, you probably need to do an `asof_join`. Have you tried that?