polars
Add np.linspace-style function
Problem description
pl.arange() does not allow non-integer step sizes. This can be worked around, but having the option for non-integer endpoints and step sizes would be a nice feature. In the meantime, here is a workaround:
import polars as pl
import numpy as np
import math
low, high, step = 6.7, 10.3, 0.3
# numpy version
x_np = np.arange(low, high, step)
# polars version
def arange_float(high, low, step):
    return pl.arange(
        low=0,
        high=math.floor((high-low)/step),
        step=1,
        eager=True
    ).cast(pl.Float32)*step + low
def linspace(high, low, num_points):
    step = (high-low)/(num_points-1)
    return arange_float(high, low, step)
x_pl = arange_float(high, low, step)
print(x_np)
print(x_pl)
Not sure why np included 10.3, might be float rounding, but regardless:
[ 6.7 7. 7.3 7.6 7.9 8.2 8.5 8.8 9.1 9.4 9.7 10. 10.3]
shape: (12,)
Series: 'arange' [f32]
[
6.7
7.0
7.3
7.6
7.9
8.2
8.5
8.8
9.1
9.4
9.7
10.0
]
I would welcome an implementation of float_range / float_ranges!
@stinodego thoughts on having a single pl.range / pl.ranges that covers both int and float? It could auto-infer the type based on the input arguments, and potentially have a dtype argument to override the auto-inferred type.
Also, would you be open to supporting pl.linspace as well, to match np.linspace? It's often convenient to specify the number of steps instead of the step size. It's also easy to make a mistake - for example, the OP got the implementation wrong! It should be something like arange(low, high + step / 2, step), not arange(low, high, step).
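To make the endpoint problem concrete, here is a minimal pure-Python sketch. It makes no Polars calls, and the helper names `arange_floor`, `linspace_via_arange`, and `linspace_by_index` are hypothetical, chosen only for illustration. It compares a linspace built on a floor-based count, as in the workaround above, with one that generates a fixed number of values and pins the last one to the endpoint.

```python
import math


def arange_floor(low, high, step):
    """Half-open range whose length is floor((high - low) / step)."""
    n = math.floor((high - low) / step)
    return [low + i * step for i in range(n)]


def linspace_via_arange(low, high, num_points):
    """Linspace built on the floor-based range: it cannot reach num_points values."""
    step = (high - low) / (num_points - 1)
    return arange_floor(low, high, step)


def linspace_by_index(low, high, num_points):
    """Generate exactly num_points values and pin the last one to `high`."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2 in this sketch")
    step = (high - low) / (num_points - 1)
    return [low + i * step for i in range(num_points - 1)] + [float(high)]


short = linspace_via_arange(6.7, 10.3, 13)
full = linspace_by_index(6.7, 10.3, 13)
print(len(short), len(full))
```

The floor-based variant always returns fewer than the requested 13 values, because the computed count is at most `num_points - 1`; the index-based variant returns exactly 13 and ends on 10.3.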
thoughts on having a single pl.range / pl.ranges that covers both int and float? It could auto-infer the type based on the input arguments, and potentially have a dtype argument to override the auto-inferred type.
This might happen in the future. For now, we have specialized ranges for each type.
Not sure about supporting an equivalent for linspace for now. You can relatively easily write your own implementation using a float_range. Maybe in the future.
After discussing this some more with @stinodego, we've agreed that float_range is problematic with regard to endpoint handling and float precision, so a linspace-style function would be better suited.
So for the moment the proposed design is
pl.linspace(start, stop, samples, closed="both" | "left" | "right")
samples indicates the number of values returned, similar to np.linspace. closed is a small generalization of numpy's endpoint=True parameter, best explained by example:
pl.linspace(0, 1, 4, closed="both") -> [0.0, 0.333..., 0.666..., 1.0]
pl.linspace(0, 1, 4, closed="left") -> [0.0, 0.25, 0.5, 0.75]
pl.linspace(0, 1, 4, closed="right") -> [0.25, 0.5, 0.75, 1.0]
This function will also support inverted ranges:
pl.linspace(1, 0, 4, closed="both") -> [1.0, 0.666..., 0.333..., 0.0]
pl.linspace(1, 0, 4, closed="left") -> [1.0, 0.75, 0.5, 0.25]
pl.linspace(1, 0, 4, closed="right") -> [0.75, 0.5, 0.25, 0.0]
Finally, the input dtypes allowed are all numeric types (albeit always with a float output), but also dates and times.
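The `closed` semantics in the examples above can be sketched in plain Python. This is only an illustration inferred from the six examples, not the Polars implementation: the function name `linspace_sketch` is hypothetical, and the single-sample behavior for `closed="both"` is an assumption the thread does not specify.

```python
def linspace_sketch(start, stop, samples, closed="both"):
    """Return `samples` evenly spaced floats between start and stop.

    closed="both":  both endpoints included, spacing = span / (samples - 1)
    closed="left":  start included, stop excluded, spacing = span / samples
    closed="right": start excluded, stop included, spacing = span / samples
    """
    if samples < 1:
        raise ValueError("samples must be positive")
    span = stop - start  # negative for inverted ranges, which then just work
    if closed == "both":
        if samples == 1:
            return [float(start)]  # assumption: not specified in the thread
        return [start + span * i / (samples - 1) for i in range(samples)]
    if closed == "left":
        return [start + span * i / samples for i in range(samples)]
    if closed == "right":
        return [start + span * i / samples for i in range(1, samples + 1)]
    raise ValueError(f"unsupported closed value: {closed!r}")


print(linspace_sketch(0, 1, 4, closed="left"))
print(linspace_sketch(1, 0, 4, closed="right"))
```

Computing each value as `start + span * i / n` rather than accumulating a step keeps rounding error from building up along the range.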
The only open question is the name of the function. We're not a huge fan of the name linspace as it clashes with Polars naming policy, so we're open to brainstorming for an alternative name. Options include (but are not limited to):
- pl.interval_sample
- pl.interval_range
- pl.linear_sample
- pl.linear_space
@orlp how about pl.grid? It's both simple and obvious. It also opens up the door (if we want) to creating, say, an N-D grid via a struct or array:
Single dimension
>>> pl.grid(0, 1, 4, closed=Both) -> [0.0, 0.333..., 0.666..., 1.0]
shape: (4,)
Series: '' [f64]
[
0.0
0.333333
0.666667
1.0
]
Two dimensions
>>> pl.grid([0, 4], [2, 0], [4, 3], closed=Both)
shape: (12,)
Series: '' [struct[2]]
[
{0.0,2}
{0.0,1}
{0.0,0}
{0.333333,2}
{0.333333,1}
{0.333333,0}
{0.666667,2}
{0.666667,1}
{0.666667,0}
{1.0,2}
{1.0,1}
{1.0,0}
]
@mcrumiller We did consider grid but didn't like its 2D implication. We're not sure if we want a 2D (or N-D) version at this time.
@orlp makes sense, and for N-D behavior I would be in favor of the name meshgrid anyway, which is what Matlab calls it and makes it a bit more obvious. I will say the implementation would be pretty easy if you utilized our existing cross-join behavior.
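As a rough illustration of the cross-join idea, an N-D grid is the Cartesian product of 1-D closed ranges. The sketch below uses only the standard library; `closed_linspace` and `meshgrid_sketch` are hypothetical names, and the per-axis `(start, stop, samples)` tuple layout is an assumption made for this example rather than the signature proposed above.

```python
from itertools import product


def closed_linspace(start, stop, samples):
    """Evenly spaced floats including both endpoints (samples >= 2)."""
    span = stop - start
    return [start + span * i / (samples - 1) for i in range(samples)]


def meshgrid_sketch(axes):
    """Cross-join the 1-D ranges; each axis is a (start, stop, samples) tuple."""
    return list(product(*(closed_linspace(*axis) for axis in axes)))


# 4 points on [0, 1] crossed with 3 points on [2, 0] gives 12 rows.
grid = meshgrid_sketch([(0, 1, 4), (2, 0, 3)])
for row in grid:
    print(row)
```

The rows come out in the same order as the two-dimensional example above, with the first axis varying slowest, although here every coordinate is a float.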
I think linspace is the best name here, as it's fairly well-known, and having the closed parameter allows the use of float range behavior with ease.
I really like the use of the closed parameter!
There are a few limitations with the proposed solution:
- pl.int_range() is only for integers, and pl.linspace() is only for floats. Sometimes you want linspace-like behavior for integers and vice versa, which is why np.arange() and np.linspace() support both integers and floats.
- pl.int_range() would also benefit from having a closed argument. Currently it behaves like closed='left', but it's pretty common to want closed='both', for instance.
- It's inconsistent to have int_range(), time_range(), date_range(), and datetime_range(), but not float_range().
- There's no linspaces() being proposed, to go with int_ranges().
- closed="none" isn't supported, which is inconsistent with time_range(), date_range(), datetime_range(), is_between(), etc. (Also, shouldn't it be "neither" instead of "none"?)
- samples makes it sound like random sampling. np.linspace() uses num instead of samples; you could also call it n_steps.
My proposed solution would be to cover the behavior of pl.int_range() and pl.linspace() in a single function:
pl.range(start, stop=None, step=None, *, n_steps=None, closed="both" | "left" | "right" | "neither",
dtype=None | PolarsNumericType)
pl.range(n) would be equivalent to pl.range(0, n), just like in Python. step and n_steps would be mutually exclusive, similar to how n and fraction are in Expr.sample. dtype would be auto-inferred (pl.Int64 if all arguments are ints, pl.Float64 if any argument is a float) or set to any numeric dtype. You could also have pl.ranges() similar to the current pl.int_ranges(). closed="none" would be renamed to closed="neither" for all polars functions and methods that support a closed parameter.
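The argument handling described here could look roughly like the following. This is a sketch of the proposal only, not an existing Polars API: `resolve_range_args` is a hypothetical helper, the dtype is returned as a plain string, and defaulting `step` to 1 when neither `step` nor `n_steps` is given is an assumption borrowed from Python's `range`.

```python
def resolve_range_args(start, stop=None, step=None, *, n_steps=None):
    """Normalise the arguments of the proposed unified range function."""
    if stop is None:
        # range(n) is shorthand for range(0, n), as in Python
        start, stop = 0, start
    if step is not None and n_steps is not None:
        raise ValueError("step and n_steps are mutually exclusive")
    if step is None and n_steps is None:
        step = 1  # assumption: same default as Python's range
    # Stated rule, applied literally: Float64 if any argument is a float,
    # otherwise Int64. n_steps is a count, so it is not considered.
    numeric_args = [a for a in (start, stop, step) if a is not None]
    dtype = "Float64" if any(isinstance(a, float) for a in numeric_args) else "Int64"
    return start, stop, step, n_steps, dtype


print(resolve_range_args(5))
print(resolve_range_args(0, 1.0, n_steps=4))
```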
@Wainberg
- pl.linspace would also accept int inputs, but its outputs would always be floats (or datetimes/times/durations for those relevant types).
- Perhaps int_range could use the closed argument, but that's for a different issue.
- It may be inconsistent, but float_range is just a highly problematic function due to the rounding errors introduced by IEEE 754 floating point. E.g. float_range(0, 0.9, 0.1) would result in [0, 0.1, ..., 0.8], but float_range(0, 0.9, 0.3) would result in [0, 0.3, 0.6, 0.8999999999999999] because in floating point arithmetic 0.1 * 9 >= 0.9 but 0.3 * 3 < 0.9.
  Note that numpy itself also recommends that you not use arange for float steps: "When using a non-integer step, such as 0.1, it is often better to use numpy.linspace." It also has a complete warning block explaining how "The length of the output might not be numerically stable." In a columnar dataframe library, where we expect columns to have equal lengths within a dataframe, that is a rather huge footgun.
- I should have specified that, yes, pl.linspaces would be included.
- I don't have an interpretation of what closed="none" could be for linearly spaced values (I agree w.r.t. neither vs none but not sure if it's worth changing).
- Numpy also calls them samples: "Returns num evenly spaced samples". Not a huge fan of n_steps because it's just not correct: in linspace(0, 1, 4) you take 3 steps to go from the start to the stop. And I don't think num is particularly descriptive.
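The length instability can be reproduced in plain Python. The sketch below assumes a hypothetical half-open `float_range_sketch` that computes each value as `start + i * step`; a real implementation might accumulate differently, but the rounding behavior of `0.1 * 9` and `0.3 * 3` is the same either way.

```python
def float_range_sketch(start, stop, step):
    """Half-open [start, stop) range built as start + i * step (assumed semantics)."""
    if step <= 0:
        raise ValueError("this sketch only handles positive steps")
    values = []
    i = 0
    while start + i * step < stop:
        values.append(start + i * step)
        i += 1
    return values


tenths = float_range_sketch(0, 0.9, 0.1)
thirds = float_range_sketch(0, 0.9, 0.3)
print(len(tenths), tenths[-1])  # stops before 0.9, as intended
print(len(thirds), thirds[-1])  # one extra value just below 0.9
```

With a step of 0.1 the range has the intended 9 values, while with a step of 0.3 it has 4 rather than 3, because `0.3 * 3` rounds to a value slightly below 0.9.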
I am not a huge fan of having a single function that covers both use-cases. The functions just do different things, especially with respect to their interpretation of closedness (for int_range the closedness only refers to the endpoints of the complete range, whereas for linspace it refers to how each sample should be interpreted). In general I think having arguments that are mutually exclusive with other arguments is poor design in Python. We should be removing cases where we do that, instead of adding more.
My 2 cents on this: there is the issue #7525 for adding the periods argument to the pl.date_range function. I wanted to point out that adding this argument gives you both "arange"-type and "linspace"-type behaviour. So whether this linspace is consolidated into a *_range function, or it is separate, it might also make sense to do the same with date ranges.
Note that date/datetimes have an integer representation, so the issues regarding floating points still stand.
You can draw parallels with the pandas.date_range function. In pandas.date_range you have four parameters, start, end, periods, and freq, and you must specify exactly three of them, i.e. leave out one.
from datetime import datetime, timedelta
import polars as pl
import pandas as pd
start = datetime(2024, 1, 1)
end = datetime(2024, 1, 2)
periods = 3
freq = timedelta(hours=8)
# Combination 1: leave out `periods`
pd.date_range(start=start, end=end, freq=freq)
# Analogous to current `pl.int_range()` and `pl.datetime_range()`
pl.datetime_range(start=start, end=end, interval=freq, eager=True)
pl.int_range(start=0, end=10, step=1, eager=True)
# Combination 2: leave out `freq`
pd.date_range(start=start, end=end, periods=periods)
# Not implemented in polars, but this is np.linspace behaviour
# Combination 3: leave out `end`
pd.date_range(start=start, freq=freq, periods=periods)
# Best there is
pl.datetime_range(start=start, end=start + freq * periods, interval=freq, closed="left", eager=True)
pl.int_range(start=2, end=2 + 3 * 10, step=3, eager=True)
# Combination 4: leave out `start`
pd.date_range(end=end, freq=freq, periods=periods)
# Best there is
pl.datetime_range(start=end - freq * periods, end=end, interval=freq, closed="right", eager=True)
pl.int_range(start=31 - 3 * 5, end=31, step=3, eager=True)
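The "specify exactly three of four" rule can be sketched with the standard library alone. `datetime_range_sketch` is a hypothetical helper written for this illustration; it is not a pandas or Polars function, and it assumes a positive `freq` and a closed range that includes both ends.

```python
from datetime import datetime, timedelta


def datetime_range_sketch(start=None, end=None, periods=None, freq=None):
    """Derive the missing one of start/end/periods/freq, then build the range."""
    given = sum(x is not None for x in (start, end, periods, freq))
    if given != 3:
        raise ValueError("specify exactly three of start, end, periods, freq")
    if periods is None:
        periods = (end - start) // freq + 1  # timedelta // timedelta is an int
    elif freq is None:
        if periods < 2:
            raise ValueError("periods must be at least 2 to infer freq")
        freq = (end - start) / (periods - 1)  # the linspace-like case
    elif start is None:
        start = end - freq * (periods - 1)
    # When end is the missing value it is implied by start, freq and periods.
    return [start + freq * i for i in range(periods)]


start = datetime(2024, 1, 1)
end = datetime(2024, 1, 2)
eight_hours = timedelta(hours=8)

print(datetime_range_sketch(start=start, end=end, freq=eight_hours))
print(datetime_range_sketch(start=start, end=end, periods=3))
print(datetime_range_sketch(start=start, freq=eight_hours, periods=3))
print(datetime_range_sketch(end=end, freq=eight_hours, periods=3))
```

Because datetimes and timedeltas are integer-backed, leaving out `freq` gives linspace-style behavior without the float rounding concerns discussed earlier, as long as the span divides evenly.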
My proposed solution would be to cover the behavior of pl.int_range() and pl.linspace() in a single function:
Big +1 from me. This is how Julia’s range function works by default and it's great.
range(start, stop, length)
range(start, stop; length, step)
range(start; length, stop, step)
range(;start, length, stop, step)
Construct a specialized array with evenly spaced elements and optimized storage (an
AbstractRange) from the arguments. Mathematically a range is uniquely determined by any three of
start, step, stop and length. Valid invocations of range are:
• Call range with any three of start, step, stop, length.
• Call range with two of start, stop, length. In this case step will be assumed to be
one. If both arguments are Integers, a UnitRange will be returned.
• Call range with one of stop or length. start and step will be assumed to be one.
[In Python we'd want start to default to 0]