polars
Expr.format(fmt) - convert column to string with custom format string
Problem description
Add a format method to Expr to allow fast column formatting. format can use the Rust syntax:
df = pl.DataFrame({"month": [1, 5, 10, 12]})
df.with_columns([
    pl.col("month").format("month: {:02}").alias("month_desc"),
])
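For reference, the values the proposed call would be expected to produce can be computed row by row in plain Python, since Python's format mini-language accepts the same {:02} spec. This is only a sketch of the intended output, not a polars implementation, and Expr.format itself remains a proposal here.

```python
# Reference sketch: what the proposed Expr.format("month: {:02}") would be
# expected to yield for each value, computed row by row in plain Python.
months = [1, 5, 10, 12]
fmt = "month: {:02}"  # same spec as in the proposal above

month_desc = [fmt.format(m) for m in months]
print(month_desc)
# ['month: 01', 'month: 05', 'month: 10', 'month: 12']
```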
FYI, here's a sprintf function I wrote for C-style sprintf formatting. It doesn't cover every case, but it handles the most common ones.
import re

def sprintf(s, fmt):
    """
    Convert a polars series to string format
    Inputs:
      * s - series or expression
      * fmt - format specifier in printf format, e.g. "%0.2f". Only specifiers of s, d, and f are supported.
        If specifier 's' is provided, alignment arguments of '>' and '<' are allowed:
          '>5s' - right-align, width 5
          '<5s' - left-align, width 5
    """
    # parse format
    parser = re.compile(r"^%(?P<pct>%?)(?P<align>[\<\>|]?)(?P<head>\d*)(?P<dot>\.?)(?P<dec>\d*)(?P<char>[dfs])$")
    result = parser.match(fmt)
    if not result:
        raise ValueError(f"Invalid format {fmt} specified.")
    # determine total width & leading zeros
    head = result.group("head")
    if head != '':
        total_width = int(head)
        lead_zeros = head[0] == '0'
    else:
        total_width = 0
        lead_zeros = False
    # determine # of decimals
    if result.group("char") == 's':
        # string requested: return immediately
        expr = s.str.ljust(total_width) if result.group("align") == '<' else s.str.rjust(total_width)
        return pl.select(expr).to_series() if isinstance(s, pl.Series) else expr
    elif result.group("char") == 'd' or result.group("dot") != '.':
        num_decimals = 0
    else:
        num_decimals = int(result.group("dec"))
    # determine whether to display as percent
    if result.group("pct") == '%':
        s, pct = (s*100, [pl.lit('%')])
    else:
        s, pct = (s, [])
    # we require float dtype to perform any rounding
    s = s.cast(pl.Float32).round(num_decimals)
    if num_decimals > 0:
        # compute head portion
        head_width = max(0, total_width - num_decimals - 1)
        head = pl.when(s < 0).then(s.ceil()).otherwise(s.floor())
        # compute decimal portion
        decimal = (s-head)
        tail = [
            pl.lit('.'),
            (decimal*(10**num_decimals)).round(0).cast(pl.UInt16).cast(pl.Utf8).str.rjust(num_decimals, '0')
        ]
        head = head.cast(pl.Int32).cast(pl.Utf8)
    else:
        # we only have head portion
        head_width = total_width
        head = s.cast(pl.Int32).cast(pl.Utf8)
        tail = []
    head = head.str.zfill(head_width) if lead_zeros else head.str.rjust(head_width)
    expr = pl.concat_str([head, *tail, *pct])
    return pl.select(expr).to_series() if isinstance(s, pl.Series) else expr
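To make the format parsing above easier to follow, here is a small stdlib-only sketch that runs the same regular expression on a few specifiers and shows the named groups sprintf would see. The helper name parse_spec is hypothetical and is not part of the function above.

```python
import re

# Same pattern as in sprintf above.
SPEC = re.compile(
    r"^%(?P<pct>%?)(?P<align>[\<\>|]?)(?P<head>\d*)(?P<dot>\.?)(?P<dec>\d*)(?P<char>[dfs])$"
)

def parse_spec(fmt):
    """Hypothetical helper: return the named groups for a format specifier."""
    m = SPEC.match(fmt)
    if m is None:
        raise ValueError(f"Invalid format {fmt} specified.")
    return m.groupdict()

for fmt in ("%0.2f", "%>5s", "%05d", "%%.1f"):
    print(fmt, parse_spec(fmt))
# %0.2f -> head='0', dot='.', dec='2', char='f'
# %>5s  -> align='>', head='5', char='s'
# %05d  -> head='05', char='d' (leading zero requests zero padding)
# %%.1f -> pct='%', dot='.', dec='1', char='f' (value shown as a percentage)
```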
import polars as pl
df = pl.DataFrame({
    'a': [0, 1, 2, 3, 4]
})
works on a series
>>> sprintf(df['a'], "%0.2f")
shape: (5,)
Series: 'a' [str]
[
"0.00"
"1.00"
"2.00"
"3.00"
"4.00"
]
works in an expression context
>>> df.select(
    sprintf(pl.col('a').cast(pl.Utf8), "%>5s")
)
shape: (5, 1)
┌───────┐
│ a     │
│ ---   │
│ str   │
╞═══════╡
│     0 │
│     1 │
│     2 │
│     3 │
│     4 │
└───────┘
Along similar lines, it would be immensely useful if pl.format() (https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.format.html) accepted a std::fmt (https://doc.rust-lang.org/std/fmt/) style format string.
That format is resolved at compile time. We cannot access it at runtime.
Oh, that's too bad. Could we use something like https://docs.rs/runtime-fmt/latest/runtime_fmt/?
Or else, maybe a translator in the style of @mcrumiller's sprintf (https://github.com/pola-rs/polars/issues/7133#issuecomment-1442422524), going from Python's format strings to polars expressions via a string.Formatter subclass (https://docs.python.org/3/library/string.html#string.Formatter), only for the Python bindings?
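As a rough illustration of the string.Formatter idea, the sketch below uses the standard library's Formatter.parse to split a Python format string into literal text and fields with their format specs. The helper name split_template is hypothetical, it calls Formatter().parse directly rather than subclassing, and the step that would map each part to a polars expression is deliberately left out.

```python
from string import Formatter

def split_template(template):
    """Hypothetical helper: break a str.format-style template into parts.

    Returns a list of ("lit", text) and ("field", name, spec) tuples. A real
    translator would map each tuple to a polars expression (a literal, or a
    column formatted according to spec); that mapping is not shown here.
    """
    parts = []
    for literal, field, spec, _conversion in Formatter().parse(template):
        if literal:
            parts.append(("lit", literal))
        if field is not None:
            parts.append(("field", field, spec))
    return parts

print(split_template("month: {month:02}"))
# [('lit', 'month: '), ('field', 'month', '02')]
```

Because the template is parsed at runtime in Python, this approach would not depend on Rust's compile-time format macros.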
> That format is resolved at compile time. We cannot access it at runtime.
@ritchie46, if the input to whatever format-type function is used is a hard-coded string, would it not be possible to generate this? I'm assuming Python doesn't distinguish between constexpr-type string literals and regular Python strings generated at runtime.
> I'm assuming Python doesn't distinguish between constexpr-type string literals and regular Python strings generated at runtime.
They are discussing parsing constexpr-type string literals at bytecode compilation time (see https://peps.python.org/pep-0701/), but the reason is not performance.
It looks like the two most popular runtime f-string crates nowadays are https://lib.rs/crates/formatx and https://lib.rs/crates/strfmt. Would there be interest in using one of these in Expr.format() and df.write_csv()?
Any updates on this?
Just an example in Rust:
cargo add polars -F string_pad
Then, with a DataFrame in Rust, you can use zfill as below:
let out = df
    .clone()
    .lazy()
    .select([
        col("date")
            .dt()
            .month()
            .cast(DataType::String)
            .str()
            .zfill(lit(2))
            .alias("month"),
        col("date").dt().year().alias("year"),
    ])
    .collect()
    .unwrap();