Expr.format(fmt) - convert column to string with custom format string

Open 2-5 opened this issue 2 years ago • 8 comments
Problem description

Add a format method to Expr to allow fast column formatting. format can use the Rust syntax:

df = pl.DataFrame({"month": [1, 5, 10, 12]})
df.with_columns([
    pl.col("month").format("month: {:02}").alias("month_desc"),
])

2-5 avatar Feb 23 '23 20:02 2-5

FYI, here's the sprintf I wrote for C-style sprintf formatting. It doesn't cover every case, but it hits the most common ones.

import re

import polars as pl


def sprintf(s, fmt):
    """
    Convert a polars series to string format

    Inputs:
      * s - Series or expression
      * fmt - format specifier in printf format, e.g. "%0.2f". Only specifiers of s, d, and f are supported.
              If specifier 's' is provided, alignment arguments of '>' and '<' are allowed:
                  '>5s' - right-align, width 5
                  '<5s' - left-align, width 5

    """
    # parse format
    parser = re.compile(r"^%(?P<pct>%?)(?P<align>[\<\>|]?)(?P<head>\d*)(?P<dot>\.?)(?P<dec>\d*)(?P<char>[dfs])$")
    result = parser.match(fmt)
    if not result:
        raise ValueError(f"Invalid format {fmt} specified.")

    # determine total width & leading zeros
    head = result.group("head")
    if head != '':
        total_width = int(head)
        lead_zeros = head[0] == '0'
    else:
        total_width = 0
        lead_zeros = False

    # determine # of decimals
    if result.group("char") == 's':
        # string requested: return immediately
        expr = s.str.ljust(total_width) if result.group("align") == '<' else s.str.rjust(total_width)
        return pl.select(expr).to_series() if isinstance(s, pl.Series) else expr

    elif result.group("char") == 'd' or result.group("dot") != '.':
        num_decimals = 0
    else:
        num_decimals = int(result.group("dec"))

    # determine whether to display as percent
    if result.group("pct") == '%':
        s, pct = (s*100, [pl.lit('%')])
    else:
        s, pct = (s, [])

    # we require float dtype to perform any rounding
    s = s.cast(pl.Float32).round(num_decimals)

    if num_decimals > 0:
        # compute head portion
        head_width = max(0, total_width - num_decimals - 1)
        head = pl.when(s < 0).then(s.ceil()).otherwise(s.floor())

        # compute decimal portion
        decimal = (s-head)
        tail = [
            pl.lit('.'),
            (decimal*(10**num_decimals)).round(0).cast(pl.UInt16).cast(pl.Utf8).str.rjust(num_decimals, '0')
        ]
        head = head.cast(pl.Int32).cast(pl.Utf8)
    else:
        # we only have head portion
        head_width = total_width
        head = s.cast(pl.Int32).cast(pl.Utf8)
        tail = []

    head = head.str.zfill(head_width) if lead_zeros else head.str.rjust(head_width)
    expr = pl.concat_str([head, *tail, *pct])

    return pl.select(expr).to_series() if isinstance(s, pl.Series) else expr

mcrumiller avatar Feb 23 '23 20:02 mcrumiller
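
To make the parsing step of the function above easier to follow, here is a standalone sketch that runs only the regular expression from sprintf and returns the named groups. The helper name parse_spec is hypothetical and not part of the original comment; it needs only the standard library.

```python
import re

# Same pattern as in sprintf above: optional extra '%' (percent display),
# optional alignment, width digits, optional '.', decimals, conversion char.
SPEC = re.compile(
    r"^%(?P<pct>%?)(?P<align>[\<\>|]?)(?P<head>\d*)(?P<dot>\.?)(?P<dec>\d*)(?P<char>[dfs])$"
)


def parse_spec(fmt):
    """Return the named groups of a printf-style spec, or raise ValueError."""
    match = SPEC.match(fmt)
    if not match:
        raise ValueError(f"Invalid format {fmt} specified.")
    return match.groupdict()


print(parse_spec("%0.2f"))  # width digits "0", two decimals, float
print(parse_spec("%>5s"))   # right-aligned string of width 5
```

Note that a leading "0" in the width group is what sprintf treats as a request for zero padding.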

import polars as pl
df = pl.DataFrame({
    'a': [0, 1, 2, 3, 4]
})

works on a series

>>> sprintf(df['a'], "%0.2f")
shape: (5,)
Series: 'a' [str]
[
        "0.00"
        "1.00"
        "2.00"
        "3.00"
        "4.00"
]

works in an expression context

>>> df.select(
    sprintf(pl.col('a').cast(pl.Utf8), "%>5s")
)
shape: (5, 1)
┌───────┐
│ a     │
│ ---   │
│ str   │
╞═══════╡
│     0 │
│     1 │
│     2 │
│     3 │
│     4 │
└───────┘

mcrumiller avatar Feb 23 '23 21:02 mcrumiller

On similar lines, it would be immensely useful if pl.format() https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.format.html would take a std::fmt https://doc.rust-lang.org/std/fmt/ style format string.

pankajp avatar Feb 28 '23 06:02 pankajp

> On similar lines, it would be immensely useful if pl.format() https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.format.html would take std::fmt https://doc.rust-lang.org/std/fmt/ style format string

That format is resolved at compile time. We cannot access it at runtime.

ritchie46 avatar Feb 28 '23 06:02 ritchie46

Oh, that's too bad. Can we use something like https://docs.rs/runtime-fmt/latest/runtime_fmt/ ? Or else maybe a translator in the style of @mcrumiller's https://github.com/pola-rs/polars/issues/7133#issuecomment-1442422524, from Python's format string to polars expressions, using a string.Formatter subclass https://docs.python.org/3/library/string.html#string.Formatter, only for the Python bindings?

pankajp avatar Feb 28 '23 09:02 pankajp
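
As a rough illustration of the string.Formatter idea above: the standard library can already split a str.format-style template into literal text, field names and format specs. A translator would then map each spec to polars expressions. The function name split_format is hypothetical, and the mapping step is deliberately left out.

```python
from string import Formatter


def split_format(fmt):
    """Split a str.format-style template into (literal, field, spec) triples."""
    pieces = []
    for literal, field, spec, _conversion in Formatter().parse(fmt):
        pieces.append((literal, field, spec))
    return pieces


# One triple per replacement field; the spec "02" is what would need to be
# translated into something like a cast followed by a zero-fill.
print(split_format("month: {:02}"))
```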

> That format is at compile time. We cannot access it on runtime.

@ritchie46, if the input to whatever format-type function is used is a hard-coded string, is this not possible to generate? I'm assuming Python doesn't distinguish between constexpr-type string literals and regular Python strings generated at runtime.

mcrumiller avatar Feb 28 '23 14:02 mcrumiller

> I'm assuming python doesn't distinguish between constexpr-type string literal and regular python strings generated during runtime.

They are discussing parsing constexpr-type string literals at bytecode compilation time (https://peps.python.org/pep-0701/), but the reason is not performance.

2-5 avatar Feb 28 '23 14:02 2-5
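
For context on the literal-versus-runtime point: in Python, str.format interprets its template at call time, so a template assembled dynamically behaves the same as a hard-coded one. A minimal standard-library sketch (the variable names are illustrative only):

```python
# Build the template at runtime instead of writing it as a literal.
width = 2
template = "month: {:0" + str(width) + "}"

labels = [template.format(m) for m in [1, 5, 10, 12]]
print(labels)
```

This is the behaviour that Rust's compile-time format macros do not offer, which is why the thread turns to runtime formatting crates.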

It looks like the two most popular runtime f-string crates nowadays are https://lib.rs/crates/formatx and https://lib.rs/crates/strfmt. Would there be interest in using one of these in Expr.format() and df.write_csv()?

Wainberg avatar Jan 13 '24 23:01 Wainberg

Any updates on this?

mkleinbort-ic avatar Feb 13 '24 15:02 mkleinbort-ic

Just an example in Rust: run cargo add polars -F string_pad, then with a DataFrame you can use zfill as below.

let out = df
        .clone()
        .lazy()
        .select([
            col("date")
                .dt()
                .month()
                .cast(DataType::String)
                .str()
                .zfill(lit(2))
                .alias("month"),
            col("date").dt().year().alias("year"),
        ])
        .collect()
        .unwrap();

notmu avatar Mar 05 '24 02:03 notmu