logfire
We should build a polars instrumentation
It would give you query/operation times and the query plan; see this tweet 🐦.
Ritchie says:
I don't have the full context, but the query plan can be serialized to JSON or visualized via .explain.
So we probably hook into .collect() for LazyFrame? Really, it would be nice for polars to provide a global registry of hooks to call when it does work.
cc @ritchie46 in case you have any thoughts?
To be clear, that integration doesn't have to be logfire-specific in any way. It could just be something like:
class PolarsTrace:
    def on_start(self, **payload) -> None: ...
    def on_end(self, **payload) -> None: ...

class PolarsTracer:
    def start_span(self) -> PolarsTrace: ...
I guess the question is what goes in payload. I'd say anything that's "free" to get (presumably a plan, execution time, etc.).
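To make the shape of that interface a bit more concrete, here is a rough pure-Python sketch of how a global registry could drive it. Everything beyond PolarsTrace and PolarsTracer is hypothetical: register_tracer, run_traced, the elapsed_seconds payload key and the recording classes are made up for illustration, and polars does not expose anything like this today.

```python
import time
from typing import Any, Callable, List, Protocol, Tuple


class PolarsTrace(Protocol):
    def on_start(self, **payload: Any) -> None: ...
    def on_end(self, **payload: Any) -> None: ...


class PolarsTracer(Protocol):
    def start_span(self) -> PolarsTrace: ...


# Hypothetical global registry; not an existing polars API.
_TRACERS: List[PolarsTracer] = []


def register_tracer(tracer: PolarsTracer) -> None:
    _TRACERS.append(tracer)


def run_traced(work: Callable[[], Any], **payload: Any) -> Any:
    """Run `work`, notifying every registered tracer before and after."""
    spans = [tracer.start_span() for tracer in _TRACERS]
    for span in spans:
        span.on_start(**payload)
    start = time.perf_counter()
    try:
        return work()
    finally:
        # on_end also fires if `work` raises, so every span gets closed.
        elapsed = time.perf_counter() - start
        for span in spans:
            span.on_end(elapsed_seconds=elapsed, **payload)


class RecordingTrace:
    """Toy trace that appends events to a shared list instead of exporting them."""

    def __init__(self, events: List[Tuple[str, dict]]) -> None:
        self.events = events

    def on_start(self, **payload: Any) -> None:
        self.events.append(("start", payload))

    def on_end(self, **payload: Any) -> None:
        self.events.append(("end", payload))


class RecordingTracer:
    def __init__(self) -> None:
        self.events: List[Tuple[str, dict]] = []

    def start_span(self) -> RecordingTrace:
        return RecordingTrace(self.events)


tracer = RecordingTracer()
register_tracer(tracer)
# The lambda stands in for whatever work polars would do (e.g. a collect).
result = run_traced(lambda: sum(range(10)), plan="<serialized plan>")
print(tracer.events)
```

The point of routing everything through one function is that the library decides what goes into payload, and integrations like logfire only need to implement the two small classes.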
I am not entirely sure I understand what a trace is exactly yet.
Things that might be interesting: .profile gives you the result and a DataFrame containing the timings per operation (optimizer, join, grouping, etc.).
pl.LazyFrame({"foo": [1, 2, 3]}).serialize() gives you a JSON repr of the logical plan, which can be reconstructed later (this is cheap as long as there are no real in-memory tables in the plan).
Other than that, there is a hook where you can get hold of the IR post-optimization (https://github.com/pola-rs/polars/pull/15972), though this is very much leaking internals. We added it so that we can hook cudf in, but I am not sure it would be wise to build upon this.
I am not entirely sure I understand what a trace is exactly yet.
A trace is really similar to a log statement. The only difference is that it has a start and an end (and hence a duration) and a context (where in the execution of the program it started and where it ended).
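As a rough illustration of that definition, here is a stdlib-only sketch of a span: a record with a start, an end and a parent that captures where in the program it was opened. The Span class, the span() helper and the finished list are hypothetical names for this example only; real tracing libraries (OpenTelemetry, logfire) are far more involved.

```python
import contextvars
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Span:
    name: str
    parent: Optional["Span"]  # the "context": which span was active at start
    start: float
    end: Optional[float] = None

    @property
    def duration(self) -> Optional[float]:
        return None if self.end is None else self.end - self.start


# Tracks the currently active span so nested spans can find their parent.
_current_span = contextvars.ContextVar("current_span", default=None)

# Finished spans; a real tracer would export these to a backend instead.
finished: List[Span] = []


@contextmanager
def span(name: str) -> Iterator[Span]:
    s = Span(name=name, parent=_current_span.get(), start=time.perf_counter())
    token = _current_span.set(s)
    try:
        yield s
    finally:
        s.end = time.perf_counter()
        _current_span.reset(token)  # restore the previous active span
        finished.append(s)


with span("collect") as outer:
    with span("sort") as inner:
        sorted([3, 1, 2])

for s in finished:
    print(s.name, "parent:", s.parent.name if s.parent else None, "duration:", s.duration)
```

A plain log line would only record that "sort" happened; the span also records how long it took and that it ran inside "collect".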
The LazyFrame.profile() information seems like the kind of thing we'd want. Here's a hacky version:
from typing import Any, Union

import polars as pl

class Tracer:
    def on_collect(self, input: pl.LazyFrame, output: pl.DataFrame, profile: pl.DataFrame, plan: Union[str, bytes]) -> None:
        # this would be sent to a remote server, not just printed
        print(profile)

def _collect_patch(df: pl.LazyFrame, *args: Any, **kwargs: Any) -> pl.DataFrame:
    # serialized logical plan (a JSON string or bytes, depending on the polars version)
    plan = df.serialize()
    # profile() runs the query and returns (result, per-node timings)
    res, stats = df.profile(*args, **kwargs)
    for tracer in pl.tracers:
        tracer.on_collect(df, res, stats, plan)
    return res

# hack: stash the tracers on the polars module and replace LazyFrame.collect
pl.tracers = [Tracer()]
pl.LazyFrame.collect = _collect_patch

df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).lazy()
print(df.sort('a').collect())
The idea would be for polars to provide a hook so that we don't need to monkey patch like this.
I assume a LazyFrame computes a plan on each call in a chain (e.g. .sort() creates a new LazyFrame with a new plan), so a hook for that, called only with the .serialize() result, would make sense as well.
Is it possible to get similar information for each step of execution to a DataFrame?