[Data] - Ray Data Compute Expressions
Ray Data’s Expression Namespace System
Summary
Ray Data recently added type-specific expression namespaces that use PyArrow compute functions under the hood:
- .str: string operations
- .list: list / variable-length array operations
- .struct: struct field access
These significantly improve usability, readability, and discoverability of expressions.
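To illustrate why namespaces help with discoverability, here is a minimal, self-contained sketch of the accessor pattern in plain Python. It is not Ray Data's implementation: `ToyExpr`, `ToyStringNamespace`, `ToyListNamespace`, and `toy_col` are hypothetical names invented for this example, and expressions are evaluated over a list of dicts rather than Arrow data.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass(frozen=True)
class ToyExpr:
    """Hypothetical stand-in for an expression: a function from a row to a value."""

    fn: Callable[[Row], Any]

    @property
    def str(self) -> "ToyStringNamespace":
        # String operations are grouped behind `.str` instead of living on the
        # expression itself, so tab completion only shows string methods here.
        return ToyStringNamespace(self)

    @property
    def list(self) -> "ToyListNamespace":
        # List operations are grouped behind `.list` in the same way.
        return ToyListNamespace(self)

    def eval(self, rows: List[Row]) -> List[Any]:
        # Evaluate the expression once per row.
        return [self.fn(row) for row in rows]


@dataclass(frozen=True)
class ToyStringNamespace:
    expr: ToyExpr

    def upper(self) -> ToyExpr:
        return ToyExpr(lambda row: self.expr.fn(row).upper())

    def is_digit(self) -> ToyExpr:
        return ToyExpr(lambda row: self.expr.fn(row).isdigit())


@dataclass(frozen=True)
class ToyListNamespace:
    expr: ToyExpr

    def len(self) -> ToyExpr:
        return ToyExpr(lambda row: len(self.expr.fn(row)))


def toy_col(name: str) -> ToyExpr:
    """Hypothetical analogue of `col`: read one field from each row."""
    return ToyExpr(lambda row: row[name])


rows = [{"name": "ada", "tags": ["a", "b"]}, {"name": "42", "tags": []}]
upper_names = toy_col("name").str.upper().eval(rows)
digit_flags = toy_col("name").str.is_digit().eval(rows)
tag_counts = toy_col("tags").list.len().eval(rows)
```

Each namespace method returns a new expression, so calls chain the same way regardless of which namespace they come from.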
This proposal outlines the next phase: completing the namespace system by adding:
- .dt: datetime operations
- .arr: fixed-size array operations
- .map: map/dict-like operations
- .image: image-specific operations (blur, etc.)
- .uri: download(), upload()...
Proposed Additions
| Namespace / Location | Category | Functions |
|---|---|---|
| Expr (direct) | Arithmetic | negate, sign, power, abs |
| Expr (direct) | Rounding | ceil, floor, round, trunc |
| Expr (direct) | Logarithmic | ln, log10, log2, exp |
| Expr (direct) | Trigonometric | sin, cos, tan, asin, acos, atan |
| Expr (direct) | Null Handling | fill_null, is_finite, is_inf, is_nan |
| Expr (direct) | Type Conversion | cast |
| .str | Predicates | is_alpha, is_digit, is_lower, is_upper |
| .str | Transforms | upper, lower, capitalize, reverse |
| .str | Manipulation | replace, strip, slice, repeat |
| .str | Padding | lpad, rpad, center |
| .str | Splitting | split, split_pattern |
| .str | Extraction | extract, find, count_substring |
| .dt | Extraction | year, month, day, hour, minute, second |
| .dt | Formatting | strftime |
| .dt | Timezone | assume_timezone, tz_convert |
| .list | Operations | len, get, slice, sort, flatten |
| .list | Aggregations | sum, mean, min, max |
| .list | Set Operations | union, intersection, difference |
| .struct | Operations | field, field_by_index |
| .map | Operations | keys, values |
| .arr | Operations | flatten, to_list |
| .uri | Multimodal | download, upload |
| .image | Multimodal | resize, gaussian_blur (and others) |
Caveat
Some of these expressions require multiple stages to execute (e.g. .uri.download()).
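One possible reading of this caveat is that an expression performing I/O cannot be fused with pure per-batch compute, so a planner would place it in a stage of its own. The sketch below only illustrates that idea. `Step`, `needs_io`, and `split_into_stages` are hypothetical names, and nothing here reflects how Ray Data actually plans these expressions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """Hypothetical description of one expression in a chain."""

    name: str
    needs_io: bool = False  # assumption: I/O steps cannot be fused with compute


def split_into_stages(steps: List[Step]) -> List[List[str]]:
    """Group consecutive compute steps, giving every I/O step its own stage."""
    stages: List[List[str]] = []
    current: List[str] = []
    for step in steps:
        if step.needs_io:
            # Flush any pending compute steps before the I/O stage.
            if current:
                stages.append(current)
                current = []
            stages.append([step.name])
        else:
            current.append(step.name)
    if current:
        stages.append(current)
    return stages


plan = split_into_stages(
    [
        Step("str.strip"),
        Step("uri.download", needs_io=True),
        Step("image.resize"),
        Step("image.gaussian_blur"),
    ]
)
```

Under these assumptions the chain splits into three stages: the string cleanup, the download, and the two image operations fused together.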
Example Usage
Datetime
ds = ds.with_column("year", col("ts").dt.year())
ds = ds.with_column("pretty", col("ts").dt.strftime("%Y-%m-%d"))
ds = ds.with_column("next_hour", col("ts").dt.ceil("hour"))
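As a reference for what the three columns above would contain, here is a plain-Python sketch of the per-row semantics using only the standard library. It assumes the proposed `.dt` methods mirror the PyArrow compute functions `year`, `strftime`, and `ceil_temporal`, and in particular that `ceil` leaves values already on an hour boundary unchanged (PyArrow's default). `ceil_to_hour` and `reference_row` are hypothetical helper names.

```python
from datetime import datetime, timedelta


def ceil_to_hour(ts: datetime) -> datetime:
    """Round up to the next whole hour; values already on the hour are unchanged."""
    floored = ts.replace(minute=0, second=0, microsecond=0)
    return floored if floored == ts else floored + timedelta(hours=1)


def reference_row(ts: datetime) -> dict:
    """Compute the three derived columns from the example for a single timestamp."""
    return {
        "year": ts.year,
        "pretty": ts.strftime("%Y-%m-%d"),
        "next_hour": ceil_to_hour(ts),
    }


sample = reference_row(datetime(2024, 3, 5, 14, 30, 15))
```

For the sample timestamp 2024-03-05 14:30:15, this yields the year 2024, the string "2024-03-05", and the timestamp 2024-03-05 15:00:00.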
I would love to help on this!
@ryankert01 Go for it! Ideally I'd like to break this up into multiple PRs (expression based ones, arr, datetime etc.) so it's easier to review.
If you have any questions, please feel free to ask me.
Can I help with some of this?
Definitely!
This is nice.
For the URI upload/download operations, how do we handle credentials, retries, etc. for cloud storage? I have been using a few customized PyArrow filesystems with Ray Data.
Nice feature! Can I extend some operations in the namespaces myself? It would be great if the system were extensible.
I am interested. Can I help with some of this?