[Data] - Ray Data Compute Expressions

Open goutamvenkat-anyscale opened this issue 2 months ago • 7 comments

Ray Data’s Expression Namespace System

Summary

Ray Data recently added type-specific expression namespaces that use PyArrow compute functions under the hood:

  • .str — string operations
  • .list — list / variable-length array operations
  • .struct — struct field access

These significantly improve usability, readability, and discoverability of expressions.
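
To make the accessor idea concrete, here is a minimal pure-Python sketch of how a namespace such as .str can hang off an expression object. It is not Ray Data's implementation: the Expr and StrNamespace classes, the evaluate callable, and the row-at-a-time evaluation are all hypothetical simplifications (the real expressions operate on columns through PyArrow compute).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for an expression node (not Ray Data's class)."""

    # Evaluates the expression against a single row, for illustration only.
    evaluate: Callable[[Dict[str, Any]], Any]

    @property
    def str(self) -> "StrNamespace":
        # Accessing .str returns an object that exposes only string methods,
        # which is what makes the operations easy to discover.
        return StrNamespace(self)


@dataclass(frozen=True)
class StrNamespace:
    """Hypothetical namespace grouping string operations on an Expr."""

    expr: Expr

    def upper(self) -> Expr:
        return Expr(lambda row: self.expr.evaluate(row).upper())

    def is_digit(self) -> Expr:
        return Expr(lambda row: self.expr.evaluate(row).isdigit())


def col(name: str) -> Expr:
    """Hypothetical column reference: looks the column up in the row."""
    return Expr(lambda row: row[name])


row = {"name": "ray", "zip": "94105"}
upper_name = col("name").str.upper().evaluate(row)
zip_is_digit = col("zip").str.is_digit().evaluate(row)
```

Grouping methods behind a property keeps the base expression class small and lets each namespace be reviewed and extended separately.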

This proposal outlines the next phase: completing the namespace system by adding:

  • .dt — datetime operations
  • .arr — fixed-size array operations
  • .map — map/dict-like operations
  • .image — image-specific operations (blur, etc.)
  • .uri — download(), upload() ...

Proposed Additions

| Namespace / Location | Category        | Functions                               |
|----------------------|-----------------|-----------------------------------------|
| Expr (direct)        | Arithmetic      | negate, sign, power, abs                |
| Expr (direct)        | Rounding        | ceil, floor, round, trunc               |
| Expr (direct)        | Logarithmic     | ln, log10, log2, exp                    |
| Expr (direct)        | Trigonometric   | sin, cos, tan, asin, acos, atan         |
| Expr (direct)        | Null Handling   | fill_null, is_finite, is_inf, is_nan    |
| Expr (direct)        | Type Conversion | cast                                    |
| .str                 | Predicates      | is_alpha, is_digit, is_lower, is_upper  |
| .str                 | Transforms      | upper, lower, capitalize, reverse       |
| .str                 | Manipulation    | replace, strip, slice, repeat           |
| .str                 | Padding         | lpad, rpad, center                      |
| .str                 | Splitting       | split, split_pattern                    |
| .str                 | Extraction      | extract, find, count_substring          |
| .dt                  | Extraction      | year, month, day, hour, minute, second  |
| .dt                  | Formatting      | strftime                                |
| .dt                  | Timezone        | assume_timezone, tz_convert             |
| .list                | Operations      | len, get, slice, sort, flatten          |
| .list                | Aggregations    | sum, mean, min, max                     |
| .list                | Set Operations  | union, intersection, difference         |
| .struct              | Operations      | field, field_by_index                   |
| .map                 | Operations      | keys, values                            |
| .arr                 | Operations      | flatten, to_list                        |
| .uri                 | Multimodal      | download, upload                        |
| .image               | Multimodal      | resize, gaussian_blur (and others)      |

Caveat

Some of these expressions require multiple stages (e.g., .uri.download()).

Example Usage

Datetime

ds = ds.with_column("year", col("ts").dt.year())
ds = ds.with_column("pretty", col("ts").dt.strftime("%Y-%m-%d"))
ds = ds.with_column("next_hour", col("ts").dt.ceil("hour"))
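
For reference, the per-value results these three expressions are meant to produce can be sketched with the standard library alone. This is only an illustration of the intended semantics, not the proposed implementation: the ceil_to_hour helper is hypothetical, and it assumes that .dt.ceil("hour") rounds up to the next hour boundary and leaves exact hours unchanged.

```python
from datetime import datetime, timedelta


def ceil_to_hour(ts: datetime) -> datetime:
    """Round up to the next hour boundary; exact hours are returned unchanged."""
    floored = ts.replace(minute=0, second=0, microsecond=0)
    return floored if floored == ts else floored + timedelta(hours=1)


ts = datetime(2025, 11, 16, 5, 11, 30)

year = ts.year                     # mirrors col("ts").dt.year()
pretty = ts.strftime("%Y-%m-%d")   # mirrors col("ts").dt.strftime("%Y-%m-%d")
next_hour = ceil_to_hour(ts)       # assumed meaning of col("ts").dt.ceil("hour")
```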

goutamvenkat-anyscale avatar Nov 16 '25 05:11 goutamvenkat-anyscale

I would love to help on this!

ryankert01 avatar Nov 16 '25 16:11 ryankert01

@ryankert01 Go for it! Ideally I'd like to break this up into multiple PRs (expression based ones, arr, datetime etc.) so it's easier to review.

If you have any questions, please feel free to ask me.

goutamvenkat-anyscale avatar Nov 17 '25 05:11 goutamvenkat-anyscale

Can I help with some of this?

400Ping avatar Nov 17 '25 06:11 400Ping

Definitely!

ryankert01 avatar Nov 17 '25 06:11 ryankert01

This is nice. For URI upload/download, how do we handle credentials, retries, etc. for cloud storage? I have been using a few customized (PyArrow) filesystems with Ray Data.

wingkitlee0 avatar Nov 19 '25 01:11 wingkitlee0

Nice feature! Can I extend some operations in the namespaces myself? It would be great if the system were extensible.

codingl2k1 avatar Nov 26 '25 09:11 codingl2k1

I am interested. Can I help with some of this?

myandpr avatar Dec 10 '25 02:12 myandpr