ibis icon indicating copy to clipboard operation
ibis copied to clipboard

[HELP NEEDED] How to insert `RAND(seed)` in `ORDER BY` clause for impala backend?

Open 94929 opened this issue 10 months ago • 2 comments

Problem

I need deterministic random sampling within partitions using a seed value. I want to generate SQL like ORDER BY RAND(42) but cannot find a way to pass a seed parameter to ibis random functions.

Current Code

# This works for random sampling (no seed)
order_by = [ibis.random()]
window_spec = ibis.window(group_by=['day_partition', 'hour_partition'], order_by=order_by)

Desired SQL Output

ROW_NUMBER() OVER (PARTITION BY day_partition, hour_partition ORDER BY RAND(42))

Attempted Solutions

  1. ibis.random() with seed - No seed parameter supported
  2. ops.RandomScalar(seed=42) - TypeError: got an unexpected keyword argument 'seed'
  3. ibis.sql(f"RAND({seed})") - AttributeError: module 'ibis' has no attribute 'sql'
  4. ibis.literal(f"RAND({seed})") - Produces string literal 'RAND(42)' instead of function call
  5. ibis.param("int64") - Results in ORDER BY 42 (constant value, no randomness)

Thanks in advance.

94929 avatar Jun 01 '25 13:06 94929

There are a number of issues with making an API for this. See here for a discussion.

In the meantime, for Impala, you should be able to do this:

In [38]: con = ibis.impala.connect()

In [39]: t = con.create_table("t", obj=pd.DataFrame({"a": [1, 2, 3]}))

In [40]: t
Out[40]:
┏━━━━━━━┓
┃ a     ┃
┡━━━━━━━┩
│ int64 │
├───────┤
│     3 │
│     2 │
│     1 │
└───────┘

In [41]: randomized = t.sql("select * from t order by rand(42)")

In [42]: randomized
Out[42]:
┏━━━━━━━┓
┃ a     ┃
┡━━━━━━━┩
│ int64 │
├───────┤
│     2 │
│     3 │
│     1 │
└───────┘

cpcloud avatar Jun 01 '25 13:06 cpcloud

Thanks for the explanation! I applied the method you suggested : )

94929 avatar Jun 05 '25 04:06 94929