cudf icon indicating copy to clipboard operation
cudf copied to clipboard

[QST] Changing `cudf::rolling_window` API to have exclusive `preceding` parameter

Open codereport opened this issue 4 years ago • 12 comments

Should we change cudf::rolling_window API so that the preceding parameter is exclusive?

This is the current API for cudf::rolling_window:

std::unique_ptr<column> rolling_window(
  column_view const& input,
  size_type preceding_window,
  size_type following_window,
  size_type min_periods,
  std::unique_ptr<aggregation> const& agg,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

The question is whether preceding should be exclusive or inclusive. Current examples of behaviour:

// auto col = {1, 2, 3, 4, 5};
cudf::rolling_window(col, 2, 1, 1, ???); // gets you [1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5]
cudf::rolling_window(col, 2, 1, 1, min); //            1,        1,         2,         3,       4
cudf::rolling_window(col, 2, 1, 1, max); //            2,        3,         4,         5,       5 
cudf::rolling_window(col, 2, 1, 1, sum); //            3,        6,         9,        12,       9

// auto col = {1, 2, 3, 4, 5};
cudf::rolling_window(col, 1, 1, 1, ???); // gets you [1, 2], [2, 3], [3, 4], [4, 5], [5]
cudf::rolling_window(col, 1, 1, 1, min); //            1,      2,      3,      4,     5
cudf::rolling_window(col, 1, 1, 1, max); //            2,      3,      4,      5,     5
cudf::rolling_window(col, 1, 1, 1, sum); //            3,      5,      7,      9,     5

// auto col = {1, 2, 3, 4, 5};
cudf::rolling_window(col, 1, 0, 1, ???); // gets you [1], [2], [3], [4], [5]
cudf::rolling_window(col, 1, 0, 1, min); //           1,   2,   3,   4,   5
cudf::rolling_window(col, 1, 0, 1, max); //           1,   2,   3,   4,   5
cudf::rolling_window(col, 1, 0, 1, sum); //           1,   2,   3,   4,   5

Recommendation would be that all of the preceding parameters we be "reduce by 1" and the current index would be included by default. Therefore window length would always be preceding + following + 1. However, need to consider the API that takes a column of window sizes and also the future changes @mythrocks will make.

Previous discussion:

  • https://github.com/rapidsai/cudf/pull/3305#discussion_r352659894
  • https://github.com/rapidsai/cudf/pull/3305#discussion_r353878021

Relevant SQL Links:

  • https://docs.microsoft.com/en-us/sql/t-sql/queries/select-over-clause-transact-sql?view=sql-server-ver15
  • https://www.red-gate.com/simple-talk/sql/learn-sql-server/window-functions-in-sql-server-part-2-the-frame/

codereport avatar Jan 07 '21 01:01 codereport

Another important current example:

// auto col = {1, 2, 3, 4, 5};
cudf::rolling_window(col, 0, 0, 1, ???); // null, null, null, null, null
cudf::rolling_window(col, 0, 0, 1, min); // null, null, null, null, null
etc.

I just worked through the logic of some corner cases, and thought I would document it to help improve understanding and help others work examples. Here's the current logic for computing the window start and end.

size_type preceding_window = preceding_window_begin[i];
size_type following_window = following_window_begin[i];

// compute bounds
size_type start       = min(input.size(), max(0, i - preceding_window + 1));
size_type end         = min(input.size(), max(0, i + following_window + 1));
size_type start_index = min(start, end);
size_type end_index   = max(start, end); 

And the window processing loops all look like for (size_type j = start_index; j < end_index; j++) { ... } -- so end_index is not included.

I guess what is being proposed is to change it so that passing [0, 0, 1] for preceding, following, min_periods gives you the identity, rather than the current behavior where [1, 0, 1] gives you the identity. So my first attempt at the new logic would be:

size_type preceding_window = preceding_window_begin[i];
size_type following_window = following_window_begin[i];

// compute bounds
size_type start       = min(input.size(), max(0, i - preceding_window));
size_type end         = min(input.size(), max(0, i + following_window + 1));
size_type start_index = min(start, end);
size_type end_index   = max(start, end); 

(Leaving the processing loop alone.) Notably, my example above would now produce non-null results.

// auto col = {1, 2, 3, 4, 5};
cudf::rolling_window(col, 0, 0, 1, ???); // [1], [2], [3], [4], [5]
cudf::rolling_window(col, 0, 0, 1, min); //  1,   2,   3,   4,   5
etc.

To recover the previous zero-window-size behavior would require passing -1 for either one of the window sizes.

// auto col = {1, 2, 3, 4, 5};
cudf::rolling_window(col, -1, 0, 1, ???); // null, null, null, null, null
cudf::rolling_window(col, 0, -1, 1, ???); // null, null, null, null, null

But (I think surprisingly), passing -1 for both would be the same as passing zero for both!

// auto col = {1, 2, 3, 4, 5};
cudf::rolling_window(col, -1, -1, 1, ???); // [1], [2], [3], [4], [5]

I would think that (-1, -1) should result in the reverse of (1, 1). This makes me think that we would instead want to change the implementation to this:

size_type preceding_window = preceding_window_begin[i];
size_type following_window = following_window_begin[i];

// compute bounds
size_type start       = min(input.size() - 1, max(0, i - preceding_window));
size_type end         = min(input.size() - 1, max(0, i + following_window));
size_type start_index = min(start, end);
size_type end_index   = max(start, end); 

And change the for loop to be inclusive. for (size_type j = start_index; j <= end_index; j++) { ... }

Important to also note the changes to the first argument to min() for start and end.

// auto col = {1, 2, 3, 4, 5};

// 0, 0, 1 still gives the identity
cudf::rolling_window(col, 0, 0, 1, ???); // [1], [2], [3], [4], [5]

// Note the difference here: 
cudf::rolling_window(col, -1, 0, 1, ???); // [1 2], [2 3], [3 4], [4 5], [5]
cudf::rolling_window(col, 0, -1, 1, ???); // [1], [1 2], [2 3], [3 4], [4 5]

// The above are the same as the reversed, negated windows (this seems consistent)
cudf::rolling_window(col, 0, 1, 1, ???); // [1 2], [2 3], [3 4], [4 5], [5]
cudf::rolling_window(col, 1, 0, 1, ???); // [1], [1 2], [2 3], [3 4], [4 5]

// -1 twice is the same as 1 twice, as expected
cudf::rolling_window(col, 1, 1, 1, ???);   // [1 2], [2 3], [3 4], [4 5], [5]
cudf::rolling_window(col, -1, -1, 1, ???); // [1 2], [2 3], [3 4], [4 5], [5]

But now how do we do a zero window size? The only way is to set min_periods greater than 1.

cudf::rolling_window(col, 0, 0, 2, ???);   // null, null, null, null, null

This kind of bothers me, but at least there is a way. And the rest of the cases are more self consistent than other implementations.

harrism avatar Jan 07 '21 02:01 harrism

BTW, I found the original negative window sizes request. https://github.com/rapidsai/cudf/issues/3650

Interestingly, the reason for the request was to enable windows that DO NOT include the current row/element. So I don't think negative window sizes are strictly necessary, just a way to exclude the current row. The assumption that the current row should always be included is not a valid assumption, it seems.

Let's take the example from that issue:

So for example I may want the MAX of the 5 entries just before the current one.

The existing behavior where one of the two window sizes is exclusive makes this possible (along with negative window sizes).

cudf::rolling_window(col, 6, -1, 1, max); // Admittedly, this is awkward!

The proposed new behavior would make it impossible to express -- the current element is always included.

harrism avatar Jan 07 '21 02:01 harrism

I need to correct myself. Using 5, -1, 1 should work for a window that only includes the 5 elements before the current element, and not the current element!

harrism avatar Jan 08 '21 01:01 harrism

Before we do anything here, we need to gather further requirements. For example @mythrocks mentioned the need for offset windows. @mythrocks can you please elaborate?

harrism avatar Jan 08 '21 01:01 harrism

For example @mythrocks mentioned the need for offset windows. @mythrocks can you please elaborate?

An offset window is one where the two ends of the window do not straddle the row itself. For instance, consider this column:

// auto col = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

Consider a SUM() operation over records in a span [2 ROWS PRECEDING, 1 ROW PRECEDING]. All the rows participating in a window appear before (i.e. to the left) of the current row. E.g. For row at index 5, the sum is 7 (i.e. 3+4). Assuming the current semantics of preceding/following, we might represent this as:

cudf::rolling_window(col, 3, -1, 1, sum); // {ω, 0, 1, 3, 5, 9, 11, 13, 15}

We will need to add checks so that the left bound do not appear to the right of the right bound. (E.g. rolling_window(col 3, -3, ?, ???).)

This gets slightly more involved for timestamp-based queries, where the window might be from 7 DAYS PRECEDING to 1 DAY PRECEDING.

mythrocks avatar Jan 11 '21 23:01 mythrocks

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

github-actions[bot] avatar Feb 16 '21 20:02 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar May 17 '21 21:05 github-actions[bot]

@harrism @mythrocks do either of you recall the motivations for this discussion or know if it's still something that we should be pursuing?

vyasr avatar Oct 17 '22 23:10 vyasr

I'm sorry, I don't.

harrism avatar Oct 18 '22 02:10 harrism

At the risk of repeating what @codereport covered in the description, I can try provide a summary:

Window bounds are defined in CUDF via a "preceding" and "following" row-count. For reasons that predate my work on window functions, the "preceding" count includes the current row. This is not intuitive:

  1. A window defined as [1,1] should intuitively cover 3 rows (1 preceding, 1 current, 1 following), as in SQL. But as per CUDF convention, it covers 2 (1 current, 1 following).
  2. A window of [0,1] is now a negative offset. :/

The Spark code currently passes preceding - 1 before calling into CUDF, to match the semantics.

@codereport's suggestion is to undo this, and make "preceding" "actually preceding" again. But this needs to be done in lockstep with Spark (and Python), to reduce errors. There are also a metric tonne of tests to modify.

mythrocks avatar Oct 19 '22 18:10 mythrocks

I definitely won't stand in your way. :)

harrism avatar Oct 19 '22 20:10 harrism

@mythrocks would you be up for taking this on at some point? I'm happy to help make the Python changes if you'd like help there!

vyasr avatar Oct 21 '22 15:10 vyasr