
h2o.arrange not working on long vector

Open exalate-issue-sync[bot] opened this issue 2 years ago • 5 comments

I was attempting to sort a real dataset after a merge, to get it back to the original row order. I found that for really large datasets this sorting does not work.

library(h2o)
library(data.table)

h2o.init()

# create a 300m row random real column
df <- h2o.createFrame(rows = 3e8, cols = 1, missing_fraction = 0)

# create a row index to be used as a comparison for the sort
df$row_index <- 1
df$row_index <- h2o.cumsum(df$row_index)

# sort the data by that random real
df_sort <- h2o.arrange(df, "C1")

# attempt to get back to the original sorting by sorting on row index
df_resort <- h2o.arrange(df_sort, "row_index")

# compare the row index, row by row, to the sorted
# this should return an h2o frame with 0 rows but it does not
h2o.which(df$row_index != df_resort$row_index)

# C1
# 1 2097151
# 2 2097152
# 3 2097153
# 4 2097154
# 5 2097155
# 6 2097156
# [266338122 rows x 1 column]

I tried to find the approximate threshold where it stops working. On my system, the sort started failing at around 2.685e8 rows; frames smaller than that seemed to sort just fine. Interestingly, the first mismatched row was always right around the number you see above, 2,097,151.

exalate-issue-sync[bot], Feb 21 '23 22:02

JIRA Issue Details

Jira Issue: PUBDEV-8600
Assignee: Thomas Brady
Reporter: Paul Donnelly
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A

h2o-ops, May 10 '23 13:05

I'm the original reporter on Jira. Reposting with better formatting.


I was attempting to sort a real dataset after a merge, to get it back to the original row order. I found that for really large datasets this sorting does not work.

library(h2o)
library(data.table)

h2o.init()

# create a 300m row random real column
df <- h2o.createFrame(rows = 3e8, cols = 1, missing_fraction = 0)

# create a row index to be used as a comparison for the sort
df$row_index <- 1
df$row_index <- h2o.cumsum(df$row_index)

# sort the data by that random real
df_sort <- h2o.arrange(df, "C1")

# attempt to get back to the original sorting by sorting on row index
df_resort <- h2o.arrange(df_sort, "row_index")

# compare the row index, row by row, to the sorted
# this should return an h2o frame with 0 rows but it does not
h2o.which(df$row_index != df_resort$row_index)

# C1
# 1 2097151
# 2 2097152
# 3 2097153
# 4 2097154
# 5 2097155
# 6 2097156
# [266338122 rows x 1 column]

I tried to find the approximate threshold where it stops working. On my system, the sort started failing at around 2.685e8 rows; frames smaller than that seemed to sort just fine. Interestingly, the first mismatched row was always right around the number you see above, 2,097,151.
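The threshold search described above could be automated with a bisection over the row count. The following is only a sketch: `roundtrip_ok` and `find_threshold` are hypothetical helper names, not part of h2o, and the code assumes `h2o.init()` has already been called and that the failure flips exactly once between the two bounds. Each probe builds and sorts a frame of several hundred million rows, so a coarse `tol` is advisable.

```r
# Hypothetical helper: TRUE if sorting by C1 and then by row_index restores
# the original row order for a frame of n rows (same steps as the report).
roundtrip_ok <- function(n) {
  df <- h2o::h2o.createFrame(rows = n, cols = 1, missing_fraction = 0)
  df$row_index <- 1
  df$row_index <- h2o::h2o.cumsum(df$row_index)
  df_resort <- h2o::h2o.arrange(h2o::h2o.arrange(df, "C1"), "row_index")
  h2o::h2o.nrow(h2o::h2o.which(df$row_index != df_resort$row_index)) == 0
}

# Generic bisection: narrows [lo, hi] until hi - lo <= tol and returns lo.
# Assumes ok(lo) is TRUE, ok(hi) is FALSE, and ok changes only once in between.
find_threshold <- function(ok, lo, hi, tol = 1) {
  while (hi - lo > tol) {
    mid <- lo + (hi - lo) %/% 2
    if (ok(mid)) lo <- mid else hi <- mid
  }
  lo
}

# Example call (not run here; bounds are assumptions based on the report):
# find_threshold(roundtrip_ok, lo = 2.6e8, hi = 2.7e8, tol = 1e5)
```

Passing the check as a function keeps the bisection independent of h2o, so it can be tried first with a cheap stand-in predicate.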

hutch3232, Jul 30 '23 13:07

Here is the result I get from running @hutch3232 code:

h2o.which(df$row_index != df_resort$row_index)
  C1
1 2097154
2 2097155
3 2097156
4 2097157
5 2097158
6 2097159

wendycwong, Jun 05 '24 15:06

The problem is worse than I thought. For a frame of 269000001 rows, 266338103 rows are sorted wrong. Here is an example of the wrongly sorted rows. The columns are: row index, correct row content, wrong row content.

[Screenshot 2024-06-12 at 8 56 20 AM]
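For scale, the counts quoted in this thread can be put side by side with some plain arithmetic. The snippet below only restates reported numbers; the variable names are made up for illustration, and the comparison with powers of two is a numerical observation, not a diagnosis of the bug.

```r
# Figures quoted in this thread
n_total_a <- 3e8        # rows in the original report
n_wrong_a <- 266338122  # mismatched rows in the original report
n_total_b <- 269000001  # rows in the later test
n_wrong_b <- 266338103  # mismatched rows in the later test
first_bad <- 2097151    # first mismatched row in the original report
threshold <- 2.685e8    # approximate row count where sorting starts to fail

n_right_a <- n_total_a - n_wrong_a     # 33661878 rows still in place
n_right_b <- n_total_b - n_wrong_b     # 2661898 rows still in place
frac_wrong_b <- n_wrong_b / n_total_b  # about 0.99

# The first mismatched row and the threshold both sit next to powers of two
first_bad == 2^21 - 1  # TRUE
threshold / 2^28       # about 1.0002 (2^28 = 268435456)
```

So in the 269000001-row case only about 1% of rows end up in the right position.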

wendycwong, Jun 12 '24 15:06

However, testing with h2o-3.14.0.1, I found that the integer sort is still correct at this point. The next step is to find out where the two versions differ.

wendycwong, Jun 13 '24 22:06