h2o-3
h2o.arrange not working on long vector
I was attempting to sort a real dataset after a merge, to get it back to the original row order. I found that for really large datasets this sorting does not work.
{code:r}
library(h2o)
library(data.table)
h2o.init()
# create a 300m row random real column
df <- h2o.createFrame(rows = 3e8, cols = 1, missing_fraction = 0)
# create a row index to be used as a comparison for the sort
df$row_index <- 1
df$row_index <- h2o.cumsum(df$row_index)
# sort the data by that random real
df_sort <- h2o.arrange(df, "C1")
# attempt to get back to the original sorting by sorting on row index
df_resort <- h2o.arrange(df_sort, "row_index")
# compare the row index, row by row, to the sorted
# this should return an h2o frame with 0 rows but it does not
h2o.which(df$row_index != df_resort$row_index)
# C1
# 1 2097151
# 2 2097152
# 3 2097153
# 4 2097154
# 5 2097155
# 6 2097156
# [266338122 rows x 1 column]
{code}
I tried to find the approximate threshold where it stops working. On my system, sorting started failing at around 2.685e8 rows; smaller frames seemed to sort just fine. Interestingly, the first row where the comparison fails was always right around the number shown above, 2,097,151.
JIRA Issue Details
Jira Issue: PUBDEV-8600
Assignee: Thomas Brady
Reporter: Paul Donnelly
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
I'm the original reporter on Jira. Reposting with better formatting.
I was attempting to sort a real dataset after a merge, to get it back to the original row order. I found that for really large datasets this sorting does not work.
library(h2o)
library(data.table)
h2o.init()
# create a 300m row random real column
df <- h2o.createFrame(rows = 3e8, cols = 1, missing_fraction = 0)
# create a row index to be used as a comparison for the sort
df$row_index <- 1
df$row_index <- h2o.cumsum(df$row_index)
# sort the data by that random real
df_sort <- h2o.arrange(df, "C1")
# attempt to get back to the original sorting by sorting on row index
df_resort <- h2o.arrange(df_sort, "row_index")
# compare the row index, row by row, to the sorted
# this should return an h2o frame with 0 rows but it does not
h2o.which(df$row_index != df_resort$row_index)
# C1
# 1 2097151
# 2 2097152
# 3 2097153
# 4 2097154
# 5 2097155
# 6 2097156
# [266338122 rows x 1 column]
I tried to find the approximate threshold where it stops working. On my system, sorting started failing at around 2.685e8 rows; smaller frames seemed to sort just fine. Interestingly, the first row where the comparison fails was always right around the number shown above, 2,097,151.
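As a side note on the numbers reported above, both sit very close to powers of two. The base R snippet below only checks that arithmetic; the variable names are made up for illustration, and the proximity may be a coincidence, so it should be read as a hint about where to look rather than a diagnosis.

```r
# Figures taken from the report above (variable names are illustrative only).
first_bad_row  <- 2097151   # first mismatched row index in the h2o.which() output
threshold_rows <- 2.685e8   # approximate row count at which the sort started failing

# The first mismatched row is exactly one below 2^21.
matches_2_21 <- first_bad_row == 2^21 - 1     # TRUE

# The approximate failure threshold is within about 0.03% of 2^28 (268435456).
gap_to_2_28  <- threshold_rows - 2^28         # 64544
relative_gap <- gap_to_2_28 / 2^28

print(matches_2_21)
print(gap_to_2_28)
print(relative_gap)
```

Since the threshold was only found approximately, the exact failing row count could well be on either side of 2^28.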
Here is the result I get from running @hutch3232's code:
h2o.which(df$row_index != df_resort$row_index)
# C1
# 1 2097154
# 2 2097155
# 3 2097156
# 4 2097157
# 5 2097158
# 6 2097159
The problem is worse than I thought. For a frame of 269000001 rows, 266338103 rows are sorted incorrectly. Here is an example of the wrongly sorted rows. Each entry lists the row index, the correct row content, and the wrong row content:
However, testing with h2o-3.14.0.1, I found that the integer sort is still correct at this point. The next step is to find out where the two versions differ.
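To make the comparison between versions repeatable, the reproduction from the report can be wrapped in a small helper and run under each h2o version at a few row counts. This is only a sketch: `count_missorted` is a hypothetical name, it assumes the h2o package is installed and that `h2o::h2o.init()` has already been called, and it uses only the calls that appear in the original report plus `h2o::h2o.rm()` for cleanup.

```r
# Hypothetical helper: returns how many rows end up in the wrong position
# after sorting by the random column and then sorting back by row index.
# Assumes a running cluster, i.e. h2o::h2o.init() was called beforehand.
count_missorted <- function(n_rows) {
  # one random column, no missing values (same call as in the report)
  df <- h2o::h2o.createFrame(rows = n_rows, cols = 1, missing_fraction = 0)

  # row index 1..n_rows, built as a cumulative sum over a column of ones
  df$row_index <- 1
  df$row_index <- h2o::h2o.cumsum(df$row_index)

  # shuffle by sorting on the random column, then try to restore the order
  df_sort   <- h2o::h2o.arrange(df, "C1")
  df_resort <- h2o::h2o.arrange(df_sort, "row_index")

  # positions where the restored index differs from the original index
  mismatches <- h2o::h2o.which(df$row_index != df_resort$row_index)
  n_bad <- nrow(mismatches)

  # free cluster memory before the next run
  h2o::h2o.rm(df)
  h2o::h2o.rm(df_sort)
  h2o::h2o.rm(df_resort)
  h2o::h2o.rm(mismatches)

  n_bad
}

# Example usage (needs a cluster with enough memory for frames of this size):
# sapply(c(2.6e8, 2.68e8, 2.69e8), count_missorted)
```

A result of 0 means the round trip restored the original order; running the same row counts under both versions would show whether the regression depends on frame size.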