positron
positron copied to clipboard
Error when expanding large R data frame in Variables tab
Discussed in https://github.com/posit-dev/positron-beta/discussions/220
Originally posted by mikemahoney218 June 20, 2024 Hi all,
When clicking to expand a large data frame in the Variables tab, I get the following error:
What happens next seems to vary; I've sometimes had this freeze the console so all input is ignored, and sometimes I've had the arrow in the Variables pane then turn down, even though it hasn't been expanded:
If the console freezes, I need to restart Positron entirely, not just the R session, to get it back.
I can reproduce this by creating an enormous dataset -- try repeatedly doubling the rows and columns of the nycflights13::flights dataset until it is about the same size as Mike's dataset. It won't expand in the Variables pane any longer.
> foo <- nycflights13::flights
> foo <- rbind(foo, foo)
> foo <- rbind(foo, foo)
> foo <- rbind(foo, foo)
> foo <- rbind(foo, foo)
> foo <- cbind(foo, foo)
Reported again in https://github.com/posit-dev/positron/issues/3833.
Reported on Mastodon too: https://mastodon.online/@[email protected]/112966300026506969
Reported again in https://github.com/posit-dev/positron/issues/5270 with this example:
# Need to do this first
# devtools::install_github('satijalab/seurat-data')
# SeuratData::InstallData("pbmc3k")
library(Seurat)
library(SeuratData)
pbmc3k.final <- LoadData("pbmc3k", type = "pbmc3k.final")
print("hi")
Do str(pbmc3k.final) to see this enormous, complicated object that hangs the Variables pane.
Hi, I also just wanted to add that I ran into the same issue and suggest a maybe even easier way to reproduce the problem without downloading some extra files:
system.time({
mat <- matrix(0, nrow = 1000, ncol = 1e5)
sce <- SingleCellExperiment::SingleCellExperiment(list(mat = mat))
})
print("hi")
The creation of the SingleCellExperiment object takes 80ms on my computer, the console then freezes for 30 seconds before printing "hi".
And also just to add, when I run the same code in a jupyter console with the ark kernel, there is no additional delay
Reported again in https://github.com/posit-dev/positron/discussions/5333.
Still able to repro with Jonathan's example:
> foo <- nycflights13::flights
> foo <- rbind(foo, foo)
> foo <- rbind(foo, foo)
> foo <- rbind(foo, foo)
> foo <- rbind(foo, foo)
> foo <- cbind(foo, foo)
with: Positron Version: 2024.12.0 (Universal) build 66
The existing problem with flights is that we are very slowly formatting the POSIXct columns.
We can see that removing the POSIXct columns fixes the problem:
f <- foo[, !sapply(foo, \(x) inherits(x, "POSIXct"))]
Probably because of https://github.com/posit-dev/ark/blob/f3fffe3d41bf0b33a7e34097c05f3b685b6bfaea/crates/harp/src/vector/formatted_vector.rs#L122-L129
We'll need a better way to iterate over an object that doesn't require formatting the whole column.
Just out of curiosity, I’m not a professional, just offering a suggestion. Would it be possible to solve most of similar issues by just simply capturing the output of str()? This way, it could achieve complete compatibility with most scenarios and any data format (whether it’s a matrix, S4, or R6). Alternatively, could the same formatting approach behind str() be used to obtain a summary of the data? I noticed that the variable pane in RStudio seems to achieve this in a similar way as well (I’m not sure,but I’ve observed that RStudio’s display is identical to the output of str()).
foo <- nycflights13::flights
foo <- rbind(foo, foo)
foo <- rbind(foo, foo)
foo <- rbind(foo, foo)
foo <- rbind(foo, foo)
foo <- cbind(foo, foo)
str(foo)
#> 'data.frame': 5388416 obs. of 38 variables:
#> $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
#> $ month : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ day : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
#> $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
#> $ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
#> $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
#> $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
#> $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ...
#> $ carrier : chr "UA" "UA" "AA" "B6" ...
#> $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
#> $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
#> $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
#> $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
#> $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
#> $ distance : num 1400 1416 1089 1576 762 ...
#> $ hour : num 5 5 5 5 6 5 6 6 6 6 ...
#> $ minute : num 15 29 40 45 0 58 0 0 0 0 ...
#> $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
#> $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
#> $ month : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ day : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
#> $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
#> $ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
#> $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
#> $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
#> $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ...
#> $ carrier : chr "UA" "UA" "AA" "B6" ...
#> $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
#> $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
#> $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
#> $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
#> $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
#> $ distance : num 1400 1416 1089 1576 762 ...
#> $ hour : num 5 5 5 5 6 5 6 6 6 6 ...
#> $ minute : num 15 29 40 45 0 58 0 0 0 0 ...
#> $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
Created on 2024-11-26 with reprex v2.0.2
In RStudio Variable Pane:
There are pros and cons of using str for displaying the info. The main feature we want to support and using str makes it harder is adding options to expand nested items. For instance, in RStudio you either expand the whole list or not, while in Positron you can selectively expand nested elements too.
Another problem with str is that there's no real guarantee that it won't do a lot of work too. Since it's an s3 generic, it could be that a class didn't nicely implement it and it takes a lot of time to execute. Dependning on the implementation it may also materialize ALTREP objects, etc too.
@dfalbel Thank you for the clarification! I understand and respect your decision. I truly appreciate all the effort and dedication you’ve put into ARK.
That said, some of RStudio’s approaches might be worth considering, such as skipping particularly large objects, setting rendering timeouts (like the in valueFromStr setted withTimeLimit , and interrupting operations when the timeout is exceeded. While these methods are relatively straightforward and not very refined, they greatly enhance the user experience.
https://github.com/rstudio/rstudio/pull/47
https://github.com/rstudio/rstudio/blob/main/src/cpp/session/modules/SessionEnvironment.R
.rs.addFunction("valueFromStr", function(val)
{
.rs.withTimeLimit(1, fail = "<truncated>", {
capture.output(try(str(val), silent = TRUE))
})
})
...
if (is.symbol(obj))
{
val <- as.character(obj)
}
else if (is.language(obj))
{
val <- .rs.describeCall(obj)
}
else if (!hasNullPtr)
{
# for large objects (> half MB), don't try to get the value, just show
# the size. Some functions (e.g. str()) can cause the object to be
# copied, which is slow for large objects.
if (size > 524288)
{
len_desc <- if (len > 1)
{
paste(len, " elements, ", sep = "")
}
else
{
""
}
# data frames are likely to be large, but a summary is still helpful
if (is.data.frame(obj))
{
val <- "NO_VALUE"
desc <- .rs.valueDescription(obj)
}
else
{
fmt <- "Large %s (%s %s)"
val <- sprintf(fmt, class, len_desc, format(size, units = "auto", standard = "SI"))
}
contents_deferred <- TRUE
}
else
{
val <- .rs.valueAsString(obj)
desc <- .rs.valueDescription(obj)
# expandable object--supply contents
if (is.list(obj) || is.data.frame(obj) || isS4(obj) ||
inherits(obj, c("data.table", "ore.frame", "cast_df", "xts", "DataFrame")))
{
if (computeSize)
{
# normal object
contents <- .rs.valueContents(obj)
}
else
{
# don't prefetch content for altreps
val <- "NO_VALUE"
contents_deferred <- TRUE
}
}
}
}
In most cases, users only need a rough preview (especially the data hierarchy). For more detailed exploration, they are more likely to subset the object to inspect specific items or use tools like Data Explorer, rather than being limited to the variable pane. :)
Thanks @ZhimingYe We definitely plan to implement similar mechanisms in Positron. See e.g. https://github.com/posit-dev/positron/issues/1419#issuecomment-1810222809. We still don't have the entire infrastructure for that though.
In the meantime, the fixes we have pushed will probably solve most of the problems, as a lot of the performance cost was caused by bugs in the formatting routine, triggering unnecessarily large computations.
Verified Fixed
Positron Version(s) : 2025.01.0-124
OS Version : OSX
Test scenario(s)
Verified with Jonathans repro steps. Looks great.
Link(s) to TestRail test cases run or created:
Hello,
Just letting you know that I started getting this error after installing the latest version on my system.
Error refreshing variables: RPC timed out after 5 seconds: {"jsonrpc":"2.0","method":"list"}
Error (-32603)
Positron Version: 2025.01.0 (system setup) build 152
Code - OSS Version: 1.95.0
Commit: 66aa3fb7f98eb8d19155cb7220856154f6ede8b3
Date: 2025-01-06T02:53:20.465Z
Electron: 32.2.1
Chromium: 128.0.6613.186
Node.js: 20.18.0
V8: 12.8.374.38-electron.0
OS: Windows_NT x64 10.0.26100
I did not see this error before with the previous version
This new error you are seeing @pieterjanvc is https://github.com/posit-dev/positron/issues/5813 and the fix will be included in the next release. Thanks for reporting!