positron icon indicating copy to clipboard operation
positron copied to clipboard

Error when expanding large R data frame in Variables tab

Open jmcphers opened this issue 1 year ago • 3 comments

Discussed in https://github.com/posit-dev/positron-beta/discussions/220

Originally posted by mikemahoney218 June 20, 2024 Hi all,

When clicking to expand a large data frame in the Variables tab, I get the following error: image

What happens next seems to vary; I've sometimes had this freeze the console so all input is ignored, and sometimes I've had the arrow in the Variables pane then turn down, even though it hasn't been expanded: image

If the console freezes, I need to restart Positron entirely, not just the R session, to get it back.

jmcphers avatar Jun 22 '24 00:06 jmcphers

I can reproduce this by creating an enormous dataset -- try repeatedly doubling the rows and columns of the nycflights13::flights dataset until it is about the same size as Mike's dataset. It won't expand in the Variables pane any longer.

> foo <- nycflights13::flights
> foo <- rbind(foo, foo)
> foo <- rbind(foo, foo)
> foo <- rbind(foo, foo)
> foo <- rbind(foo, foo)
> foo <- cbind(foo, foo)
image

jmcphers avatar Jun 22 '24 00:06 jmcphers

Reported again in https://github.com/posit-dev/positron/issues/3833.

jmcphers avatar Jul 03 '24 14:07 jmcphers

Reported on Mastodon too: https://mastodon.online/@[email protected]/112966300026506969

jmcphers avatar Aug 15 '24 16:08 jmcphers

Reported again in https://github.com/posit-dev/positron/issues/5270 with this example:

# Need to do this first
# devtools::install_github('satijalab/seurat-data')
# SeuratData::InstallData("pbmc3k")
library(Seurat)
library(SeuratData)

pbmc3k.final <- LoadData("pbmc3k", type = "pbmc3k.final")

print("hi")

Do str(pbmc3k.final) to see this enormous, complicated object that hangs the Variables pane.

juliasilge avatar Nov 07 '24 18:11 juliasilge

Hi, I also just wanted to add that I ran into the same issue and suggest a maybe even easier way to reproduce the problem without downloading some extra files:

system.time({
  mat <- matrix(0, nrow = 1000, ncol = 1e5)
  sce <- SingleCellExperiment::SingleCellExperiment(list(mat = mat))
})
print("hi")

The creation of the SingleCellExperiment object takes 80ms on my computer, the console then freezes for 30 seconds before printing "hi".


And also just to add, when I run the same code in a jupyter console with the ark kernel, there is no additional delay

const-ae avatar Nov 11 '24 14:11 const-ae

Reported again in https://github.com/posit-dev/positron/discussions/5333.

jmcphers avatar Nov 12 '24 16:11 jmcphers

Still able to repro with Jonathan's example:

> foo <- nycflights13::flights
> foo <- rbind(foo, foo)
> foo <- rbind(foo, foo)
> foo <- rbind(foo, foo)
> foo <- rbind(foo, foo)
> foo <- cbind(foo, foo)

with: Positron Version: 2024.12.0 (Universal) build 66

testlabauto avatar Nov 25 '24 19:11 testlabauto

The existing problem with flights is that we are very slowly formatting the POSIXct columns. We can see that removing the POSIXct columns fixes the problem:

f <- foo[, !sapply(foo, \(x) inherits(x, "POSIXct"))]

Probably because of https://github.com/posit-dev/ark/blob/f3fffe3d41bf0b33a7e34097c05f3b685b6bfaea/crates/harp/src/vector/formatted_vector.rs#L122-L129

We'll need a better way to iterate over an object that doesn't require formatting the whole column.

dfalbel avatar Nov 25 '24 19:11 dfalbel

Just out of curiosity, I’m not a professional, just offering a suggestion. Would it be possible to solve most of similar issues by just simply capturing the output of str()? This way, it could achieve complete compatibility with most scenarios and any data format (whether it’s a matrix, S4, or R6). Alternatively, could the same formatting approach behind str() be used to obtain a summary of the data? I noticed that the variable pane in RStudio seems to achieve this in a similar way as well (I’m not sure,but I’ve observed that RStudio’s display is identical to the output of str()).

foo <- nycflights13::flights
foo <- rbind(foo, foo)
foo <- rbind(foo, foo)
foo <- rbind(foo, foo)
foo <- rbind(foo, foo)
foo <- cbind(foo, foo)
str(foo)
#> 'data.frame':    5388416 obs. of  38 variables:
#>  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
#>  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
#>  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
#>  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
#>  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
#>  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
#>  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
#>  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
#>  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
#>  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
#>  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
#>  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
#>  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
#>  $ distance      : num  1400 1416 1089 1576 762 ...
#>  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
#>  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
#>  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
#>  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
#>  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
#>  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
#>  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
#>  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
#>  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
#>  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
#>  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
#>  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
#>  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
#>  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
#>  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
#>  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
#>  $ distance      : num  1400 1416 1089 1576 762 ...
#>  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
#>  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
#>  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

Created on 2024-11-26 with reprex v2.0.2

In RStudio Variable Pane: Image

ZhimingYe avatar Nov 26 '24 13:11 ZhimingYe

There are pros and cons of using str for displaying the info. The main feature we want to support and using str makes it harder is adding options to expand nested items. For instance, in RStudio you either expand the whole list or not, while in Positron you can selectively expand nested elements too.

Another problem with str is that there's no real guarantee that it won't do a lot of work too. Since it's an s3 generic, it could be that a class didn't nicely implement it and it takes a lot of time to execute. Dependning on the implementation it may also materialize ALTREP objects, etc too.

dfalbel avatar Nov 26 '24 14:11 dfalbel

@dfalbel Thank you for the clarification! I understand and respect your decision. I truly appreciate all the effort and dedication you’ve put into ARK.

That said, some of RStudio’s approaches might be worth considering, such as skipping particularly large objects, setting rendering timeouts (like the in valueFromStr setted withTimeLimit , and interrupting operations when the timeout is exceeded. While these methods are relatively straightforward and not very refined, they greatly enhance the user experience.

https://github.com/rstudio/rstudio/pull/47

https://github.com/rstudio/rstudio/blob/main/src/cpp/session/modules/SessionEnvironment.R

.rs.addFunction("valueFromStr", function(val)
{
  .rs.withTimeLimit(1, fail = "<truncated>", {
    capture.output(try(str(val), silent = TRUE))
  })
  
})

...

if (is.symbol(obj))
{
  val <- as.character(obj)
}
else if (is.language(obj))
{
  val <- .rs.describeCall(obj)
}
else if (!hasNullPtr)
{
  # for large objects (> half MB), don't try to get the value, just show
  # the size. Some functions (e.g. str()) can cause the object to be
  # copied, which is slow for large objects.
  if (size > 524288)
  {
    len_desc <- if (len > 1)
    {
      paste(len, " elements, ", sep = "")
    }
    else
    {
      ""
    }
    
    # data frames are likely to be large, but a summary is still helpful
    if (is.data.frame(obj))
    {
      val <- "NO_VALUE"
      desc <- .rs.valueDescription(obj)
    }
    else
    {
      fmt <- "Large %s (%s %s)"
      val <- sprintf(fmt, class, len_desc, format(size, units = "auto", standard = "SI"))
    }
    contents_deferred <- TRUE
  }
  else
  {
    val <- .rs.valueAsString(obj)
    desc <- .rs.valueDescription(obj)
    
    # expandable object--supply contents
    if (is.list(obj) ||  is.data.frame(obj) || isS4(obj) ||
        inherits(obj, c("data.table", "ore.frame", "cast_df", "xts", "DataFrame")))
    {
      if (computeSize)
      {
        # normal object
        contents <- .rs.valueContents(obj)
      }
      else
      {
        # don't prefetch content for altreps
        val <- "NO_VALUE"
        contents_deferred <- TRUE
      }
    }
  }
}

In most cases, users only need a rough preview (especially the data hierarchy). For more detailed exploration, they are more likely to subset the object to inspect specific items or use tools like Data Explorer, rather than being limited to the variable pane. :)

ZhimingYe avatar Nov 26 '24 15:11 ZhimingYe

Thanks @ZhimingYe We definitely plan to implement similar mechanisms in Positron. See e.g. https://github.com/posit-dev/positron/issues/1419#issuecomment-1810222809. We still don't have the entire infrastructure for that though.

In the meantime, the fixes we have pushed will probably solve most of the problems, as a lot of the performance cost was caused by bugs in the formatting routine, triggering unnecessarily large computations.

dfalbel avatar Nov 27 '24 13:11 dfalbel

Verified Fixed

Positron Version(s) : 2025.01.0-124
OS Version          : OSX

Test scenario(s)

Verified with Jonathans repro steps. Looks great.

Link(s) to TestRail test cases run or created:

testlabauto avatar Dec 19 '24 15:12 testlabauto

Hello,

Just letting you know that I started getting this error after installing the latest version on my system.

Error refreshing variables: RPC timed out after 5 seconds: {"jsonrpc":"2.0","method":"list"}

Error (-32603)
Positron Version: 2025.01.0 (system setup) build 152
Code - OSS Version: 1.95.0
Commit: 66aa3fb7f98eb8d19155cb7220856154f6ede8b3
Date: 2025-01-06T02:53:20.465Z
Electron: 32.2.1
Chromium: 128.0.6613.186
Node.js: 20.18.0
V8: 12.8.374.38-electron.0
OS: Windows_NT x64 10.0.26100

I did not see this error before with the previous version

pieterjanvc avatar Jan 09 '25 16:01 pieterjanvc

This new error you are seeing @pieterjanvc is https://github.com/posit-dev/positron/issues/5813 and the fix will be included in the next release. Thanks for reporting!

juliasilge avatar Jan 09 '25 19:01 juliasilge