cbind table
Hi, I have datasets of ~1TB per project, which is too large for 32GB of RAM. I was wondering if there's a way to cbind fst tables in a hybrid way?
Basically, implement $<-.fst_table so that it appends the new column to the existing file?
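For reference, a minimal sketch of what the current API seems to require, i.e. reading the whole table, adding the column in memory, and rewriting the file; the file name and new column are placeholders, and this full-RAM round trip is exactly what an on-disk append would avoid:
# current workaround (placeholder names): needs the entire table plus the
# new column in RAM, then rewrites the whole file
dat <- fst::read_fst("data.fst")
dat$new_signal <- rnorm(nrow(dat))   # stand-in for the real new column
fst::write_fst(dat, "data.fst")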
Duplicate of #153.
Hi @dipterix, thanks for your question!
Yes, the ability to column- and row-bind multiple fst files into a single file is planned but not implemented yet. I'm interested in the details of your specific use case; is your source data already in a series of fst files, or do you have the data in an alternate source such as a remote database or a batch of csv files?
Most analyses can be split into multiple sub-analyses that each use only a limited number of columns, which are combined at a later stage. Is there a specific reason why having all data in a single fst file is important for your case?
thanks!
@MarcusKlik I have the data in other formats, but I also keep an fst copy as a cache. To describe my current data, think of time series: the data has 1000 columns, and each column is a 1GB signal that can be preprocessed in parallel. However, when I read the data, I'm always reading the same rows across all columns. I thought it might be faster to load just these rows if the lower-level code supported it.
For example, my experiment has 300 trials; each trial starts at a random row (which cannot be determined before slicing, so I must store the whole data and slice it on the fly) and lasts for 20000 rows. What I want to do is read 250 of the columns at a time and reshape them into a 300 x 20000 x 250 tensor in RAM.
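For illustration, a minimal sketch of the target shape, assuming one fst file per column as described next; channel_files and trial_start are placeholder names for the 250 selected files and the 300 per-trial start rows:
# hypothetical sketch: fill a 300 x 20000 x 250 tensor one trial window at a time
# (channel_files and trial_start are placeholders, not objects from this thread)
tensor <- array(NA_real_, dim = c(300, 20000, 250))
for (ch in seq_len(250)) {
  for (tr in seq_len(300)) {
    tensor[tr, , ch] <- fst::read_fst(
      channel_files[ch],
      from = trial_start[tr],
      to = trial_start[tr] + 20000 - 1
    )[[1]]
  }
}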
Right now I store each column as one file, and here are the two ways I tried to load the data:
- For each of the 250 columns, for each trial, load a 20000-row vector using fst::read_fst(..., from, to)
- For each column, load all 1GB of data, then slice it into a 300 x 20000 matrix
For a single column, the second method is faster. However, when I load more columns, R triggers gc(), which later takes ~400 ms for each column.
Therefore I switched to the first method, but that's slow too. I guess this is because I have to keep switching between files: instead of reading from 250 files once each, I'm effectively doing 250 x 300 reads?
In summary, my real problem is that there's no proper way to load such a large file from high-level R code. Either I have to find ways to control gc manually, or I have to look for a low-level implementation/improvement. Also, a one-file structure is easier to share, and I can always assume all the columns exist. Right now, if someone removes one file, the reading process (which takes minutes) raises errors. It might be their fault for removing files, but I think a single file is a win-win.
I'll attach the speed test results later :)
Hi @dipterix, thanks a lot for taking the time to explain your analysis!
The fst format is a columnar format, so reading as many rows as possible during a single read should be the fastest solution (your second option).
From your answer I understand that garbage collection of a single large vector (plus your sliced matrix) takes a relatively long time. Longer than garbage collection of the 250 x 20000 datasets that are loaded with option 1?
Sometimes garbage collection seems to take longer because the RAM is almost fully exhausted. To test that, you could read slightly smaller datasets in memory and force garbage collection after each step (by using gc() after each read). That way, you'll be sure the OS is not using any swap space to increase the amount of virtual memory.
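A minimal sketch of such a test, assuming a file data.fst with at least 1e8 rows and an arbitrary chunk size of 1e7 rows:
# read the file in 10 chunks and force garbage collection after each one,
# so memory use stays bounded and the OS never needs to swap
for (chunk in 0:9) {
  part <- fst::read_fst("data.fst", from = chunk * 1e7 + 1, to = (chunk + 1) * 1e7)
  # ... process 'part' here ...
  rm(part)
  gc()
}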
Although having all data in a single fst file will be more convenient as you say (for sharing and completeness), I think the speed gains from accessing a single file as compared to multiple files will be minimal in your particular case.
What would really help you I think is a single buffer that you can use multiple times. Something like:
# write some dataset with 1e8 rows
data <- data.frame(X = sample(1:10, 1e8, replace = TRUE), Y = sample(1:100, 1e8, replace = TRUE))
fst::write_fst(data, "data.fst")
# read first chunk
buffer <- fst::read_fst("data.fst", from = 1, to = 1e7)
# do some magic here with the buffer contents
for (chunk in 1:9) {
  buffer <- fst::insert_fst(buffer, "data.fst", chunk * 1e7 + 1, (chunk + 1) * 1e7)
  # do some magic here with the buffer contents
}
The idea behind insert_fst() would be that existing vectors are overwritten in memory, so no garbage collection would be required between reads, significantly speeding up your experiment (right now, the garbage collection takes about as much time as the 1 GB reads).
In short, I think that when working with such large datasets, row- and column-binding features for writing the data to disk will be very useful. And for reading, a method like insert_fst() could help a lot in keeping memory requirements low.
(I realize this won't help you too much now while you need it :-))
Thanks!
That'll be great. On the other hand, I was wondering how hard it would be to use a rows argument instead of from and to in read_fst, and to support reading non-consecutive but ordered rows on the C/C++ side?
Like:
fst::read_fst('data.fst', rows = c(1:100, 1004:2000))
Subsetting in C++ might be faster than in R and much lighter on memory.
Hi @dipterix, yes, using a row selector during reading will be faster and much more memory efficient. Your question is also related to this issue, which is about the use of a row selector for reading from and writing to disk (both are very useful).
That feature request is about random row selectors. For reading, ordered row selectors (like yours) are somewhat easier to implement, because the row index doesn't have to be sorted first before reading from disk.
Going further, it might also be useful to detect ordered sequences (like your 1:100), to be able to read data even more efficiently:
# note: windows only at the moment (some ALTREP functions are not
# exported on linux/OSX), requires R >= 3.6
devtools::install_github("fstpackage/lazyvec", ref = "develop")
# define ordered integer sequence
rows <- 1:100
# for R >= 3.5, this is implemented as an ALTREP vector
lazyvec::is_altrep(rows)
#> [1] TRUE
# with base ALTREP class 'compact_intseq'
lazyvec::altrep_class(rows)
#> [1] "compact_intseq"
Ordered sequences like 1:100 are implemented as ALTREP vectors for R >= 3.5 and can be detected as such. If fst were to read the ALTREP metadata, then e.g.
fst::read_fst('data.fst', rows = 5:100)
can be (internally) converted to
fst::read_fst('data.fst', from = 5, to = 100)
increasing efficiency. Combining multiple ALTREP sequences won't work however:
# combine two ALTREP vectors into a single vector
rows <- c(1:100, 1004:2000)
# the result is not an ALTREP anymore (full expansion)
lazyvec::is_altrep(rows)
#> [1] FALSE
So in your case, a full scan (C++) of the row selector is needed to determine which rows to drop. But as you say, that will still be much faster than the current implementation where a larger vector is read and subsetted afterwards:
fst::fst("data.fst")[c(1:100, 1004:2000), ]
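As a possible stop-gap until a native rows argument exists, an ordered row selector can be collapsed into contiguous from/to ranges in plain R and read range by range; a minimal sketch (read_fst_rows is not part of the fst API, just an illustration):
# split an ordered row selector into contiguous runs, read each run with
# from/to, and bind the pieces together
read_fst_rows <- function(path, rows, ...) {
  run_id <- cumsum(c(1L, diff(rows) != 1L))   # start a new run at every gap
  pieces <- lapply(split(rows, run_id), function(run) {
    fst::read_fst(path, from = run[1], to = run[length(run)], ...)
  })
  do.call(rbind, pieces)
}
# e.g. read_fst_rows("data.fst", c(1:100, 1004:2000)) issues only two disk reads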
Thanks, implementing row selection during reading and writing is high on the priority list!
For those who are interested in how I compared these two ways of loading data, here are the profiling results.
Subset data when reading via fst
#### Subset using read_fst(..., from, to)
profvis::profvis({
  print(system.time({
    vapply(row_start, function(idx){
      as.matrix(fst::read_fst(file, from = idx, to = idx + 199))
    }, FUN.VALUE = re)
  }))
})
#>    user  system elapsed
#>   0.075   0.027   0.102

This method finished almost instantly.
Load all data and then subset for each column
In this scenario, suppose I can only hold one column in RAM at a time (for example, I generate 200GB of data with 80 columns on a server, and the client only has 4GB of RAM).
#### Subset in-memory
profvis::profvis({
  print(system.time({
    lapply(seq_len(10), function(ii){
      fst = fst::read_fst(file)
      vapply(row_start[ii, ], function(idx){
        as.matrix(fst[idx:(idx + 199), ])
      }, FUN.VALUE = re)
    })
  }))
})
#>    user  system elapsed
#> 345.159  63.991  57.580
This one took almost one minute.
The profiling timer seems incorrect. I guess this is because I ran with 8 threads, so the reported time is about 8x what it should be. gc() is expensive in R (~4 sec total) compared to the previous method (~20 ms total), and the data I'm actually targeting is 1000x larger. In addition, memory usage fluctuates a lot.

Here is how I generated test data.
#### fst reading test
x = rnorm(3e8); dim(x) = c(3e6, 100); x = as.data.frame(x)
pryr::object_size(x)
#> 2.4 GB
#### Create a temp file for testing
file = tempfile(); fst::write_fst(x, file, compress = 100)
#### Generate row indices
row_start = sample(3e6-100, 20)
re = array(0, c(200, 100))
Hi @dipterix, thanks for sharing your benchmarks!
Yes, system.time() seems to take into account the time spent by all the threads, so that's not very useful for multi-threaded code! Package microbenchmark does not have these problems and will yield better results (with more resolution).
From your code, it seems that copying takes a long time. And if memory is almost full, it can take even longer because of R's relatively slow garbage collection. In the test below I compare the speed of reading from a fst file to taking a subset. Copying seems a factor of 2 faster (as you would expect), but I still have plenty of RAM left...
# fst reading test
x = data.frame(X = sample(1:1000, 1e8, replace = TRUE))
file = tempfile()
fst::write_fst(x, file)
random_rows <- sample(1e8, 100)
# this will result in 500 loads
microbenchmark::microbenchmark({
  lapply(1:100, function(z) {
    y <- fst::read_fst(file, from = random_rows[z], to = random_rows[z] + 999999)
  })
}, times = 5)
#> Unit: milliseconds
#> expr
#> { lapply(1:100, function(z) { y <- fst::read_fst(file, from = random_rows[z], to = random_rows[z] + 999999) }) }
#>      min       lq     mean   median       uq      max neval
#> 704.9919 708.8402 838.7643 863.5971 890.9763 1025.416     5
fst_data <- fst::read_fst(file)
# this will result in 500 copies
microbenchmark::microbenchmark({
  lapply(1:100, function(z) {
    y <- fst_data[random_rows[z]:(random_rows[z] + 999999), ]
  })
}, times = 5)
#> Unit: milliseconds
#> expr
#> { lapply(1:100, function(z) { y <- fst_data[random_rows[z]:(random_rows[z] + 999999), ] }) }
#>      min       lq     mean   median       uq      max neval
#> 385.3657 434.4636 429.8669 440.5715 441.5131 447.4206     5
It could also be that in your second example, R has more trouble allocating memory because it needs to allocate RAM for the complete dataset. When you read smaller chunks from disk, you only need to allocate smaller RAM sections. And when RAM is almost full, that might take more time.
In the end, I think reading the complete set into memory and then using chunks from that will be slower than just reading the smaller chunks directly. The overhead of file access is small, especially for medium-sized chunks.
Perhaps it's possible to write your source data as a single column instead of casting the matrix to a data.frame (a matrix is just a single large vector with some meta-data). That would avoid all the casting and might speed things up?
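A small sketch of that single-column idea, with placeholder sizes and file name; the matrix is flattened (column-major) before writing and its dimensions are restored after reading:
# store a matrix as one flat column and restore the dims after reading
m <- matrix(rnorm(3e6), nrow = 3e4)             # stand-in for a real signal matrix
fst::write_fst(data.frame(value = as.vector(m)), "signal.fst")
v <- fst::read_fst("signal.fst")$value
dim(v) <- c(3e4, 100)                           # back to the original 3e4 x 100 matrix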
In my test, I load one column and subset it each time. Consider a file that is ~200GB with 80 columns (2.5GB per column) while your RAM is only 4GB: you can only load one column at a time. Loading all the data and then subsetting is the fastest approach, but that's not possible in this scenario, since fst_data <- fst::read_fst(file) would blow up the RAM.
I think it's the column selection that slows down the whole process.
@MarcusKlik thank you for developing such a great package, and I appreciate the recently published fstcore package. I solved my issue by implementing C++-level control with fstcore (lazyarray). This issue can be closed.
Hi @dipterix, thanks for the heads up and great to hear that direct usage of the fstcore API works for you!