
support matrix write/read

Open mikejiang opened this issue 6 years ago • 4 comments

How easy (or difficult) is it to support matrix IO? I know I can coerce the data.frame from read.fst to a matrix, but I wonder if there is a more efficient way of doing it at the fst IO level.

mikejiang, May 21 '18 17:05

Hi @mikejiang, thanks for your feature request! For reading and writing complete matrices to disk, the existing framework can be used. So supporting that would not be very difficult. For example:

m <- matrix(sample(1:100, 1000000, replace = TRUE), nrow = 1000)

# matrix is just a vector with a 'dim' attribute
attributes(m)
#> $dim
#> [1] 1000 1000

# equivalent method to support writing full matrices
write_fst_matrix <- function(m, file_name) {
  
  # store and remove dims attribute
  dim <- attr(m, "dim")
  attr(m, "dim") <- NULL
  
  fst::write_fst(data.frame(Data = m), file_name)
  saveRDS(dim, paste0(file_name, ".dim"))  # serialize dim
}

# equivalent method to support reading full matrices
read_fst_matrix <- function(file_name) {
  
  ft <- fst::read_fst(file_name)  # single column data.frame
  dim <- readRDS(paste0(file_name, ".dim"))  # retrieve dim

  m <- ft[[1]]  
  attr(m, "dim") <- dim
  
  m
}

# write matrix efficiently
write_fst_matrix(m, "1.fst")

# read matrix efficiently
m <- read_fst_matrix("1.fst")

(In a real implementation the dim data would be stored in the fst file itself; the separate .dim file is only used to keep this equivalent example simple.)

In this example, the underlying vector of the matrix is serialized, so the on-disk file has the same layout as the in-memory vector data of the matrix.
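To make the layout concrete, here is a minimal base R sketch of how a matrix maps onto its underlying column-major vector. The helper `linear_index` is a hypothetical name introduced only for this illustration.

```r
# A small matrix so the layout is easy to inspect
m <- matrix(1:12, nrow = 3)

# Dropping 'dim' exposes the underlying column-major vector
v <- as.vector(m)

# Hypothetical helper: element (i, j) sits at position i + (j - 1) * n_row
linear_index <- function(i, j, n_row) i + (j - 1L) * n_row

i <- 2L
j <- 3L
stopifnot(m[i, j] == v[linear_index(i, j, nrow(m))])

# A whole column is one contiguous run of the vector ...
col_3 <- v[seq(linear_index(1L, 3L, nrow(m)), length.out = nrow(m))]

# ... while a row is spread out with a stride of nrow(m)
row_2 <- v[seq(2L, by = nrow(m), length.out = ncol(m))]
```

Columns are contiguous and rows are strided, which is why reading a subset of rows touches every column.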

But as we want to allow random access in both columns and rows, things get slightly more complicated. For example, suppose we want to take a subset of the rows of a matrix. When reading that data from disk, fst has to perform a seek operation for each column in the matrix (as it needs to skip some data). Seek operations are relatively expensive, and matrices tend to have many more columns than data.frames (so many more seek operations).
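As a rough illustration of that access pattern with the current API, the sketch below stores each matrix column as a data.frame column and reads back only a range of rows using the `from` and `to` arguments of `read_fst`. It assumes the fst package is installed; the matrix size and the temporary file are arbitrary choices for the example.

```r
library(fst)  # assumed to be installed

m <- matrix(sample(1:100, 10000, replace = TRUE), nrow = 100)

# One data.frame column per matrix column
path <- tempfile(fileext = ".fst")
write_fst(as.data.frame(m), path)

# Read rows 11 to 20 only; every one of the 100 columns has to be visited
rows <- read_fst(path, from = 11, to = 20)

# Convert back to a plain matrix without the generated column names
sub <- unname(as.matrix(rows))

stopifnot(all(sub == m[11:20, ]))
```

This works for row subsets today, but the number of columns visited grows with the width of the matrix.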

A more optimized way of storing the data would be in 2-dimensional blocks (see this comment of @PeteHaitch). Internally, fst uses 16 kB blocks of data (e.g. 4096 integers). If each of these blocks represented a 256 x 16 piece of the original (integer) matrix, 16 times fewer seeks would be required. On the other hand, that would require shuffling the data in memory.
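The shuffling step could look roughly like the base R sketch below. The helpers `block_order`, `to_blocked` and `from_blocked` are hypothetical and not part of fst; the sketch also assumes the matrix dimensions are exact multiples of the block dimensions, which a real implementation could not rely on.

```r
# Hypothetical helper: permutation that makes each block_rows x block_cols
# tile of a column-major matrix one contiguous run of the vector
block_order <- function(n_row, n_col, block_rows, block_cols) {
  stopifnot(n_row %% block_rows == 0, n_col %% block_cols == 0)
  idx <- matrix(seq_len(n_row * n_col), nrow = n_row)
  tiles <- list()
  for (cs in seq(1, n_col, by = block_cols)) {
    for (rs in seq(1, n_row, by = block_rows)) {
      tile <- idx[rs:(rs + block_rows - 1), cs:(cs + block_cols - 1), drop = FALSE]
      tiles[[length(tiles) + 1L]] <- as.vector(tile)
    }
  }
  unlist(tiles)
}

# Shuffle the matrix data into blocked order (256 x 16 = 4096 values per tile)
to_blocked <- function(m, block_rows = 256, block_cols = 16) {
  as.vector(m)[block_order(nrow(m), ncol(m), block_rows, block_cols)]
}

# Undo the shuffle and restore the dim attribute
from_blocked <- function(v, dim, block_rows = 256, block_cols = 16) {
  m <- vector(typeof(v), length(v))
  m[block_order(dim[1], dim[2], block_rows, block_cols)] <- v
  attr(m, "dim") <- dim
  m
}

m <- matrix(sample(1:100, 512 * 32, replace = TRUE), nrow = 512)
v <- to_blocked(m)

# The first 4096 values are exactly the top-left 256 x 16 tile
stopifnot(identical(v[1:4096], as.vector(m[1:256, 1:16])))

# The round trip restores the original matrix
stopifnot(identical(from_blocked(v, dim(m)), m))
```

With this ordering, a block of 4096 integers covers 16 columns at once, at the cost of the extra permutation on every write and read.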

So the optimal way to organize the data for storing a full matrix as a single vector is identical to the way the data is stored now for data.frames. But for optimal random access to matrices, a blocked format would be preferred.

Do your matrices usually have many more rows than columns? The average dimensions would determine the optimal size of the chosen blocks (or perhaps they can be determined from the dim attribute when writing).

thanks

MarcusKlik, May 21 '18 21:05

The rownames and colnames are lost in the above write_fst_matrix and read_fst_matrix. How can they be preserved?

ccshao, Nov 23 '18 10:11

Hi @ccshao, thanks for your question!

The example above could be expanded to include column- and row-names:

# equivalent method to support writing full matrices
write_fst_matrix <- function(m, file_name) {
  
  # store dim attribute (it is removed below, just before writing)
  dim <- attr(m, "dim")
  
  meta_data <- list(
    dim = dim,
    colnames = colnames(m),
    rownames = rownames(m)
  )

  # remove dim attribute, then serialize table and meta data
  attr(m, "dim") <- NULL
  fst::write_fst(data.frame(Data = m), file_name)
  saveRDS(meta_data, paste0(file_name, ".meta"))
}

# equivalent method to support reading full matrices
read_fst_matrix <- function(file_name) {
  
  ft <- fst::read_fst(file_name)  # single column data.frame
  meta_data <- readRDS(paste0(file_name, ".meta"))  # retrieve meta data
  
  m <- ft[[1]]  
  attr(m, "dim") <- meta_data$dim
  colnames(m) <- meta_data$colnames
  rownames(m) <- meta_data$rownames
  
  m
}

# define matrix
m <- matrix(sample(1:100, 1000000, replace = TRUE), nrow = 1000)
colnames(m) <- sample(LETTERS, 1000, replace = TRUE)
rownames(m) <- sample(LETTERS, 1000, replace = TRUE)

# write matrix efficiently
write_fst_matrix(m, "1.fst")

# read matrix efficiently
m <- read_fst_matrix("1.fst")
#> Loading required namespace: data.table

# result is a matrix
is.matrix(m)
#> [1] TRUE

# with col- and row- names
head(colnames(m))
#> [1] "W" "C" "L" "E" "R" "M"
head(rownames(m))
#> [1] "K" "B" "Q" "H" "Z" "Y"

(Note that I didn't include any error checking on the existence of files or the correctness of returned meta-data).
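For completeness, the missing checks could be sketched in base R along the lines below. Both `check_matrix_files` and `restore_matrix` are hypothetical helpers made up for this illustration, and the specific checks are only one possible choice.

```r
# Hypothetical helper: check that the data and meta files both exist
check_matrix_files <- function(file_name) {
  files <- c(file_name, paste0(file_name, ".meta"))
  missing <- !file.exists(files)
  if (any(missing)) {
    stop("missing file(s): ", paste(files[missing], collapse = ", "))
  }
  invisible(TRUE)
}

# Hypothetical helper: rebuild a matrix from a vector plus meta data,
# failing early when the two do not agree
restore_matrix <- function(values, meta_data) {
  dim <- meta_data$dim
  if (!is.numeric(dim) || length(dim) != 2L) {
    stop("meta data must contain a 'dim' of length 2")
  }
  if (prod(dim) != length(values)) {
    stop("'dim' does not match the number of stored values")
  }
  if (!is.null(meta_data$rownames) && length(meta_data$rownames) != dim[1]) {
    stop("'rownames' length does not match the number of rows")
  }
  if (!is.null(meta_data$colnames) && length(meta_data$colnames) != dim[2]) {
    stop("'colnames' length does not match the number of columns")
  }
  attr(values, "dim") <- dim
  rownames(values) <- meta_data$rownames
  colnames(values) <- meta_data$colnames
  values
}

# Usage with in-memory data
meta_data <- list(dim = c(2L, 3L), rownames = c("a", "b"), colnames = NULL)
m_small <- restore_matrix(1:6, meta_data)
```

In read_fst_matrix, `check_matrix_files(file_name)` would run before reading, and `restore_matrix(ft[[1]], meta_data)` would replace the manual attribute assignments.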

But as discussed above, it would be much better and more efficient if fst could serialize a matrix internally using a blocked format for faster random access...

Hope this helps you, at least until the matrix API is available in the fst package!

fstpackage, Nov 23 '18 22:11

Thanks, this works very well!

ccshao, Nov 24 '18 09:11