fst
support matrix write/read
How easy (or difficult) would it be to support matrix IO? I know I can coerce the data.frame returned by read.fst to a matrix, but I wonder if there is a more efficient way of doing it at the fst IO level.
Hi @mikejiang, thanks for your feature request! For reading and writing complete matrices to disk, the existing framework can be used. So supporting that would not be very difficult. For example:
m <- matrix(sample(1:100, 1000000, replace = TRUE), nrow = 1000)
# matrix is just a vector with a 'dim' attribute
attributes(m)
#> $dim
#> [1] 1000 1000
# equivalent method to support writing full matrices
write_fst_matrix <- function(m, file_name) {
  # store and remove dims attribute
  dim <- attr(m, "dim")
  attr(m, "dim") <- NULL
  fst::write_fst(data.frame(Data = m), file_name)
  saveRDS(dim, paste0(file_name, ".dim"))  # serialize dim
}
# equivalent method to support reading full matrices
read_fst_matrix <- function(file_name) {
  ft <- fst::read_fst(file_name)  # single column data.frame
  dim <- readRDS(paste0(file_name, ".dim"))  # retrieve dim
  m <- ft[[1]]
  attr(m, "dim") <- dim
  m
}
# write matrix efficiently
write_fst_matrix(m, "1.fst")
# read matrix efficiently
m <- read_fst_matrix("1.fst")
(In a real implementation the dim data would be stored inside the fst file itself; this is just an equivalent example.)
In this example, the underlying vector of a matrix is serialized, so the on-disk file has the same layout as the in-memory vector data of the matrix.
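Because the matrix is stored column by column in a single fst column, a contiguous range of matrix columns corresponds to a contiguous range of rows in the file, which could be read with the `from` and `to` arguments of `fst::read_fst`. The following is only a sketch of that idea: the helper names `column_range_to_rows` and `read_fst_matrix_columns` are hypothetical, and the code assumes the file and `.dim` side file written by `write_fst_matrix` above.

```r
# Hypothetical helper: map a range of matrix columns to the row range of the
# single-column fst file (column-major layout, as written by write_fst_matrix)
column_range_to_rows <- function(dim, first_col, last_col) {
  stopifnot(length(dim) == 2, first_col >= 1,
            last_col >= first_col, last_col <= dim[2])
  n_row <- dim[1]
  c(from = (first_col - 1) * n_row + 1, to = last_col * n_row)
}

# Hypothetical reader: assumes the ".dim" side file from the example above
read_fst_matrix_columns <- function(file_name, first_col, last_col) {
  dim <- readRDS(paste0(file_name, ".dim"))
  rows <- column_range_to_rows(dim, first_col, last_col)
  ft <- fst::read_fst(file_name, from = rows[["from"]], to = rows[["to"]])
  matrix(ft[[1]], nrow = dim[1])
}

# columns 3 to 5 of a 1000 x 1000 matrix occupy file rows 2001 to 5000
column_range_to_rows(c(1000, 1000), 3, 5)
```

A column subset therefore needs only a single contiguous read with this layout; a row subset does not.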
But as we want to allow random access in both columns and rows, things get slightly more complicated. For example, suppose we want to take a subset of the rows of a matrix. When reading that data from disk, fst has to perform a seek operation for each column in the matrix (as it needs to skip some data). Seek operations are relatively expensive, and matrices tend to have many more columns than data.frames (so many more seek operations).
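To make the cost concrete, the number of separate contiguous stretches that a row subset occupies in the column-major vector can be counted in base R. This is an illustration of the layout only, not of fst internals; `count_contiguous_runs` is a hypothetical helper.

```r
# Count the contiguous runs that rows `rows` occupy in the column-major
# vector of an n_row x n_col matrix. Each run after the first implies a skip.
count_contiguous_runs <- function(n_row, n_col, rows) {
  # position of element (i, j) in the underlying vector: i + (j - 1) * n_row
  pos <- as.vector(outer(rows, (seq_len(n_col) - 1) * n_row, `+`))
  # a new run starts wherever consecutive positions are not adjacent
  1 + sum(diff(pos) != 1)
}

# rows 101 to 200 of a wide matrix: one run per column
count_contiguous_runs(1000, 1000, 101:200)
#> [1] 1000

# the same rows of a tall, narrow table: far fewer runs
count_contiguous_runs(100000, 10, 101:200)
#> [1] 10
```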
A more optimized way of storing the data would be in 2-dimensional blocks (see this comment of @PeteHaitch). Internally, fst uses 16 kB blocks of data (e.g. 4096 integers). If each of these blocks represented a 256 x 16 piece of the original (integer) matrix, 16 times fewer seeks would be required. On the other hand, that would require shuffling the data in memory.
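The in-memory shuffle for such a blocked layout can be sketched in base R with `aperm`. This is only an illustration of the reordering, not how fst would implement it; the functions `to_blocked` and `from_blocked` are hypothetical, and the sketch assumes the matrix dimensions are exact multiples of the tile size.

```r
# Reorder a matrix so that each block_rows x block_cols tile is contiguous
# (column-major inside a tile, tiles themselves ordered column-major).
# Assumption: nrow(m) and ncol(m) are exact multiples of the tile size.
to_blocked <- function(m, block_rows = 256, block_cols = 16) {
  stopifnot(nrow(m) %% block_rows == 0, ncol(m) %% block_cols == 0)
  a <- m
  # split row and column indices into (within-tile, tile) pairs
  dim(a) <- c(block_rows, nrow(m) / block_rows,
              block_cols, ncol(m) / block_cols)
  # move both within-tile indices to the front
  as.vector(aperm(a, c(1, 3, 2, 4)))
}

# Inverse operation: restore the original column-major matrix
from_blocked <- function(v, n_row, n_col, block_rows = 256, block_cols = 16) {
  a <- v
  dim(a) <- c(block_rows, block_cols, n_row / block_rows, n_col / block_cols)
  m <- aperm(a, c(1, 3, 2, 4))
  dim(m) <- c(n_row, n_col)
  m
}

m <- matrix(seq_len(512 * 32), nrow = 512)
v <- to_blocked(m)

# the first 4096 values form the top-left 256 x 16 tile
identical(v[1:4096], as.vector(m[1:256, 1:16]))

# the round trip restores the matrix
identical(from_blocked(v, 512, 32), m)
```

With this ordering, a subset of rows within one tile needs one skip per tile column block rather than one per matrix column, at the price of the extra permutation when writing and reading.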
So the optimal way to organize the data for storing a full vector is identical to the way the data is stored now for data.frames. But for optimal random access to matrices, a blocked format would be preferred.
Do your matrices usually have many more rows than columns? The average dimensions would determine the optimal size of the chosen blocks (or perhaps the block size can be determined from the dim attribute when writing).
thanks
The rownames and colnames are lost in the above write_fst_matrix and read_fst_matrix. How can they be preserved?
Hi @ccshao, thanks for your question!
The example above could be expanded to include column- and row-names:
# equivalent method to support writing full matrices
write_fst_matrix <- function(m, file_name) {
  # collect dims and names before removing the dim attribute
  dim <- attr(m, "dim")
  meta_data <- list(
    dim = dim,
    colnames = colnames(m),
    rownames = rownames(m)
  )
  # serialize table and meta data
  attr(m, "dim") <- NULL
  fst::write_fst(data.frame(Data = m), file_name)
  saveRDS(meta_data, paste0(file_name, ".meta"))
}
# equivalent method to support reading full matrices
read_fst_matrix <- function(file_name) {
  ft <- fst::read_fst(file_name)  # single column data.frame
  meta_data <- readRDS(paste0(file_name, ".meta"))  # retrieve meta data
  m <- ft[[1]]
  attr(m, "dim") <- meta_data$dim
  colnames(m) <- meta_data$colnames
  rownames(m) <- meta_data$rownames
  m
}
# define matrix
m <- matrix(sample(1:100, 1000000, replace = TRUE), nrow = 1000)
colnames(m) <- sample(LETTERS, 1000, replace = TRUE)
rownames(m) <- sample(LETTERS, 1000, replace = TRUE)
# write matrix efficiently
write_fst_matrix(m, "1.fst")
# read matrix efficiently
m <- read_fst_matrix("1.fst")
#> Loading required namespace: data.table
# result is a matrix
is.matrix(m)
#> [1] TRUE
# with col- and row- names
head(colnames(m))
#> [1] "W" "C" "L" "E" "R" "M"
head(rownames(m))
#> [1] "K" "B" "Q" "H" "Z" "Y"
(Note that I didn't include any error checking on the existence of files or the correctness of returned meta-data).
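For completeness, the missing checks could look roughly like the sketch below. Both `check_matrix_meta` and `read_fst_matrix_checked` are hypothetical names, and the code assumes the same file layout as the example above (an fst file plus a `.meta` side file written with `saveRDS`).

```r
# Hypothetical helper: verify that the meta data is consistent with the
# number of values read from the fst file
check_matrix_meta <- function(meta_data, n_values) {
  dim <- meta_data$dim
  if (!is.numeric(dim) || length(dim) != 2 || anyNA(dim) || any(dim < 0)) {
    stop("meta data does not contain a valid 'dim' entry")
  }
  if (prod(dim) != n_values) {
    stop("dimensions do not match the number of stored values")
  }
  if (!is.null(meta_data$rownames) && length(meta_data$rownames) != dim[1]) {
    stop("number of row names does not match the number of rows")
  }
  if (!is.null(meta_data$colnames) && length(meta_data$colnames) != dim[2]) {
    stop("number of column names does not match the number of columns")
  }
  invisible(TRUE)
}

# Hypothetical reader with basic checks on files and meta data
read_fst_matrix_checked <- function(file_name) {
  meta_file <- paste0(file_name, ".meta")
  if (!file.exists(file_name)) stop("file not found: ", file_name)
  if (!file.exists(meta_file)) stop("meta data file not found: ", meta_file)
  meta_data <- readRDS(meta_file)
  values <- fst::read_fst(file_name)[[1]]
  check_matrix_meta(meta_data, length(values))
  m <- values
  attr(m, "dim") <- meta_data$dim
  colnames(m) <- meta_data$colnames
  rownames(m) <- meta_data$rownames
  m
}
```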
But as discussed above, it would be much better and more efficient if fst could serialize a matrix internally using a blocked format for faster random access...
Hope this helps you, at least until the matrix API is available in the fst package!
Thanks, this works very well!