fst
Unknown type found in column - "list" support
Is there any chance of "list" column support?
library(data.table)
library(fst)

nr_of_rows <- 100

dt <- data.table(
  Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
  Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
  Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
  Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)

cols <- c("Logical", "Integer", "Real")
dt <- dt[, lapply(.SD, function(x) list(x)), by = Factor, .SDcols = cols]
dt <- melt(dt, id.vars = c("Factor"))

# Store the data frame to disk
write.fst(dt, "dataset.fst")
Error in write_fst(x, path, compress, uniform_encoding) : Unknown type found in column.
Hi @robbig2871, thanks for your feature request!
Being able to store list elements would certainly be very useful. The thing is that a list element can contain any type of R object. Such an element would have no meaning in other languages (e.g. Python, Julia or C++).
What we could do is mark such a column with metadata:
- the specific language used to create the column
- the specific language version
- any other (language specific) features required
That should give fst enough information to determine if the column can be used for the current platform.
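As a rough illustration of that idea, the metadata could be a small record stored alongside the column. Everything in this sketch is hypothetical: the field names and the `make_list_column_meta()` and `can_read_column()` helpers are made up for illustration and are not part of fst.

```r
# Hypothetical metadata record for a list column (field names are illustrative only)
make_list_column_meta <- function(features = character(0)) {
  list(
    language = "R",
    language_version = as.character(getRversion()),
    features = features
  )
}

# Hypothetical check: can the current R session interpret the column?
can_read_column <- function(meta, available_features = character(0)) {
  identical(meta$language, "R") &&
    getRversion() >= package_version(meta$language_version) &&
    all(meta$features %in% available_features)
}

meta <- make_list_column_meta()
can_read_column(meta)
```

A reader on another platform would see `language = "R"` and could skip the column instead of failing on it.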
Because the element needs to be compressed by R itself, list columns can only be compressed using a single thread, so they would be much slower than other types of columns. You would have random (row) access though :-)
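To make the random-access point concrete, here is a minimal base R sketch, assuming each element is serialized separately and the byte offsets are kept next to the data. The layout and the `read_element()` helper are assumptions for illustration, not how fst stores anything today.

```r
# A list column whose elements are arbitrary R objects
values <- list(1:3, c(TRUE, NA), c(0.5, 1.5), "text")

# Serialize every element separately with the R API (single-threaded)
blobs <- lapply(values, serialize, connection = NULL)

# Sizes and byte offsets that a file format could store to locate each element
sizes <- lengths(blobs)
offsets <- cumsum(c(0, head(sizes, -1)))

# One contiguous raw buffer, as it might be laid out on disk
buffer <- do.call(c, blobs)

# Random access: restore only element i without touching the others
read_element <- function(i) {
  unserialize(buffer[(offsets[i] + 1):(offsets[i] + sizes[i])])
}

read_element(3)
```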
Additionally: for your use-case, the list elements are simple vectors:
dt[, sapply(value, typeof)]
#> [1] "logical" "logical" "logical" "logical" "logical" "logical" "logical"
#> [8] "logical" "logical" "logical" "integer" "integer" "integer" "integer"
#> [15] "integer" "integer" "integer" "integer" "integer" "integer" "double"
#> [22] "double" "double" "double" "double" "double" "double" "double"
#> [29] "double" "double"
It would be nice to implement those in the fst framework natively. That could be done in a language-agnostic manner, as with the other column types, which would be a really nice addition to the fst framework.
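One language-agnostic way to represent a list of same-typed vectors is a flat data vector plus the length of each element. The sketch below shows the idea in base R; `flatten_list()` and `rebuild_list()` are hypothetical helper names, and this is not a description of any existing fst format.

```r
# List column whose elements are all integer vectors (one of the three types above)
values <- list(c(4L, 8L), integer(0), c(15L, 16L, 23L))

# Flatten into one contiguous vector plus the length of each element
flatten_list <- function(x) {
  list(data = unlist(x, use.names = FALSE), lengths = lengths(x))
}

# Rebuild the list from the flat representation
rebuild_list <- function(flat) {
  # Using explicit factor levels keeps empty elements in the result
  groups <- factor(rep(seq_along(flat$lengths), flat$lengths),
                   levels = seq_along(flat$lengths))
  unname(split(flat$data, groups))
}

flat <- flatten_list(values)
restored <- rebuild_list(flat)
```

Both `flat$data` and `flat$lengths` are plain vectors, so any language that can read an integer column could reconstruct the elements.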
Related: sf package geometry columns (which are list-columns) presently cause write_fst to fail with the error message "Error in write_fst() : Unknown type found in column." Happy to open a new issue or to just leave this here. This package is great and would be a tremendous help for those of us working with large datasets that contain spatial information, since setting geometry is computationally costly.
Hi @pbaylis, thanks for your feature request!
By implementing sf geometry columns, fst could provide random access to the features, so that would be a nice feature for large datasets with spatial information. Especially later on, when we implement a feature to subset data in a fst file from a filter (e.g. Area > 10000) without loading the data into RAM first (on-disk subsetting).
The speed of accessing list columns will be comparatively low, however. The data in a list is not contiguous in memory, so moving it from RAM onto disk (and vice versa) requires random reads and writes in memory, which slows things down significantly. Also, serialization will have to be done using the R API, offering less opportunity for parallelism.
Should there also be a language-agnostic column type for storing features (perhaps the well-known binary format, WKB)?
Both random access (for sampling) and on-disk subsetting would be very useful features for my use cases. Too bad about serialization but so it goes. I'm not an expert in how sf operates so I'll defer to your better judgment on having a WKB column type.
As an additional comment for others who would like to use fst for spatial datasets, my current workaround is to save two datasets, one with the id columns + geometry (using saveRDS) and one with id variables + other columns (using write_fst).
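For anyone who wants to try this, a minimal sketch of the workaround follows. It uses a plain data frame with a generic list column as a stand-in for an sf object; the column names and file paths are made up for illustration.

```r
library(fst)

# Stand-in for an sf object: an id, an ordinary column and a list column
df <- data.frame(id = 1:3, area = c(10.5, 20.1, 7.3))
df$geometry <- list(c(0, 0), c(1, 1), c(2, 2))

rds_path <- tempfile(fileext = ".rds")
fst_path <- tempfile(fileext = ".fst")

# id + list column go to rds, id + regular columns go to fst
saveRDS(df[, c("id", "geometry")], rds_path)
write_fst(df[, c("id", "area")], fst_path)

# Read both parts back and reattach the list column by matching on id
geometry_part <- readRDS(rds_path)
restored <- read_fst(fst_path)
restored$geometry <- geometry_part$geometry[match(restored$id, geometry_part$id)]
```

With a real sf object, `sf::st_drop_geometry()` should give the attribute table to pass to `write_fst()`, while the geometry column goes into the rds file.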
I too am interested in getting write_fst and read_fst to handle sf objects. @pbaylis thanks for the great workaround suggestion of how to use saveRDS and write_fst together.
Hi @vlulla, thanks for your request!
Yes, being able to store lists would add a lot of flexibility to fst. When we have that, you can think of all kinds of applications such as storing geometry or a list of models and their results, images, etc.
All elements would also be accessible at random, which is a nice extra feature compared to storing the full list in native rds format.
Of course, using the R API to serialize each element would be very slow, mainly because only a single thread can be used. The result can be compressed with LZ4 or ZSTD, however, and that offers a large improvement in speed compared to the gzip, bzip2 or xz compressors available in base R.
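As a sketch of what that could look like per element, the example below serializes one object with the R API and compresses the raw bytes with both the base R compressors and fst's `compress_fst()`. The example object is made up, and no timings are claimed here; wrap the calls in `system.time()` to compare on your own data.

```r
library(fst)

# One list element, serialized with the R API
element <- list(model = "lm", coefficients = seq_len(1e4) / 10)
blob <- serialize(element, connection = NULL)

# Base R compressors
gz <- memCompress(blob, type = "gzip")
xz <- memCompress(blob, type = "xz")

# fst compressors operating on raw vectors
lz4 <- compress_fst(blob, compressor = "LZ4")
zstd <- compress_fst(blob, compressor = "ZSTD")

# Round trip: the decompressed bytes restore the original element
restored <- unserialize(decompress_fst(zstd))
identical(restored, element)
```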
geobuf is a popular binary format for GIS objects and is well-liked by the people behind {sf} and the r-spatial project.
https://github.com/mapbox/geobuf