fst
Cannot save tibble where one of the columns is a list of some kind
Mark, what an awesome, super fast package! I encountered a new issue when one of the columns is a list (instead of a vector of an atomic type). Tibbles (aka "data_frame") allow columns to be lists of arbitrary types, provided they have the same length as the other columns, which are regular vectors. For example:
# note: data_frame() is the same as tibble()
myDf <- tibble::data_frame(c1=c(1,2,3), c2=list(c(4,5),c(6,7,8),c(9,10,11,12)))
myDf
#> # A tibble: 3 x 2
#>      c1 c2
#>   <dbl> <list>
#> 1     1 <dbl [2]>
#> 2     2 <dbl [3]>
#> 3     3 <dbl [4]>
write.fst() produces an error on account of the list column
fst::write.fst(myDf, "test.df")
Error in fstStore(path, x, as.integer(compress)) : Unknown type found in column.
Any help fixing this would be greatly appreciated
Hi @carioca67, thanks a lot! Indeed, fst does not support list-type columns yet. But it's definitely planned for one of the next releases (see also #12 and #20).
The list type poses somewhat of a design challenge, since fst is a pure C++ library at its core. A list column can have elements of any type, so in general each element should be considered a serialized R object. Such objects have no meaning in the general C++ core library (and would be useless when loading from Python, for example).
However, I was planning to define a type like blob, which can be found in most databases. A blob would correspond to a raw vector in R (and any object in R can be serialized to a raw vector). In effect, fst would be calling serialize on each element of the list column:
library(data.table)
dt <- data.table(A = list(1, "string", 3.141592)) # list column
sapply(dt$A, serialize, NULL) # serialize each element
#> [[1]]
#> [1] 58 0a 00 00 00 02 00 03 04 01 00 02 03 00 00 00 00 0e 00 00 00 01 3f
#> [24] f0 00 00 00 00 00 00
#>
#> [[2]]
#> [1] 58 0a 00 00 00 02 00 03 04 01 00 02 03 00 00 00 00 10 00 00 00 01 00
#> [24] 04 00 09 00 00 00 06 73 74 72 69 6e 67
#>
#> [[3]]
#> [1] 58 0a 00 00 00 02 00 03 04 01 00 02 03 00 00 00 00 0e 00 00 00 01 40
#> [24] 09 21 fa fc 8b 00 7a
(and the raw vectors are serialized to fst's blob type).
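As a rough illustration of the idea above, the sketch below serializes each element of a list column to its own raw vector and restores it again. It uses base R only; how fst would actually store the resulting raw vectors is not shown and is not implied here.

```r
# List column with elements of different types (same values as the example above)
col <- list(1, "string", 3.141592)

# Serialize each element to a raw vector; lapply always returns a list
blobs <- lapply(col, serialize, connection = NULL)

# Each blob is a plain raw vector, independent of the others
all_raw <- all(vapply(blobs, is.raw, logical(1)))

# Random access: restore only the second element
second <- unserialize(blobs[[2]])

# Restore the whole column and compare with the original
restored <- lapply(blobs, unserialize)
identical(restored, col)
```

Because every element is serialized on its own, a single row can be restored without touching the others.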
A language attribute could signal a serialized object in a specific language and appropriate warnings could be generated when reading from a fst file that contains serialized objects generated from within a different language.
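A minimal sketch of how such a language tag might behave, written in plain R. The functions `make_blob()` and `read_blob()` and the field names `language` and `bytes` are hypothetical and are not part of fst; the "Python" reader is only simulated by passing a different tag.

```r
# Hypothetical blob wrapper: raw bytes plus a tag naming the producing language.
# Neither the structure nor the field names are part of fst.
make_blob <- function(x) {
  list(language = "R", bytes = serialize(x, connection = NULL))
}

# Hypothetical reader: deserialize only when the tags match,
# otherwise warn and hand back the raw bytes untouched
read_blob <- function(blob, reader_language = "R") {
  if (!identical(blob$language, reader_language)) {
    warning("blob was serialized from ", blob$language,
            "; returning raw bytes")
    return(blob$bytes)
  }
  unserialize(blob$bytes)
}

blob <- make_blob(c(4, 5))
same_language <- read_blob(blob)                             # original vector
other_language <- suppressWarnings(read_blob(blob, "Python")) # raw bytes only
```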
Because of the serialization of each element, a list column would be relatively slow to read and write (note also the extremely inefficient serialization of e.g. a single number to 30 bytes). But fst would allow for random (row) access to the list column elements, just as for other columns.
Anyway, data.frames, data.tables and tibbles all support list types, so indeed it is a very necessary feature to have.
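The per-element overhead can be measured directly with base R. The sketch below only counts bytes; the exact numbers depend on the serialization format version that the running R uses by default, so none are stated here.

```r
# Bytes produced when each element is serialized on its own
elements <- list(1, "string", 3.141592)
per_element <- vapply(elements,
                      function(x) length(serialize(x, NULL)),
                      integer(1))
per_element

# A double carries 8 bytes of payload; everything beyond that is
# header and type information repeated for every element
overhead_double <- per_element[1] - 8L
overhead_double
```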
Thanks for your request!
Hi Mark, thanks for taking a look at my request.
How does saveRDS() solve this problem?
Hi @carioca67, saveRDS only has to work for R, so the format uses a lot of R-specific data. But fst was designed to work from other languages as well (basically any language that can wrap a C++ library), and R-specific data (such as serialized list elements) are meaningless in languages other than R. We can still serialize a list element as a block of bytes (like a blob), but when read from Python it will be just that (e.g. Python won't know the element was time-series data).
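For comparison, a small sketch of the saveRDS() route with a list column, using base R only. The data frame mirrors the values from the original report; the temporary file path is just for illustration.

```r
# Data frame with a list column, built with base R only
df <- data.frame(c1 = c(1, 2, 3))
df$c2 <- list(c(4, 5), c(6, 7, 8), c(9, 10, 11, 12))

# saveRDS() writes the whole object in R's own serialization format
path <- tempfile(fileext = ".rds")
saveRDS(df, path)

# readRDS() restores it, list column included
restored <- readRDS(path)
identical(restored, df)

unlink(path)
```

The file is only meaningful to R, which is the trade-off described above.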
It might help to take a closer look at R's internal serialization code, perhaps it's possible to call that code directly from C++ to have faster serialization of the list elements 👍
Thanks!
I'm also looking for this feature. Is it planned for a release soon?
This is indeed a very interesting topic and would be a nice feature for the (already great) package fst.
"It might help to take a closer look at R's internal serialization code, perhaps it's possible to call that code directly from C++ to have faster serialization of the list elements"
@MarcusKlik, you may want to have a look at the package RApiSerialize (https://github.com/eddelbuettel/rapiserialize), which is used by the package qs, a powerful and fast alternative to readRDS()/saveRDS().
Hi @riccardoporreca, thanks, yes, I've studied RApiSerialize before and it's definitely very useful when serializing from C++. The downside is that it stores the metadata in each serialized element, if I'm not mistaken, so that leads to some overhead.
However, I could use RApiSerialize to serialize each block in the fst format. In that case the overhead would be added to each block but not to each element inside the block (so when reading, the full block needs to be deserialized, even if only a single element is requested).
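The block-wise trade-off can be sketched in plain R. This uses base `serialize()` rather than RApiSerialize, and the block size of 100 is a hypothetical value chosen for illustration, not fst's actual block size.

```r
col <- as.list(seq_len(1000))  # list column with 1000 small elements
block_size <- 100              # hypothetical block size, not fst's actual value

# Group consecutive rows into blocks and serialize each block once
block_id <- ceiling(seq_along(col) / block_size)
blocks <- lapply(split(col, block_id), serialize, connection = NULL)

# Total size: one header per block versus one header per element
bytes_blockwise <- sum(lengths(blocks))
bytes_elementwise <- sum(vapply(col,
                                function(x) length(serialize(x, NULL)),
                                integer(1)))

# Reading a single row means deserializing the whole block that holds it
row <- 250
block <- unserialize(blocks[[ceiling(row / block_size)]])
value <- block[[(row - 1) %% block_size + 1]]
```

Larger blocks reduce the header overhead further but increase the amount of data that has to be deserialized for a single-row read.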
Because (de-) serialization needs to be done on the main thread, the speed will be limited to the maximum (de-) serialization speed...