Feather.jl
Feather file causes segfault in R
Saving a DataFrame with Feather causes R to crash when reading the file. I am using Feather 0.5.1 with Julia 1.1. If I create a simple feather file with
using DataFrames, Feather
df = DataFrame(A = 1:8)
Feather.write("df.feather", df)
I get the following crash in R 3.5.1 with the package feather 0.3.2:
> library(feather)
Warning message:
package ‘feather’ was built under R version 3.5.2
> ir = read_feather("df.feather")
*** caught segfault ***
address 0x10ee5c0b0, cause 'memory not mapped'
Traceback:
1: openFeather(path)
2: feather(path)
3: read_feather("df.feather")
The problem seems to be https://github.com/JuliaData/FlatBuffers.jl/issues/38. One way to fix this is to pin the FlatBuffers package at version 0.4.0.
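For anyone who wants to try that workaround, here is a minimal sketch using the standard Pkg API. It assumes FlatBuffers 0.4.0 resolves alongside the other packages in your environment, which I have not verified for every setup.

```julia
using Pkg

# Install FlatBuffers at exactly 0.4.0, the version suggested above.
Pkg.add(PackageSpec(name = "FlatBuffers", version = "0.4.0"))

# Pin it so that later `Pkg.update()` calls do not move it to a newer release.
Pkg.pin("FlatBuffers")
```

Once the upstream issue is resolved, `Pkg.free("FlatBuffers")` removes the pin again.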
Feather files created with Julia are bigger than files holding the same dataset created with R. The Feather format is important for using multiple tools in an analytics workflow.
I'm not too surprised that the files created in Julia are bigger. As I recall, we've seen examples of some of the other writers automatically deciding to write dictionary encoded (i.e. compressed) columns. In this package we only do this if the original column is a CategoricalArray (i.e. already dictionary encoded). In some cases the resulting difference in file size can be quite large.
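As a rough illustration of the CategoricalArray route described above, the sketch below builds a column that is already dictionary encoded before writing it. The column values and the file name are made up for the example, and how much smaller the file gets will depend on the data.

```julia
using DataFrames, CategoricalArrays, Feather

# A column with few distinct values, stored as a CategoricalArray so that
# it is already dictionary encoded when it reaches Feather.write.
df = DataFrame(A = categorical(repeat(["low", "high"], 4)))

# Hypothetical output path, chosen only for this example.
Feather.write("df_categorical.feather", df)
```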
To do the same, we'd need some sort of heuristic for deciding when to automatically use dictionary encoded columns.
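One possible shape for such a heuristic, purely as a sketch: encode a column when the share of distinct values is small. The function name `should_dictionary_encode` and the 10% default threshold are assumptions made for illustration, not anything Feather.jl or the other writers actually do.

```julia
# Hypothetical helper, not part of Feather.jl: suggest dictionary encoding
# when distinct values make up at most `max_unique_ratio` of the column.
function should_dictionary_encode(col::AbstractVector; max_unique_ratio::Real = 0.1)
    n = length(col)
    n == 0 && return false          # nothing to gain on an empty column
    return length(unique(col)) / n <= max_unique_ratio
end

repetitive = repeat(["a", "b"], 50)   # 100 values, 2 distinct
distinct   = collect(1:100)           # 100 values, all distinct

should_dictionary_encode(repetitive)
should_dictionary_encode(distinct)
```

Calling `unique` scans the whole column, so a real implementation would probably want to sample or bail out early on large inputs.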
You could also support PooledArrays, and expect people to use that when they want to save memory.