Reading a partially copied file causes 'memory not mapped' crash
Copying a big file takes some time, and if the copy process is terminated before it finishes, the resulting file is corrupted. Unfortunately, if a user does not know the expected file size, it is hard to check the file's integrity without reading it. However, reading a partially copied file crashes the R session in the latest release of fst. Below is a simple reproducible example:
data <- data.frame(id = 1:1e8)
for (i in 1:5) {
  cat(i, "\n")
  data[[paste0("x", i)]] <- rnorm(1e8)
}
fst::write_fst(data, "~/data/fst-test.fst")
cd ~/data
cp fst-test.fst fst-test-1.fst
While the copy is running, press Ctrl+Z to suspend the cp process and then kill it.
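If interrupting cp by hand is inconvenient, a partial copy can also be produced deterministically from R. This is only a sketch; the truncate_copy helper and the 50% cutoff are illustrative and not part of the original report:

# Copy only the first `fraction` of the source file, in chunks, to mimic an
# interrupted copy (illustrative helper, not part of the original report).
truncate_copy <- function(src, dst, fraction = 0.5, chunk = 1e7) {
  n_total <- floor(file.size(src) * fraction)
  con_in  <- file(src, "rb")
  con_out <- file(dst, "wb")
  on.exit({ close(con_in); close(con_out) }, add = TRUE)
  copied <- 0
  while (copied < n_total) {
    n <- min(chunk, n_total - copied)
    writeBin(readBin(con_in, what = "raw", n = n), con_out)
    copied <- copied + n
  }
  invisible(copied)
}

truncate_copy("~/data/fst-test.fst", "~/data/fst-test-1.fst")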
Now start an R session; reading the partially copied file crashes it as follows:
> fst::read_fst("fst-test-1.fst")
Loading required namespace: data.table
*** caught segfault ***
address 0xffff80c10bb77458, cause 'memory not mapped'
Traceback:
1: .Call(`_fst_fstretrieve`, fileName, columnSelection, startRow, endRow, oldFormat)
2: fstretrieve(fileName, columns, from, to, old_format)
3: fst::read_fst("fst-test-1.fst")
My session info:
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.4 parallel_3.4.4 tools_3.4.4 Rcpp_0.12.17 fst_0.8.8
Hi @renkun-ken, thanks for reporting! I think there are two main issues that need to be addressed to solve this kind of instability:
- Currently, the 16 kB data blocks that make up the bulk of the fst file are not hashed. So a compressed block of data is passed to the decompression algorithm without checking its integrity first. Calculating a hash of a 16 kB data block can be done very fast (multiple GB/s) using the xxHash algorithm already contained within the LZ4 and ZSTD libraries. The format is already prepared to store block hashes, so no format change is required when these are implemented (see also #49, and the sketch after this list).
- Some meta-data at the beginning of the file is written after the actual data blocks have been written to disk. When a write is interrupted, that meta-data will be in an incorrect state. This problem can be solved by adding some additional checks and also making sure that all meta-data is hashed.
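To illustrate the first point, here is a conceptual sketch in R of per-block xxHash verification, using the digest package, which exposes xxHash. This is not fst's actual implementation (in fst this would happen in C++ via the hash code bundled with LZ4/ZSTD), and the block_hashes helper is purely illustrative:

# Compute an xxHash64 checksum for each 16 kB block of a file (conceptual only).
library(digest)

block_hashes <- function(path, block_size = 16 * 1024) {
  con <- file(path, "rb")
  on.exit(close(con), add = TRUE)
  hashes <- character(0)
  repeat {
    block <- readBin(con, what = "raw", n = block_size)
    if (length(block) == 0) break
    hashes <- c(hashes, digest(block, algo = "xxhash64", serialize = FALSE))
  }
  hashes
}

# The hashes of the intact and the truncated file diverge at the point where
# the copy was cut off, so a corrupted block can be detected before decompression:
# identical(block_hashes("~/data/fst-test.fst"), block_hashes("~/data/fst-test-1.fst"))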
With release v0.8.8, all meta-data is initialized to zero before writing to file. Did you encounter these problems with v0.8.6 as well?
thanks!
I tested with v0.8.6 and reading a partially copied file seems to hang forever. With prior versions, it may also crash.
Hi @renkun-ken, I think the hanging is due to the decompressor receiving an incomplete data block, or due to incorrect meta-data making the file pointer jump to the same location in the file again and again. Fixing this instability is important; I have added it to the next release milestone.
Thanks for providing the example code that reproduces the issue (most of the time :-))!