fst
Write failure corrupts existing file
require(fst) # 0.9.0
df <- data.frame(x=1, y=2)
ff <- tempfile()
write_fst(df, ff)
read_fst(ff)
# x y
# 1 1 2
df <- data.frame(x=1, y=2+3i)
write_fst(df, ff)
# Error in write_fst(df, ff) : Unknown type found in column.
read_fst(ff)
# Error in read_fst(ff) :
# It seems the file header was damaged or incomplete
I think write should not corrupt existing files. Perhaps write to a temp file in the same folder and rename it on success?
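The suggested approach could look roughly like this (`safe_write_fst` is a hypothetical wrapper for illustration, not part of fst):

```r
library(fst)

# Hypothetical wrapper: write to a temp file in the same directory,
# then rename it over the target only if the write succeeded.
safe_write_fst <- function(df, path, ...) {
  tmp <- tempfile(tmpdir = dirname(path), fileext = ".fst")
  on.exit(if (file.exists(tmp)) file.remove(tmp), add = TRUE)
  write_fst(df, tmp, ...)
  file.rename(tmp, path)  # atomic on most local filesystems
}

ff <- tempfile()
safe_write_fst(data.frame(x = 1, y = 2), ff)

# A failing write (complex column) no longer touches the original file:
try(safe_write_fst(data.frame(x = 1, y = 2 + 3i), ff))
read_fst(ff)  # still reads the previous, valid data
```

Note the rename-over semantics assume source and target are on the same filesystem, which is why the temp file is created in the target's own directory.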
Hi @arunsrinivasan, thanks for reporting!
Indeed, it's not very nice to leave the user with a corrupt file where there was a correct file earlier. On the other hand, because the user already had the intention of overwriting the previous file, that might not be a very big issue for the user, what do you think?
In this case, would it be a solution to scan the column types before starting the write? That way, the user is informed very quickly about incompatible column types and the previous file won't be affected. After such a scan, the on-disk ordering of columns could also be changed to the most effective ordering for speed (for example, in theory we could build a better schedule for the over-stressed master thread and start collecting string sizes from a character column while using the other threads to write out an integer column).
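A minimal sketch of such a pre-write scan could look as follows (the function name and the list of supported types are illustrative, not fst's actual internals):

```r
# Hypothetical pre-write check: fail fast, before any bytes hit the disk.
check_column_types <- function(df) {
  supported <- c("logical", "integer", "numeric", "character",
                 "factor", "Date")
  bad <- names(df)[!vapply(df, function(col)
    class(col)[1] %in% supported, logical(1))]
  if (length(bad) > 0)
    stop("Unsupported column type in column(s): ",
         paste(bad, collapse = ", "))
  invisible(TRUE)
}

df <- data.frame(x = 1, y = 2 + 3i)
try(check_column_types(df))  # errors before any write starts
```

Because the scan only inspects column classes, it is cheap relative to the write itself, and the existing file is never opened when it fails.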
Of course, your solution is more general and would handle other possible errors elegantly as well.
thanks
On the other hand, because the user already had the intention of overwriting the previous file, that might not be a very big issue for the user, what do you think?
Not sure about this. Suppose new data comes in every minute, and the old data needs to be loaded and merged with the new data: duplicate rows in the old data must be replaced with the new ones if they exist (i.e., some columns must be replaced with new values), and the result written back to disk. If the file gets corrupted, the old data is gone (particularly if a script is running in the background to do this). I can think of ways to manually back up the file before doing this, but I think it should be handled in fst.
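The read-merge-write cycle described above might look like this data.table sketch (the table contents are invented for illustration):

```r
library(data.table)
library(fst)

ff <- tempfile(fileext = ".fst")
old <- data.table(id = 1:3, value = c(10, 20, 30))
write_fst(old, ff)

# New batch arrives: replace rows with duplicate ids, append new ones
new <- data.table(id = c(2L, 4L), value = c(99, 40))
old <- read_fst(ff, as.data.table = TRUE)
merged <- rbindlist(list(old[!new, on = "id"], new))  # anti-join + append
setkey(merged, id)

# If this overwrite fails midway, the only copy of the old data is lost
write_fst(merged, ff)
```

The last line is exactly the vulnerable step: a crash during that `write_fst` would leave neither the old nor the new data readable.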
In this case, would it be a solution to scan the column types before starting the write?
Yes, an early check/error would be a great intermediate that avoids most cases I'd think.
Hi @arunsrinivasan,
thanks, yes, very true. Scanning early will mitigate the problems with fst writing incomplete files when unsupported column types are present. So definitely our best option to solve this issue!
By the way, is there another issue where we handle the same error when it occurs because the file is simultaneously being written and read? The situation is very similar to what @arunsrinivasan described: every minute new data arrives (from IoT devices in my case), the data.frame is read using read_fst(..., as.data.table = TRUE), merged with the new data (using merge.data.table), and the file is overwritten using write_fst(). So while there is now (hopefully) no risk of overwriting with a corrupted file, the simultaneous read and write is perhaps causing the same error:
Error in read_fst(ff) :
It seems the file header was damaged or incomplete
Is there a way the read can wait till the write is over?
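One possible workaround on the reader's side, until proper coordination exists, is to retry the read a few times when it fails mid-write (`read_fst_retry` is a hypothetical helper, not an fst feature):

```r
library(fst)

# Retry a read that may fail while a writer is mid-write.
read_fst_retry <- function(path, tries = 5, wait = 0.5, ...) {
  for (i in seq_len(tries)) {
    res <- tryCatch(read_fst(path, ...), error = function(e) e)
    if (!inherits(res, "error")) return(res)
    Sys.sleep(wait)  # back off and let the writer finish
  }
  stop("read_fst failed after ", tries, " attempts: ",
       conditionMessage(res))
}
```

A more robust fix is to have the writer and reader share a cross-process lock (e.g. via the filelock package on CRAN), or to have the writer use the temp-file-plus-rename scheme discussed above so readers never observe a partially written file.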
The combo of data.table + the fst package gives terrific speed in read-process-write cycles. It's a pity that this combo is currently not reliable enough to implement in large IoT production systems where data is loaded every minute into thousands of fst files. Even if one file gets corrupted, the entire processing pipeline comes to a halt as R crashes.
@arunsrinivasan: Since the fst package author @MarcusKlik is not responding, and I know you are the author of the data.table package, would you have any other advice for me?