disk.frame
Convenient method to write to csv
That would be awesome. I mean, it should be dead simple, but the issue I bump into is speed:
```r
library(data.table)
library(dplyr)
library(tidyr)
library(fst)
library(progress)

fsts <- data.table(path = list.files("data/ft_df/", full.names = TRUE),
                   order = list.files("data/ft_df/"))
fsts <- fsts %>%
  mutate(number = extract_numeric(order) %>% as.numeric()) %>%
  arrange(number)

pb <- progress_bar$new(
  format = " appending [:bar] :percent eta: :eta",
  total = 100, clear = FALSE, width = 60)

for (i in fsts$path) {
  df <- fst::read.fst(i, as.data.table = TRUE)
  fwrite(df, "data/fasttextfile/df.csv", append = TRUE)
  pb$tick()
}
#> appending [==>-----------------------------] 10% eta: 2h
```
That's for 40 files of ~700 MB each. Why is it so slow? Is it really the for loop? For 40 iterations that shouldn't matter, I would think.
I think I haven't used the progress bar correctly; the total should be 40. Nonetheless, the speed seems rather slow. I won't restart it though, it's at 20% now. Still, could one write to CSV in parallel on multiple cores somehow? That would be much faster.
I'm quite sure that's possible; I did it myself some time ago with a parallel for loop. I remember it worked.
By disabling data.table's internal parallelism and using mclapply, that was it, I think. https://github.com/Rdatatable/data.table/issues/1727
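For reference, a minimal sketch of that approach (paths, part-file names, and the core count are illustrative, not from the thread; note that `fwrite(append = TRUE)` is not safe with concurrent writers, so each worker writes its own part file and the parts are stitched together afterwards):

```r
library(data.table)
library(parallel)

setDTthreads(1)  # keep data.table single-threaded inside forked workers

paths <- list.files("data/ft_df/", full.names = TRUE)  # illustrative input dir

# Each worker converts one fst file into its own csv part; no shared append.
parts <- mclapply(seq_along(paths), function(i) {
  out <- sprintf("data/parts/part_%04d.csv", i)
  fwrite(fst::read.fst(paths[i], as.data.table = TRUE), out)
  out
}, mc.cores = 4)

# Concatenate the parts in order, keeping the header only once
# (awk prints every line of the first file, skips line 1 of the rest).
system(paste("awk 'FNR>1 || NR==1'",
             paste(unlist(parts), collapse = " "),
             "> data/df.csv"))
```

`mclapply` forks, so this works on Linux/macOS; on Windows one would need a PSOCK cluster instead.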
so technically no for loop :)
Aaand it's done already. Roughly 15 minutes for a 50 GB file? That's acceptable, but could it be faster?
Curiously, the bottleneck appears to be read.fst and not fwrite... could one parallelize read.fst and then write sequentially, in a well-timed manner?
```r
system.time(fwrite(df, tempfile(), append = TRUE))
#>    user  system elapsed
#>   2.339   0.305   0.450
system.time(df <- fst::read.fst(fsts$path[1], as.data.table = TRUE))
#>    user  system elapsed
#>   6.389   0.113   6.505
```
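If read.fst really is the slow side, one hedged way to overlap the reads while keeping the write strictly sequential is to prefetch batches of files in parallel and append them from a single writer (the batch size and core count below are guesses, not tuned values):

```r
library(data.table)
library(parallel)

setDTthreads(1)  # keep data.table single-threaded inside forked workers

paths <- fsts$path  # the ordered paths built earlier in the thread
batch <- 4          # read this many files ahead; bounds peak memory use

for (b in split(paths, ceiling(seq_along(paths) / batch))) {
  # prefetch a batch of fst files in parallel...
  dts <- mclapply(b, fst::read.fst, as.data.table = TRUE, mc.cores = batch)
  # ...then append them in order from the single writer process
  for (dt in dts) fwrite(dt, "data/fasttextfile/df.csv", append = TRUE)
}
```

This keeps the output file ordered and corruption-free while the expensive reads happen concurrently; the trade-off is that each batch must fit in memory.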
Thanks for the analysis.