Convenient method to write to csv

Open xiaodaigh opened this issue 4 years ago • 8 comments

xiaodaigh avatar Sep 17 '20 02:09 xiaodaigh

That would be awesome. I mean, it should be dead simple, but the issue I bump into is speed:

    fsts <- data.table(path = list.files("data/ft_df/", full.names = T),
                       order = list.files("data/ft_df/"))
    fsts <- fsts %>%
      mutate(number = extract_numeric(order) %>% as.numeric()) %>%
      arrange(number)

    pb <- progress_bar$new(
      format = " appending [:bar] :percent eta: :eta",
      total = 100, clear = FALSE, width = 60)

    for (i in fsts$path) {
      df <- fst::read.fst(i, as.data.table = T)
      fwrite(df, "data/fasttextfile/df.csv", append = T)
      pb$tick()
    }

     appending [==>-----------------------------] 10% eta: 2h

That's for 40 files of about 700 MB each. Why is it so slow? Is it really the for loop? For 40 objects that should not matter, I would think.

KnutJaegersberg avatar Mar 08 '21 05:03 KnutJaegersberg

I think I have not used the progress bar correctly; the total should be 40. Nonetheless, the speed seems rather slow. I won't restart it though, it is at 20% now. Could one somehow write to csv in parallel on multiple cores? That would be much faster.

KnutJaegersberg avatar Mar 08 '21 05:03 KnutJaegersberg

I'm quite sure that is possible; I did that myself some time ago with a parallel for loop. I remember it worked.

KnutJaegersberg avatar Mar 08 '21 05:03 KnutJaegersberg

By un-parallelizing data.table and using mclapply, that was it, I think. https://github.com/Rdatatable/data.table/issues/1727
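A minimal sketch of that idea, not the code that was actually used back then. It assumes that "un-parallelizing" means limiting data.table and fst to one thread each, and that one CSV part per fst chunk is acceptable instead of a single file. The helper name `write_csv_parts` and the `part_%04d.csv` naming are made up for illustration; `mclapply` relies on forking, so `cores > 1` only works on Linux and macOS.

```r
library(data.table)
library(parallel)

# Hypothetical helper: convert every fst chunk to its own CSV part, one chunk per worker.
write_csv_parts <- function(fst_paths, out_dir, cores = 2L) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  out_paths <- file.path(out_dir, sprintf("part_%04d.csv", seq_along(fst_paths)))

  # Assumption: one thread per library, so forked workers do not compete for cores.
  old_dt <- setDTthreads(1)
  old_fst <- fst::threads_fst()
  fst::threads_fst(1)
  on.exit({
    setDTthreads(old_dt)
    fst::threads_fst(old_fst)
  })

  res <- mclapply(seq_along(fst_paths), function(k) {
    dt <- fst::read_fst(fst_paths[k], as.data.table = TRUE)
    fwrite(dt, out_paths[k])  # every part gets its own header row
    TRUE
  }, mc.cores = cores)

  # mclapply returns "try-error" objects instead of stopping when a worker fails.
  failed <- vapply(res, inherits, logical(1), what = "try-error")
  if (any(failed)) stop("failed chunks: ", paste(fst_paths[failed], collapse = ", "))
  out_paths
}
```

Something like `write_csv_parts(fsts$path, "data/fasttextfile/parts")` would then produce the parts. Merging them into one csv afterwards costs an extra pass over the data, so this only pays off if separate part files are fine or the merge is cheap.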

KnutJaegersberg avatar Mar 08 '21 05:03 KnutJaegersberg

so technically no for loop :)

KnutJaegersberg avatar Mar 08 '21 05:03 KnutJaegersberg

Aaand it's done already. Roughly 15 minutes for a 50 GB file? That is acceptable, but could it be faster?

KnutJaegersberg avatar Mar 08 '21 06:03 KnutJaegersberg

Curiously, the bottleneck appears to be read.fst and not fwrite... Could one parallelize read.fst and then write sequentially in a well-timed manner?

    system.time(fwrite(df, tempfile(), append = T))
       user  system elapsed
      2.339   0.305   0.450

    system.time(df <- fst::read.fst(fsts$path[1], as.data.table = T))
       user  system elapsed
      6.389   0.113   6.505
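A rough sketch of "parallel read, sequential write", under stated assumptions: the files are read in batches of `cores` with `mclapply`, and each batch is then appended to the csv in its original order. The helper name `append_in_batches` is hypothetical. Whether this is actually faster is untested here, because every table has to be sent from the worker back to the main process, and a batch keeps `cores` chunks in memory at once.

```r
library(data.table)
library(parallel)

# Hypothetical helper: read `cores` fst files at a time in parallel,
# then append them to one csv sequentially so the row order is preserved.
append_in_batches <- function(fst_paths, csv_path, cores = 2L) {
  # Group the paths into consecutive batches of at most `cores` files.
  batches <- split(fst_paths, ceiling(seq_along(fst_paths) / cores))
  for (batch in batches) {
    tables <- mclapply(batch, fst::read_fst, as.data.table = TRUE,
                       mc.cores = cores)
    for (dt in tables) {
      # With append = TRUE, fwrite writes the header only if the file does not exist yet.
      fwrite(dt, csv_path, append = TRUE)
    }
  }
  invisible(csv_path)
}
```

Usage would be along the lines of `append_in_batches(fsts$path, "data/fasttextfile/df.csv", cores = 4L)`. As with any `mclapply` call, more than one core requires a platform that supports forking.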

KnutJaegersberg avatar Mar 08 '21 06:03 KnutJaegersberg

Thanks for the analysis.

xiaodaigh avatar Mar 16 '21 22:03 xiaodaigh