The big_fread1 function and the possibility of parallelizing it
Hello, I have a very large file to process and little RAM, so I decided to use the big_fread1 function. I use a window of 200 lines, and on every window I do the calculation I need on only three columns of my data frame. The script below works fine on a file of moderate size, but on my actual file it takes many, many hours. Could you please tell me whether it would be possible to parallelize this script easily? The calculation can be done independently in every window, and afterwards I could sort the results on the time variable to put them back together in the right order.
Thank you. Best regards, Laurent
# packages used in the transform function
library(bigreadr)
library(dplyr)

# my text file containing the data frame
csv <- "./txt/my_very_huge_file.txt"

my_results <- big_fread1(
  csv,
  every_nlines = 200,
  skip = 0,
  header = TRUE,
  .transform = function(df) {
    df %>%
      dplyr::select(tindex, param1, param2, param3) %>%
      summarise(
        time   = sum(range(tindex)) / 2,  # midpoint of the time window
        param1 = my_function(param1),
        param2 = my_function(param2),
        param3 = my_function(param3)
      )
  }
)
What is the size of the file, and how many lines and columns does it have?
File size: 6 GB, 4 columns, 81,210,693 lines. (Initially the file had 19 variables, which is why I do a dplyr::select in my script, but I have since found a way to reduce the number of columns.)
If you have 81M lines and only 4 columns, you should use a much larger every_nlines, maybe up to 1M.
Yes, I will try that. In fact, for physical reasons, I have to do my calculation on every 200 lines, which is why I chose every_nlines = 200. So I modified my script to do the calculation on every 200 lines within each chunk of 1e+6 lines, as below:
my_results <- big_fread1(
  csv,
  every_nlines = 1e6,
  skip = 0,
  header = TRUE,
  .transform = function(df) {
    cat("Current process: data between ", min(df$tindex), " and ", max(df$tindex), "\n")
    df <- as.data.frame(df)
    # split each chunk into groups of 200 lines (cut_number() is from ggplot2)
    df$group <- ordered(as.numeric(cut_number(df$tindex, n = 1e6 / 200)))
    df %>%
      dplyr::select(tindex, param1, param2, param3, group) %>%
      dplyr::group_by(group) %>%
      summarise(
        time   = sum(range(tindex)) / 2,
        param1 = my_function(param1),
        param2 = my_function(param2),
        param3 = my_function(param3)
      )
  }
)

I launched the script...
The splitting is faster, but I still have the same problem with the running time; in fact, it is my calculation that takes a long time. Would it be possible to parallelize the work done on every chunk of lines?
Yes, it should be possible.
You can just adapt the source code of the function by replacing lapply() with foreach() (see the sketch below).
You should not need to redo the splitting of the file (the first step).
If you do not know how to use foreach for parallelism, have a look at this tutorial.
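Roughly, a minimal sketch could look like the following. It is not the exact big_fread1 internals: it reuses split_file(), get_split_files() and fread2() from bigreadr for the splitting and reading, takes my_function, the column names and the 200-line grouping from the scripts above, and the number of cores and the repeat_header choice are just example assumptions.

library(bigreadr)
library(dplyr)
library(foreach)
library(doParallel)

registerDoParallel(cores = 4)  # example value: adjust to your machine

# split the file once into parts of 1e6 lines, each part keeping the header
infos <- split_file(csv, every_nlines = 1e6, repeat_header = TRUE)
file_parts <- get_split_files(infos)

my_results <- foreach(part = file_parts, .combine = rbind,
                      .export = "my_function",
                      .packages = c("bigreadr", "dplyr")) %dopar% {
  df <- fread2(part)
  # same transform as above: summarise every 200 lines of the chunk
  df$group <- ordered(as.numeric(ggplot2::cut_number(df$tindex,
                                                     n = ceiling(nrow(df) / 200))))
  df %>%
    dplyr::group_by(group) %>%
    dplyr::summarise(
      time   = sum(range(tindex)) / 2,
      param1 = my_function(param1),
      param2 = my_function(param2),
      param3 = my_function(param3)
    ) %>%
    dplyr::select(-group)
}

# put the 200-line windows back in chronological order
my_results <- dplyr::arrange(my_results, time)

Each file part is then processed on its own worker, and sorting on time at the end restores the original order, as you suggested.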
Thank you for the link. I will look at it and try to do something. Thanks.
Any update on this?