
big_fread1 function and the possibility of parallelizing it

laurentGithub13 opened this issue 2 years ago · 8 comments

Hello, I have a very large file to process and little RAM, so I decided to use the big_fread1 function. I use a window of 200 lines and, on every window, I do the calculation I need on only three columns of my data frame. The script below works fine on a file of moderate size, but on my actual file it takes many, many hours. Could you please tell me whether it would be possible to parallelize this script easily? The calculation in each window is independent, so afterwards I could sort the results on the time variable to put them back together in the right order.

Thank you. Best regards, Laurent

library(bigreadr)
library(dplyr)

# my text file containing the data frame
csv <- "./txt/my_very_huge_file.txt"

my_results <- big_fread1(csv,
                         every_nlines = 200,
                         skip = 0,
                         header = TRUE,
                         .transform = function(df) {
                           df %>%
                             dplyr::select(tindex, param1, param2, param3) %>%
                             summarise(time   = sum(range(tindex)) / 2,  # midpoint of the 200-line window
                                       param1 = my_function(param1),
                                       param2 = my_function(param2),
                                       param3 = my_function(param3))
                         })

laurentGithub13 avatar Jan 20 '23 13:01 laurentGithub13

What are the file size, the number of lines, and the number of columns?

privefl avatar Jan 20 '23 14:01 privefl

File size: 6 GB, 4 columns, 81,210,693 lines. (Initially the file had 19 variables, which is why I do a dplyr::select in my script, but I have since found a way to reduce the number of columns.)

laurentGithub13 avatar Jan 20 '23 21:01 laurentGithub13

If you have 81M lines and only 4 columns, you should use a much larger every_nlines, maybe up to 1M.
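
For a rough sense of what that chunking gives here, a small sketch using bigreadr::nlines() and the csv path from the first post (the exact counts are approximate, since the header may or may not be included in the line count):

library(bigreadr)

n <- nlines(csv)   # total number of lines in the file, ~81.2M here
ceiling(n / 1e6)   # number of ~1e6-line chunks big_fread1 would read: about 82
1e6 / 200          # 200-line groups per chunk: 5000, so the windows stay aligned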

privefl avatar Jan 21 '23 13:01 privefl

Yes, I will try that. In fact, for physical reasons I have to do my calculation on every 200 lines, which is why I chose every_nlines = 200. So I modified my script to do the calculation on every 200 lines within each chunk of 1e+6 lines, as below:

my_results <- big_fread1(csv,
                         every_nlines = 1e6,
                         skip = 0,
                         header = TRUE,
                         .transform = function(df) {
                           cat("Current process: data between ", min(df$tindex), " and ", max(df$tindex), "\n")
                           df <- as.data.frame(df)
                           # one group per 200-line window (1e6 / 200 = 5000 groups per chunk)
                           df$group <- ordered(as.numeric(ggplot2::cut_number(df$tindex, n = 1e6 / 200)))
                           df %>%
                             dplyr::select(tindex, param1, param2, param3, group) %>%
                             dplyr::group_by(group) %>%
                             summarise(time   = sum(range(tindex)) / 2,
                                       param1 = my_function(param1),
                                       param2 = my_function(param2),
                                       param3 = my_function(param3))
                         })

I launched the script......

laurentGithub13 avatar Jan 22 '23 15:01 laurentGithub13

The splitting is faster, but I still have the same problem with the run time. In fact, it is my calculation that takes a long time. Would it be possible to parallelize the work done on every chunk of lines?

laurentGithub13 avatar Jan 23 '23 21:01 laurentGithub13

Yes, it should be possible. You can just adapt the source code of the function by replacing lapply() with foreach(). You should not need to redo the splitting of the file (the first step). If you do not know how to use foreach for parallelism, have a look at this tutorial.
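
A minimal sketch of that idea, using bigreadr's split_file() / get_split_files() / fread2() helpers plus foreach and doParallel; csv and my_function() are the placeholders from the earlier posts, and the chunk size and number of workers are assumptions to adjust to the available CPU and RAM:

library(bigreadr)
library(dplyr)
library(foreach)
library(doParallel)

# Split the big file once (sequential step); repeat the header in each part
infos      <- split_file(csv, every_nlines = 1e6, repeat_header = TRUE)
file_parts <- get_split_files(infos)

cl <- makeCluster(4)   # number of workers: an assumption, adjust to your machine
registerDoParallel(cl)

my_results <- foreach(part = file_parts, .combine = rbind,
                      .packages = c("bigreadr", "dplyr"),
                      .export   = "my_function") %dopar% {
  df <- fread2(part)
  df %>%
    mutate(group = (row_number() - 1) %/% 200) %>%   # one group per 200-line window
    group_by(group) %>%
    summarise(time   = sum(range(tindex)) / 2,
              param1 = my_function(param1),
              param2 = my_function(param2),
              param3 = my_function(param3)) %>%
    select(-group)
}

stopCluster(cl)

# Sort on the time variable to put the windows back in order, as in the first post
my_results <- my_results[order(my_results$time), ]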

privefl avatar Jan 24 '23 07:01 privefl

Thank you for the link. I will look at it and try to do something. Thanks.

laurentGithub13 avatar Jan 24 '23 18:01 laurentGithub13

Any update on this?

privefl avatar Aug 17 '23 08:08 privefl