
Running in parallel or multithreading option

Open susheelbhanu opened this issue 4 years ago • 2 comments

Hey there..

Thanks for the nice tool. Is there a way to run the final matrix plot in a parallel or multi-threaded manner? I have the following table with a lot of rows (1,248,624):

> print(obj$data$diff_table)
# A tibble: 1,248,624 x 7
   taxon_id treatment_1 treatment_2 log2_median_ratio median_diff mean_diff wilcox_p_value
   <chr>    <chr>       <chr>                   <dbl>       <dbl>     <dbl>          <dbl>
 1 aab      gp_1_Early  gp_2_Early              0         0        0.0278            0.803
 2 aac      gp_1_Early  gp_2_Early              0         0        0               NaN    
 3 aad      gp_1_Early  gp_2_Early              0.517     0.277    0.206             0.183
 4 aae      gp_1_Early  gp_2_Early              1.12      0.0107  -0.0715            0.832
 5 aaf      gp_1_Early  gp_2_Early              0         0       -0.000119          0.516
 6 aag      gp_1_Early  gp_2_Early              0         0       -0.00519           0.191
 7 aah      gp_1_Early  gp_2_Early              0         0       -0.00531           0.167
 8 aai      gp_1_Early  gp_2_Early              0         0       -0.0452            0.146
 9 aaj      gp_1_Early  gp_2_Early              0.721     0.00421 -0.0647            0.964
10 aak      gp_1_Early  gp_2_Early           -Inf        -0.0300   0.0313            0.256
# … with 1,248,614 more rows

And I'm trying to plot the final figure using the code below:

heat_tree_matrix(obj,
                 data = "diff_table",
                 node_size = n_obs, # n_obs is a function that calculates, in this case, the number of OTUs per taxon
                 node_label = taxon_names,
                 node_color = log2_median_ratio, # A column from `obj$data$diff_table`
                 node_color_range = diverging_palette(), # The built-in palette for diverging data
                 node_color_trans = "linear", # The default is scaled by circle area
                 node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
                 edge_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
                 node_size_axis_label = "Number of ASVs",
                 node_color_axis_label = "Log2 ratio median proportions",
                 layout = "davidson-harel", # The primary layout algorithm
                 initial_layout = "reingold-tilford", # The layout algorithm that initializes node locations
                 output_file = "differential_heat_tree.pdf", # Saves the plot as a pdf file      
                 key_size = 0.4, # adjust the size of the "key" tree with respect to the plot
                 row_label_size = 16, col_label_size = 16,
                 node_label_size_range = c(0.01,0.05) # node font size, see here:https://github.com/grunwaldlab/metacoder/issues/245
                  )

It's been running for 69h 51m so far and still no end in sight. So having a multi-thread option may help with the time.

Thanks a lot, Susheel

susheelbhanu, Jul 30 '20 06:07

Hello, sorry for the delay. There is no such option, but it would be a useful addition. I will look into it when I get a chance. 70h sounds really long, even for a large dataset. I would recommend the following ways to optimize it:

  • Filter out taxa you don't need to plot. You can do this before or after calculating differences. If you have a lot of comparisons, too many taxa will be confusing anyway. Remove uncommon taxa, or remove ranks from the taxonomy. Start small and add taxa until it takes too long.
  • Split it up into multiple plots if possible.
  • Check that you are not running out of RAM. If you are running out of RAM and it is using swap, it will be really slow.
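
As a rough illustration of the first two suggestions, the base-R sketch below counts how many taxa and pairwise comparisons a difference table holds, then keeps only the comparisons among a chosen subset of treatments. The toy `diff_table`, the extra treatment names, and the `keep` vector are made up for illustration; only the column names come from the table in the original post.

```r
# Toy stand-in for obj$data$diff_table (hypothetical values; the column
# names match the table shown in the original post).
treatments <- c("gp_1_Early", "gp_2_Early", "gp_3_Early", "gp_1_Late")
pairs <- t(combn(treatments, 2))  # one row per pairwise comparison

diff_table <- data.frame(
  taxon_id          = rep(c("aab", "aac", "aad"), each = nrow(pairs)),
  treatment_1       = rep(pairs[, 1], times = 3),
  treatment_2       = rep(pairs[, 2], times = 3),
  log2_median_ratio = 0,
  stringsAsFactors  = FALSE
)

# How big is the problem? Each treatment pair becomes one subplot
# in the matrix, and every subplot draws every taxon.
n_taxa  <- length(unique(diff_table$taxon_id))
n_pairs <- nrow(unique(diff_table[, c("treatment_1", "treatment_2")]))

# Keep only the comparisons among a subset of treatments
# (the `keep` vector is an assumption chosen for this example).
keep <- c("gp_1_Early", "gp_2_Early", "gp_3_Early")
sub_table <- diff_table[diff_table$treatment_1 %in% keep &
                          diff_table$treatment_2 %in% keep, ]

cat("taxa:", n_taxa, "comparisons:", n_pairs,
    "rows kept:", nrow(sub_table), "\n")
```

Assuming the table is stored as in the original post, the reduced table could then be written back with `obj$data$diff_table <- sub_table` before calling `heat_tree_matrix`, giving one smaller plot per subset of treatments.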

zachary-foster, Aug 03 '20 17:08

Hey @zachary-foster Thanks a lot for the pointers. Yeah, it did run for longer than that, and eventually finished though - so all in all, a good exercise. I did trim a lot of them out, but it was one of those datasets with a lot of samples and groups to begin with.

Will certainly try the multiple plots option. It was mostly running in RAM, so that was not an issue. Given the more than 1.2 million rows to parse, I figured a long run time was to be expected.

While I'm here though, it would also be helpful if there were a way to adjust the legend outside of the plotting function itself, especially w.r.t. the font sizes etc. 'Cos if I want to adjust the legend size, I have to run the full plot and wait to see if it worked, right? Or is there an alternative?

Thanks again!

susheelbhanu, Aug 04 '20 07:08