ggalluvial

profile code and optimize bottlenecks

Open corybrunson opened this issue 6 years ago • 4 comments

Description of the issue

Diagrams for large datasets take a long time to render. The bottlenecks might be due to inefficiencies in the code. Profile the code, identify the bottlenecks, and benchmark alternative implementations. (See this chapter in Advanced R.)
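A minimal sketch of how the profiling step might start, assuming base R's `Rprof()` and ggplot2's `ggplotGrob()` (which builds the plot without drawing it). The synthetic data, its size, and the object names are illustrative assumptions, not a benchmark agreed on in this issue.

```r
library(ggplot2)
library(ggalluvial)

# Illustrative synthetic data (an assumption): 100 subjects observed at 10 waves
set.seed(1)
n_subjects <- 100
n_waves <- 10
dat <- data.frame(
  id = rep(seq_len(n_subjects), each = n_waves),
  wave = factor(rep(seq_len(n_waves), times = n_subjects)),
  status = factor(sample(c("A", "B", "C", "D"), n_subjects * n_waves, replace = TRUE))
)

p <- ggplot(dat, aes(x = wave, stratum = status, alluvium = id, fill = status)) +
  geom_flow(stat = "alluvium", lode.guidance = "frontback") +
  geom_stratum()

# Coarse benchmark: time the build-and-convert step, without drawing to a device
elapsed <- system.time(grob <- ggplotGrob(p))

# Sampling profile of the same step; by.total lists the calls with the most time
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
grob <- ggplotGrob(p)
Rprof(NULL)
head(summaryRprof(prof_file)$by.total, 10)
```

Timing `ggplotGrob()` rather than `print()` keeps device overhead out of the measurement, so the numbers reflect the stat and geom computations.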

Reproducible example (preferably using reprex::reprex())

(Need a suitable public dataset.)

corybrunson, Apr 03 '18 18:04

@corybrunson This package is awesome. Thank you for taking the time to build it! I would love to help out.

Could you tell me which scripts in your /ggalluvial/R folder are relevant when running the following lines of code?

library(ggplot2)
library(ggalluvial)

data(vaccinations)
# reverse the order of the response levels
levels(vaccinations$response) <- rev(levels(vaccinations$response))
ggplot(vaccinations,
       aes(x = survey, stratum = response, alluvium = subject,
           weight = freq,  # newer ggalluvial releases use y = freq instead of weight
           fill = response, label = response)) +
  geom_flow() +
  geom_stratum(alpha = .5) +
  geom_text(stat = "stratum", size = 3) +
  theme(legend.position = "none") +
  ggtitle("vaccination survey responses at three points in time")

In the meantime, I'm hoping to create a data set that contains 5 million rows and 3 columns to use in the reprex.
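One way such a data set could be generated is sketched below. `make_alluvial_data()` is a hypothetical helper, and the column names and status labels are assumptions chosen to match the aesthetics used above. It is run at a small size here; the commented call shows the arguments that would give 5 million rows.

```r
# Hypothetical helper: one row per subject per wave, with a random status
make_alluvial_data <- function(n_subjects, n_waves,
                               statuses = c("A", "B", "C", "D")) {
  data.frame(
    id = rep(seq_len(n_subjects), each = n_waves),
    wave = factor(rep(seq_len(n_waves), times = n_subjects)),
    status = factor(sample(statuses, n_subjects * n_waves, replace = TRUE),
                    levels = statuses)
  )
}

set.seed(42)
small <- make_alluvial_data(n_subjects = 1000, n_waves = 5)
dim(small)

# 1e6 subjects x 5 waves = 5 million rows, 3 columns (not run here)
# big <- make_alluvial_data(n_subjects = 1e6, n_waves = 5)
```

Parameterizing the size makes it possible to time the plot at several scales and see how rendering time grows.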

cenuno avatar Apr 18 '18 16:04 cenuno

@cenuno thank you for saying so! I'd be very glad for the large-scale example. The code chunk you shared relies on functions defined in the files stat-flow.r, geom-flow.r, stat-stratum.r, and geom-stratum.r, and possibly indirectly some code in stat-utils.r, geom-utils.r, and lode-guidance-functions.r. (In general, a layer—usually stat_*() or geom_*()—invokes one stat and one geom, and the stats and geoms are roughly paired up in this package.)
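The pairing can be checked directly by inspecting a layer object. This is a small sketch that assumes layers expose their stat and geom as the `stat` and `geom` fields, as ggplot2 layer objects do.

```r
library(ggplot2)
library(ggalluvial)

# Each layer carries one stat and one geom; print the ggproto class of each
flow_layer <- geom_flow()
class(flow_layer$stat)[1]
class(flow_layer$geom)[1]

stratum_layer <- geom_stratum()
class(stratum_layer$stat)[1]
class(stratum_layer$geom)[1]
```

The class names point to the stat and geom definitions, and so to the source files that are worth profiling first.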

corybrunson avatar Apr 19 '18 01:04 corybrunson

Sweet. I'll start investigating using the vaccinations data set just to get a sense of the workflow. It will probably take a while, but I want this to work with larger data sets, as I'm sure others do as well.

cenuno, Apr 20 '18 14:04

library(tidyverse)
library(ggalluvial)

set.seed(1)  # added so the randomly assigned statuses are reproducible
i <- 100     # number of subjects (alluvia)
waves <- 10  # number of time points (axes)
alluvial_test <- tibble(
  id = as.numeric(rep(1:i, each = waves)),
  wave = factor(rep(1:waves, i)),
  status = factor(sample(rep(c("A", "B", "C", "D"), each = i * waves / 4)),
                  levels = c("A", "B", "C", "D"))
)

p <- ggplot(data = alluvial_test,
            aes(x = wave, stratum = status, alluvium = id,
                fill = status, label = status))
p +
  geom_flow(stat = "alluvium", lode.guidance = "frontback", color = "darkgray") +
  geom_stratum()

Created on 2021-12-02 by the reprex package (v2.0.1)

Increasing i and waves quickly makes the plot very slow ;-) Anyway, for my purposes it would be enough to group by status and show only the transitions between the groups, not the individual subjects. I'm currently thinking about how to regroup the data, but am drawing a blank...
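One possible way to regroup, sketched in base R on a toy data set: pair each subject's status at one wave with their status at the next wave, then count subjects per (wave, from, to) combination. The toy data and the names `from`, `to`, and `n` are assumptions for illustration; how best to feed such counts back into a plot is left open.

```r
# Toy long-format data: 6 subjects, 3 waves, random status per observation
set.seed(1)
toy <- data.frame(
  id = rep(1:6, each = 3),
  wave = rep(1:3, times = 6),
  status = sample(c("A", "B"), 18, replace = TRUE)
)

# Sort so that consecutive rows are consecutive waves of the same subject
toy <- toy[order(toy$id, toy$wave), ]

# Align every row with the row after it, keeping pairs within one subject
cur <- toy[-nrow(toy), ]
nxt <- toy[-1, ]
keep <- cur$id == nxt$id

pairs <- data.frame(
  wave = cur$wave[keep],     # wave at which the transition starts
  from = cur$status[keep],   # status at that wave
  to = nxt$status[keep],     # status at the following wave
  n = 1L
)

# One row per observed transition type, with the number of subjects making it
transitions <- aggregate(n ~ wave + from + to, data = pairs, FUN = sum)
transitions
```

The aggregated table has at most (number of waves - 1) x (number of statuses)^2 rows regardless of how many subjects there are, which is what makes this regrouping attractive for large data.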

universal, Dec 02 '21 15:12