
design choice: why legacy wide format for some functions?

Open apsteinmetz opened this issue 7 years ago • 3 comments

In your vignette, "Performance Analysis with tidyquant," you choose to make a table, "RaRb," to compare an asset to its benchmark. This is wide data. I would argue that the benchmark is an asset just like the stock or portfolio and should be part of the long format data set. This is how ggplot will take the data to compare the portfolio and its benchmark. Tidyquant functions that reference Ra and Rb should, in effect, use filter() to get whichever named asset and benchmark are relevant, without the user having to go through step "3b" of the workflow described in the vignette. Be tidy all the way.

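library(tidyverse)  # tibble(), gather(), group_by(), mutate(), ggplot()
library(tidyquant)  # tq_performance()
library(zoo)        # as.yearmon()
set.seed(12345)
RaRb_tibble<-tibble(date=as.Date(as.yearmon("2009-01-01")+seq(1/12,3,1/12),frac=1),
             Ra=rnorm(3*12,0.06/12,.01),
             Rb=rnorm(3*12,0.055/12,.01))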


#current tidyquant design
RaRb_tibble %>%
  tq_performance(Ra = Ra, Rb = Rb, performance_fun = table.CAPM)

#It must be tidy to do common plots
RaRb_tidy<-RaRb_tibble %>% 
  gather(asset,return,-date) %>% 
  group_by(asset) %>% 
  mutate(wealth.index=cumprod(1+return))

ggplot(RaRb_tidy,aes(x=date,y=wealth.index,color=asset))+geom_line()
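To make the request concrete, here is a minimal sketch of the filter-and-reshape step I would like to happen internally. The helper `long_to_RaRb()` and the made-up `returns_long` data are hypothetical and not part of tidyquant; the sketch assumes long data with `date`, `asset`, and `return` columns, as in `RaRb_tidy` above.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (NOT a tidyquant function): pick one asset and one
# benchmark out of long data and return the wide date/Ra/Rb layout.
long_to_RaRb <- function(data, asset_name, benchmark_name) {
  data %>%
    filter(asset %in% c(asset_name, benchmark_name)) %>%
    mutate(role = if_else(asset == asset_name, "Ra", "Rb")) %>%
    select(date, role, return) %>%
    pivot_wider(names_from = role, values_from = return)
}

# Small long-format example with made-up returns
returns_long <- tibble(
  date   = rep(as.Date(c("2009-01-31", "2009-02-28", "2009-03-31")), times = 3),
  asset  = rep(c("portfolio", "benchmark", "other"), each = 3),
  return = c(0.010, -0.020, 0.030,  0.005, -0.010, 0.020,  0.000, 0.010, 0.010)
)

RaRb_wide <- long_to_RaRb(returns_long, "portfolio", "benchmark")

# With tidyquant loaded, the result could then be used as in the current design:
# RaRb_wide %>% tq_performance(Ra = Ra, Rb = Rb, performance_fun = table.CAPM)
```

The user would only name the asset and the benchmark; the wide table becomes an intermediate detail rather than a required workflow step.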

BTW, Tidyquant is awesome. I have been getting tidy-er lately and have been annoyed at switching back and forth with xts to do analysis. Thanks also for the detailed vignettes. The TLC and effort you put into them really shows!

apsteinmetz Jun 21 '17 01:06

Yes, this seems like an odd choice considering that we are technically using a wide format. The reason we have the assets and the benchmark in wide format is because essentially every function that compares two sets of data in R requires a wide format. For example, a simple correlation in R uses two columns (an x and a y). Because of the setup of the infrastructure, it's easier in this situation to use non-tidy data. Otherwise, we'd have to recode all the functions to work how we want them and that's not feasible at the moment.
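As a rough illustration of the correlation point, the sketch below starts from long data and has to spread it into two columns before `cor()` can be called. The data are made up (Rb is simply twice Ra), and the `date`/`asset`/`return` column names are assumptions borrowed from the example above.

```r
library(dplyr)
library(tidyr)

# Made-up long-format returns; Rb is exactly 2 * Ra
returns_long <- tibble(
  date   = rep(as.Date(c("2009-01-31", "2009-02-28", "2009-03-31", "2009-04-30")), times = 2),
  asset  = rep(c("Ra", "Rb"), each = 4),
  return = c(0.01, -0.02, 0.03, 0.00,  0.02, -0.04, 0.06, 0.00)
)

# cor() wants an x vector and a y vector, so go wide first
returns_wide <- returns_long %>%
  pivot_wider(names_from = asset, values_from = return)

rho <- cor(returns_wide$Ra, returns_wide$Rb)
```

The reshape is a one-liner here, but every two-series function would need the same treatment.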

With ggplot2, you can always add a line that uses a different set of the data. For example, say you have a graph that uses RaRb_tibble:

RaRb_tibble %>%
    ggplot(aes(x = date, y = Ra, color = symbol)) +
    geom_line()

You can then add Rb as a second line, like so. Ideally you'd have a column you could filter on, such as symbol, and you would just filter by the first symbol.

    geom_line(aes(y = Rb), data = filter(RaRb_tibble, symbol == first_symbol))

As you indicate, graphing with non-tidy data can be a pain, but there are some workarounds to get what you need into ggplot.
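For a wide table with only `date`, `Ra`, and `Rb` columns (no symbol column), one such workaround can be sketched as one `geom_line()` layer per column. The data below are made up, and the legend labels are just constant strings mapped to color.

```r
library(ggplot2)

# Made-up wide data: one column per series
RaRb_wide <- data.frame(
  date = as.Date(c("2009-01-31", "2009-02-28", "2009-03-31")),
  Ra   = c(0.010, -0.020, 0.030),
  Rb   = c(0.005, -0.010, 0.020)
)

# One layer per column; the constant color strings become legend entries
p <- ggplot(RaRb_wide, aes(x = date)) +
  geom_line(aes(y = Ra, color = "Ra")) +
  geom_line(aes(y = Rb, color = "Rb")) +
  labs(y = "return", color = "asset")
```

This works for two series but needs one extra layer per additional column, which is where long data becomes more convenient.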

mdancho84 Jun 21 '17 12:06

> The reason we have the assets and the benchmark in wide format is because essentially every function that compares two sets of data in R requires a wide format.

True, but most of what tidyquant does is allow the user to feed long tidy data to finance packages that require a wide format. Just seems odd in this instance.

In my typical use case, it gets squirrelly because I often compare multiple portfolios against multiple benchmarks. I use a relational table to link each portfolio to its benchmark.
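Roughly, the linking looks like the sketch below. The mapping table `benchmark_map`, the asset names, and the returns are all made up for illustration; the only assumption about the real data is a long table with `date`, `asset`, and `return` columns.

```r
library(dplyr)

# Made-up long returns for two portfolios and two benchmarks
returns_long <- tibble(
  date   = rep(as.Date(c("2009-01-31", "2009-02-28")), times = 4),
  asset  = rep(c("port_A", "port_B", "bench_X", "bench_Y"), each = 2),
  return = c(0.010, 0.020,  0.030, -0.010,  0.005, 0.015,  0.020, 0.000)
)

# Hypothetical relational table: which benchmark belongs to which portfolio
benchmark_map <- tibble(
  portfolio = c("port_A", "port_B"),
  benchmark = c("bench_X", "bench_Y")
)

# Join returns in twice: once as the portfolio (Ra), once as its benchmark (Rb)
RaRb_by_portfolio <- benchmark_map %>%
  inner_join(returns_long, by = c("portfolio" = "asset")) %>%
  rename(Ra = return) %>%
  inner_join(
    rename(returns_long, Rb = return),
    by = c("benchmark" = "asset", "date" = "date")
  ) %>%
  group_by(portfolio)
```

The grouped result has one Ra/Rb pair per portfolio and date, so it should be usable with `tq_performance(Ra = Ra, Rb = Rb, performance_fun = table.CAPM)` one group at a time, but the two joins are exactly the kind of bookkeeping I'd rather not repeat in every analysis.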

Thanks for thinking about this.

apsteinmetz Jun 21 '17 18:06

I agree, and I can see how that gets challenging since some data is long and some is wide.

It's difficult to get every aspect of finance to fit the tidy mold. In this instance let's keep it in the wide format unless there is a pressing need to change (i.e. you need tidy data because it's excessively difficult or impossible to do something).

mdancho84 Jun 23 '17 03:06