
Calculate a metric with group_by()

SHo-JANG opened this issue 1 year ago · 1 comment

Here's what I want to do, specifically. Say I have monthly trading data for every ticker in the stock market. I want to rank the predicted returns of all stocks within each year-month, and then compute a statistic using only the top 10% of stocks by predicted return in each month, for example the RMSE of that top 10%. The goal is to tune the hyperparameters so that the actual returns of the stocks with the top 10% predicted returns are as high as possible.

In other words, I want to find hyperparameters that tend to get the top 10% right, even if they get the bottom 90% wrong, rather than trying to get the whole universe right.
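
To make this concrete, here is a rough sketch of the grouped calculation I have in mind (the names predictions, yearmonth, .pred, and return_pct are just placeholders for my data):

library(dplyr)
library(yardstick)

# predictions: one row per ticker per month, with the realised return
# (return_pct) and the model's predicted return (.pred)
monthly_top_decile <- predictions |>
  group_by(yearmonth) |>
  mutate(decile = ntile(.pred, 10)) |>   # rank stocks within each month
  filter(decile == 10) |>                # keep only the predicted top 10%
  summarise(rmse_top10 = rmse_vec(truth = return_pct, estimate = .pred))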

I tried to define a custom metric, but I ran into limitations. I wanted to put a group_by(yearmonth) step into the calculation, but I couldn't figure out how to do it, so I wrote the makeshift code below.

library(dplyr)     # mutate(), group_by(), ntile(), summarise(), ...
library(yardstick) # metric_vec_template(), new_numeric_metric(), metric_summarizer()

# custom metric -----------------------------------------------------------
# irr        = the return of a portfolio built from the stocks in the predicted
#              top 10% of returns, rebalanced monthly.
# return_pct = monthly cumulative return ratio, e.g. the cumulative return
#              from May 1 to May 31.
irr_vec <- function(truth,
                    estimate,
                    n_tiles = 10,
                    purpose_tile = 10, # predicted top 10% return = 10, bottom 10% = 1
                    na_rm = TRUE,
                    case_weights = NULL,
                    ...) {

  irr_impl <- function(truth, estimate, ..., case_weights = NULL) {

    # Identify the current fold by matching the length of `truth` against the
    # row count of each pre-split validation year (a global lookup table).
    # Caveat: if two folds happen to have the same number of rows, this
    # matching trick breaks down.
    fold_index <<- NULL
    for (i in seq_len(nrow(valid_years_splited))) {
      if (length(truth) == nrow(valid_years_splited$data[[i]])) {
        fold_index <<- i
      }
    }

    irr <- valid_years_splited$data[[fold_index]] |>
      mutate(estimate = estimate) |>
      group_by(yearmonth) |>
      mutate(top_n_pct = ntile(estimate, n_tiles)) |>
      filter(top_n_pct == purpose_tile) |>
      # mean_y: monthly return of the portfolio made of the predicted top 10%
      summarise(mean_y = mean(return_pct)) |>
      # irr: the portfolio's one-year cumulative return ratio
      mutate(irr = cumprod(mean_y / 100 + 1)) |>
      slice_tail(n = 1) |>
      pull(irr)

    # Cross-validation summarises metrics by their mean, so returning log(irr)
    # makes the averaged value behave like a log geometric mean.
    log(irr)
  }

  metric_vec_template(
    metric_impl = irr_impl,
    truth = truth,
    estimate = estimate,
    na_rm = na_rm,
    case_weights = case_weights,
    cls = "numeric"
  )
}


irr <- function(data, ...) {
  UseMethod("irr")
}

irr <- new_numeric_metric(
  irr,
  direction = "maximize"
)

irr.data.frame <- function(data,
                           truth,
                           estimate,
                           na_rm = TRUE,
                           case_weights = NULL,
                           ...) {
  metric_summarizer(
    metric_nm = "irr",
    metric_fn = irr_vec,
    data = data,
    truth = !!enquo(truth),
    estimate = !!enquo(estimate),
    na_rm = na_rm,
    case_weights = !!enquo(case_weights)
  )
}
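
For reference, this is roughly how I plug the metric into tuning (the workflow and resamples objects here are placeholders); because cross-validation averages the metric over folds, the mean of log(irr) acts as the log of a geometric mean:

library(tune)

tuned <- tune_grid(
  stock_workflow,                  # placeholder workflow with tunable hyperparameters
  resamples = time_series_folds,   # placeholder rsample object (3 yearly folds)
  grid      = 20,
  metrics   = metric_set(irr)
)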

In the custom metric code, valid_years_splited is the result of organizing the time-series cross-validation into 3 folds, each spanning one year. It is also made available as a global variable (via <<-) so the metric can reach it. The result has three rows, each holding one year's worth of monthly stock trading data for all sectors; this is how I calculate the metric per fold.
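
Roughly, it looks like this (a sketch; validation_data and the year column are placeholders for how I actually pull the assessment sets out of my rolling splits):

library(dplyr)

# validation_data: the monthly validation rows for the three validation years,
# one row per ticker per month (placeholder name); group_nest() gives one row
# per year with that year's rows in the `data` list-column
valid_years_splited <- validation_data |>
  group_nest(year)
# kept in the global environment so that irr_impl() can reach it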

However, I realize that this is not a perfect solution.

SHo-JANG · Apr 16 '23

Hello @SHo-JANG 👋

I like the idea of what you are trying to do. Would you be able to show some example input data and what you would want the output to look like? I wanna make sure I completely understand what you are trying to accomplish before giving feedback 😄

EmilHvitfeldt · Apr 18 '23