yardstick
Group_by calculate metric
Here's what I want to do specifically. Say I have monthly trading data for every ticker in the stock market. I want to rank all stocks by predicted return within each year-month, then calculate a statistic over only the top 10% of stocks by predicted return in each month. The specific metric is open; for example, it could be the RMSE of the top 10%. The goal is to tune the hyperparameters so that the actual returns of the stocks in the predicted top 10% are higher.
In other words, I want to find hyperparameters that tend to get the top 10% right, even if they get the bottom 90% wrong, rather than hyperparameters that fit the whole universe well.
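To make the target concrete, here is a minimal sketch of the per-month top-decile calculation on a plain data frame, outside of any yardstick machinery. The toy data, the `predicted` column, and the `top_decile` name are made up for illustration; only `yearmonth` and `return_pct` match the names used in the code further down.

```r
library(dplyr)

# Toy data (hypothetical): 2 months x 20 tickers, returns in percent.
set.seed(1)
toy <- tibble(
  yearmonth  = rep(c("2020-01", "2020-02"), each = 20),
  ticker     = rep(sprintf("T%02d", 1:20), times = 2),
  predicted  = rnorm(40, mean = 1, sd = 5),
  return_pct = rnorm(40, mean = 1, sd = 5)
)

top_decile <- toy |>
  group_by(yearmonth) |>
  # Rank within each month; tile 10 holds the highest predicted returns.
  mutate(tile = ntile(predicted, 10)) |>
  filter(tile == 10) |>
  # One row per month, computed only over the predicted top 10%.
  summarise(
    n_stocks        = n(),
    mean_return_pct = mean(return_pct),
    rmse            = sqrt(mean((return_pct - predicted)^2))
  )

top_decile
```

The open question is how to get this grouping step inside a metric function that only receives `truth` and `estimate` vectors.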
I tried to define a custom metric, but I ran into its limits. I wanted to put group_by(yearmonth) into the process, but I didn't know how to do it, so I wrote some makeshift code.
# customize_metric --------------------------------------------------------
# irr = return of a portfolio made up of the stocks in the predicted top 10%, calculated monthly.
# return_pct = cumulative return over the month, in percent, e.g. the cumulative return from 5/1 to 5/31.
irr_vec <- function(truth,
                    estimate,
                    n_tiles = 10,
                    purpose_tile = 10, # 10 = predicted top 10%, 1 = predicted bottom 10%
                    na_rm = TRUE,
                    case_weights = NULL,
                    ...) {
  irr_impl <- function(truth, estimate, ..., case_weights = NULL) {
    # Find the fold whose row count matches the incoming vector.
    # If two folds have the same number of rows, this lookup is ambiguous
    # and the code is unusable.
    fold_index <- NULL
    for (i in seq_len(nrow(valid_years_splited))) {
      if (length(truth) == nrow(valid_years_splited$data[[i]])) {
        fold_index <- i
      }
    }
    irr <- valid_years_splited$data[[fold_index]] |>
      mutate(estimate = estimate) |>
      group_by(yearmonth) |>
      mutate(top_n_pct = ntile(estimate, n_tiles)) |>
      filter(top_n_pct == purpose_tile) |>
      # mean_y := monthly return of the portfolio made up of the predicted top 10%
      summarise(mean_y = mean(return_pct)) |>
      # irr := 1-year cumulative return ratio of that portfolio
      mutate(irr = cumprod(mean_y / 100 + 1)) |>
      slice_tail(n = 1) |>
      pull(irr)
    # Cross-validation summarizes by mean by default, so returning log(irr)
    # turns that mean into the log of the geometric mean across folds.
    return(log(irr))
  }
  metric_vec_template(
    metric_impl = irr_impl,
    truth = truth,
    estimate = estimate,
    na_rm = na_rm,
    case_weights = case_weights,
    cls = "numeric"
  )
}
irr <- function(data, ...) {
  UseMethod("irr")
}
irr <- new_numeric_metric(
  irr,
  direction = "maximize"
)
irr.data.frame <- function(data,
                           truth,
                           estimate,
                           na_rm = TRUE,
                           case_weights = NULL,
                           ...) {
  metric_summarizer(
    metric_nm = "irr",
    metric_fn = irr_vec,
    data = data,
    truth = !!enquo(truth),
    estimate = !!enquo(estimate),
    na_rm = na_rm,
    case_weights = !!enquo(case_weights)
  )
}
In the custom metric code, valid_years_splited is the result of splitting the data by time-series cross-validation into 3 folds of one year each. It is also defined as a global variable via <<-.
This results in three rows, each containing one year's worth of monthly stock trading data for all sectors. This is what I did to calculate the metric per fold.
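The reason the metric returns log(irr) rather than irr can be checked with a few lines of base R: averaging the logs across folds and exponentiating gives the geometric mean of the per-fold cumulative return ratios. The three values below are made-up numbers for illustration, not results from my data.

```r
# Hypothetical per-fold cumulative return ratios (made-up values).
irr_by_fold <- c(1.20, 0.90, 1.10)

# What the default cross-validation summary does with the metric: a plain mean.
mean_log <- mean(log(irr_by_fold))

# Geometric mean computed directly, for comparison.
geo_mean <- prod(irr_by_fold)^(1 / length(irr_by_fold))

# exp(mean of logs) equals the geometric mean, up to floating-point error.
all.equal(exp(mean_log), geo_mean)
```

So the averaged metric is on a log scale and needs exp() applied to read it as a typical one-year return ratio.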
However, I realize that this is not a perfect solution.
Hello @SHo-JANG 👋
I like the idea of what you are trying to do. Would you be able to show some example input data and what you would want the output to look like? I wanna make sure I completely understand what you are trying to accomplish before giving feedback 😄