
Results discrepancies

Open mg64ve opened this issue 4 years ago • 17 comments

Thanks @Kostis-S-Z for your clarification, it makes sense. I still have two doubts/questions.

  1. In my experiments I have noticed a discrepancy between rewards and PnL: while the model was getting a positive reward, the PnL was in fact negative or decreasing. From your paper I can see that the closer s(t) is to the price, the smaller the difference between PnL and reward, with an error e(t) = abs(s(t) - p(t)). Could you give us some clarification on why we are getting this discrepancy between rewards and PnL?
  2. In your paper you do not consider walk-forward analysis. You took 12000 values from the past and trained the model on that data, then used 2000 samples for testing. 2000 samples at a 4H timeframe is almost 2 years without retraining. Do you think that with walk-forward retraining you would get better results?
  3. I would add one more point: from a mathematical point of view, what is the relationship between reward and PnL in terms of p(t) and s(t)?

I would really appreciate your feedback on these points. Thanks.
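
For concreteness, this is roughly the comparison I have in mind, in my own notation rather than the paper's exact definitions (just a sketch):

    import numpy as np

    def pnl(actions, prices):
        # cumulative profit of acting a(t) on the move p(t) -> p(t+1), with actions in {-1, 0, 1}
        actions = np.asarray(actions, dtype=float)
        prices = np.asarray(prices, dtype=float)
        return np.sum(actions[:-1] * np.diff(prices))

    def tracking_error(agent_values, prices):
        # e(t) = |s(t) - p(t)|, the quantity the reward appears to depend on
        return np.abs(np.asarray(agent_values) - np.asarray(prices))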

mg64ve avatar Dec 31 '19 10:12 mg64ve

Hello @mg64ve ,

  1. I cannot be certain why the PnL in your results is largely negative, since we might have different implementations of how it is calculated. Are you sure your PnL calculation code is bug-free? Have you tested it on different trade actions? If it is correct, maybe the agent does learn a strategy, but one that is not optimal for profit. Remember that the agent learns to optimize the reward function, and the reward function does not contain the PnL. So you might need to experiment a bit with the variables turn, margin, cost, etc. in order to find a strategy / policy that suits your data. Also note that when calculating the PnL you should take into consideration this note from the paper:

Finally, note that the agent never exits the market, i.e., the stay action corresponds to maintaining the latest open position (either long or short), while simply avoiding updating its internal price trend.

  2. We experimented a lot with different sample rates (how frequently the exchange rate is sampled) and decided to use 4h over long periods of time (years). We chose this because we wanted to show that the agent could learn basic, long-lasting market trends without overfitting to a specific short time period (e.g. analyzing rates in minutes, or even hourly, in 2019 might show completely different market trends than in 2009). Of course there are many more experiments you can try, and you will get different results; high-frequency trading, for example, probably needs to be tackled with a different approach altogether. In conclusion, due to the uncertainty of both the deep learning algorithms and the financial market, it is difficult to speculate whether a method would work without trying it first. I am sorry if this vague and anticlimactic answer is not what you expected, but I do not want to encourage / discourage you on something I am fairly uncertain about :)

Kostis-S-Z avatar Dec 31 '19 15:12 Kostis-S-Z

Thanks @Kostis-S-Z. Unfortunately I am on vacation and I don't have my code with me, but I will experiment more when I am back. Basically, your paper is a demonstration of how RL can learn a policy? This policy is essentially the trend of a financial time series. But if the algorithm can really follow the trend, it should also make a profit and the PnL should be positive, unless it lags behind the trend change points. It seems to have some lag in your graphs, but this should not be so relevant. I need more time to look at it. Could you please keep this question open?

mg64ve avatar Dec 31 '19 16:12 mg64ve

Your comment doesn't make sense. If you change positions from a long to a short, or from a short to a long, you have to exit the market to close your position; then you'd open the opposing position. This is when your PnL would be calculated.

Finally, note that the agent never exits the market, i.e., the stay action corresponds to maintaining the latest open position (either long or short), while simply avoiding updating its internal price trend.

Can you share your PnL calculation code? I wouldn't expect this to be proprietary, since all you're doing is calculating a PnL on a long-to-short or short-to-long position change.

personal-coding avatar Dec 31 '19 22:12 personal-coding

@mg64ve I just saw that you edited your question to add one more point, which I haven't addressed; sorry I missed that. I can get back to you on it when I find some time to go over the project again.

@ScrapeWithYuri Unfortunately this part of the code is indeed proprietary and I am not allowed to publish it; this is due to a cooperation with a private company while working on this project.

Kostis-S-Z avatar Jan 07 '20 19:01 Kostis-S-Z

@Kostis-S-Z I believe you are busy with work again, since I see you have not had time to reply. No worries. I am also back from vacation and have had some time to check the code and run it again. I am still very skeptical: this might contain some leakage, since the results are too good to be true, and when results are too good I become suspicious. Financial markets are very noisy and difficult, and it is not easy to get good results. The following is a screenshot of my profit curve. There is a discrepancy: it is much better than what you published, and I don't think it is correct. This result is from training with cost.

[image: profit curve from my run]

and the following is an example of how it holds positions, not bad at all:

[image: example of the positions it holds]

My PnL calculation function is the following:

    def pnl_of_trades(self, env_type, actions, values, slippage=0.0):
        """
        Calculate the PnL based on the trades taken.
        """
        # price change per step, padded with 0.0 at the start to keep the same length as values
        prices_diff = np.concatenate([[0.0], np.diff(values)])
        # cumulative PnL: each action is multiplied by that step's price change
        pnl = np.cumsum(actions * prices_diff)
        plt.plot_profit(self.folder, pnl, values, actions)
        return pnl[-1]

I wonder what you think about it. Do you see any mistake, @ScrapeWithYuri, @stevexxs? Please drop me your comments. Thanks.

mg64ve avatar Jan 09 '20 10:01 mg64ve

Can you send the plot_profit code?

personal-coding avatar Jan 11 '20 17:01 personal-coding

@ScrapeWithYuri you can find my code here:

https://github.com/q-learning/trading-rl

The dataset contains EURUSD OHLC + Volume for the years 2007-2015. Please let me know what you think and help me find the bugs in my code. Thanks.

mg64ve avatar Jan 12 '20 11:01 mg64ve

@Kostis-S-Z @ScrapeWithYuri

I found a bug in my code: it was still incrementing the position after calculating the reward. I have changed that, but the profit is still too good. This is something we can't trust:

[image: profit curve after the bug fix, still too good]

However, I have also added another profit calculation in the get_reward function. This is what my get_reward function looks like:

    def get_reward(self):
        """
        The reward function of the agent. Based on its action, calculate a PnL
        and a fee, and normalize the reward to a proper range.
        """
        c_val = self.data[self.position]        # current price
        pr_val = self.data[self.position - 1]   # previous price
        up_margin = c_val + self.margin         # upper bound of the margin band
        down_margin = c_val - self.margin       # lower bound of the margin band
        profit = 0

        # Because it is almost impossible to hit the exact price, use an acceptable slack
        if np.abs(c_val - self.value) < 0.00001:
            reward = 1
        elif self.value <= c_val:
            # agent value below the price: 1 at the price, 0 at the lower margin, negative below it
            reward = (self.value - down_margin) / (c_val - down_margin)
        else:
            # agent value above the price: 1 at the price, 0 at the upper margin, negative above it
            reward = (self.value - up_margin) / (c_val - up_margin)

        # profit of holding the chosen position over the last step
        change_position_cost = 0.0
        if self.action == BUY:
            profit = c_val - pr_val
        elif self.action == SELL:
            profit = pr_val - c_val

        # apply the transaction cost when the position changes
        if self.ce:
            if self.action != self.prev_action:
                profit -= change_position_cost
                reward = reward - np.abs(reward * self.cost)
        else:
            if (self.action == BUY or self.action == SELL) and (self.action != self.prev_fin_pos):
                profit -= change_position_cost
                reward = reward - np.abs(reward * self.cost)

        if self.dp:
            if ((self.prev_action == BUY) and (self.action == SELL)) or ((self.prev_action == SELL) and (self.action == BUY)):
                profit -= change_position_cost
                reward = reward - np.abs(reward * self.cost)

        self.trade(c_val)
        self.prev_action = self.action

        return reward, profit
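
As a quick sanity check of the lower-margin branch above, with made-up numbers (not taken from an actual run):

    # margin = 0.01, price c_val = 1.1000, agent value = 1.0950
    margin = 0.01
    c_val, value = 1.1000, 1.0950
    down_margin = c_val - margin                              # 1.0900
    reward = (value - down_margin) / (c_val - down_margin)    # 0.5
    # a value at 1.0900 would score 0, and anything below the lower margin goes negative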

I am plotting this profit as well, and there is a big discrepancy with the vectorized profit calculation:

[image: cumulative profit from get_reward vs the vectorized calculation]

I tend to think the second one is more realistic, but it is not good at all. And the big discrepancy remains: the rewards are positive while the PnL is not. Please let me know what you think about it. Thanks.

mg64ve avatar Jan 14 '20 10:01 mg64ve

@mg64ve is your second profit only calculating the change between the current position and the day before?

For instance, shouldn't pr_val be the value when you made the last buy / sell short decision, rather than the prior day?

        c_val = self.data[self.position]
        pr_val = self.data[self.position - 1]
        if self.action == BUY:
            profit = c_val - pr_val
        elif self.action == SELL:
            profit = pr_val - c_val

Based on the way I read this, the change is between the current position and the prior day. This would be negative, since your FX dataset is declining in value during the test period.

personal-coding avatar Jan 14 '20 16:01 personal-coding

Very good point, @ScrapeWithYuri.

In general we can say that if a position is held for T periods, then the profit is:

profit = price(t+T) - price(t)

which is the same as the telescoping sum:

profit = (price(t+T) - price(t+T-1)) + (price(t+T-1) - price(t+T-2)) + ... + (price(t+1) - price(t))

but this only works if we have just the BUY and SELL actions. In this case we also have the NEUTRAL=0 action, which I haven't considered yet. Let me change my code and I will let you know. Cheers.
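
For reference, a quick numerical check of the telescoping identity above, with made-up prices (a sketch, not part of the repo):

    import numpy as np

    prices = np.array([1.10, 1.12, 1.09, 1.15])   # hypothetical price path, held for T = 3 steps
    hold_total = prices[-1] - prices[0]           # price(t+T) - price(t)
    step_sum = np.sum(np.diff(prices))            # sum of the per-step differences
    assert np.isclose(hold_total, step_sum)       # both equal 0.05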

mg64ve avatar Jan 14 '20 17:01 mg64ve

@mg64ve It looks like your pnl_of_trades has similar logic. The code looks at the price difference between time t and t-1 and computes the return based on the action for that period. However, @Kostis-S-Z stated in issue #6 that the calculation should be for time t+1: the bot decides what action to take based on data as of time t, and that decision applies to time t+1.

personal-coding avatar Jan 15 '20 00:01 personal-coding

Good point, @ScrapeWithYuri. When you difference a time series, let's say:

x[0], x[1], x[2], .... , x[t]

and you get:

d[0], d[1], d[2], ..., d[t-1], where d[i] = x[i+1] - x[i]

You need to take into account that, at time t=0, if your action is a[0] (considering only SELL or BUY), then your profit needs to be calculated as:

a[0] * d[0]

so my calculation is wrong for 2 reasons:

prices_diff = np.concatenate([[0.0], np.diff(values)])
pnl = np.cumsum([actions * prices_diff])

The first reason is that it should instead be:

prices_diff = np.concatenate([np.diff(values), [0.0]])
pnl = np.cumsum([actions * prices_diff])

The [0.0] is needed to keep the same dimension and goes at the end. The second reason is that I am not taking into account the NEUTRAL=0 action: every 0 action should be replaced with 1 or -1 (the last open position), since the agent never exits the market. Another consideration is that with my wrong calculation I was effectively shifting the actions one step into the future, giving this algo extra capacity to predict the future. That means that if we don't get good results anymore, it could be because there is a lag in the prediction. Let me change the code, roughly as sketched below, and try it again.
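
A minimal sketch of both fixes, assuming actions are coded as 1 = BUY, -1 = SELL, 0 = NEUTRAL (the function and variable names here are just illustrative, not the repo's):

    import numpy as np

    def pnl_forward(actions, values):
        """
        Sketch: the action taken at time t is paid with the move from t to t+1,
        and NEUTRAL (0) keeps the previous open position.
        """
        held = np.asarray(actions, dtype=float).copy()
        for i in range(1, len(held)):
            if held[i] == 0:                 # stay -> keep the last open position
                held[i] = held[i - 1]
        # forward-looking price change: values[t+1] - values[t], padded with 0.0 at the end
        prices_diff = np.concatenate([np.diff(values), [0.0]])
        return np.cumsum(held * prices_diff)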

mg64ve avatar Jan 15 '20 08:01 mg64ve

OK @ScrapeWithYuri, it is now fixed. The two calculations are now giving similar results:

[image: the two PnL calculations now give similar results]

The problem is that we are not getting good results. From what I can see, it seems the agent is still changing position too much:

[image: positions held, still changing too often]

and of course it has a lag, as I wrote earlier. I am going to try it now with a bigger cost to see what happens. Please have a look at my code and let me know what you think about it. Thanks.

mg64ve avatar Jan 15 '20 08:01 mg64ve

Another interesting observation is the following. With respect to this part of the paper, we are considering a fixed value of alpha at the moment:

[image: excerpt from the paper describing the alpha parameter]

and this is the "turn" parameter:

        var_defaults = {
            "margin": 0.01,
            "turn": 0.001,
            "ce": False,
            "dp": False,
            "reset_margin": True,
        }

Now, if this is fixed, why do we see different slopes (gradients) in the following?

[image: plot of the agent's value against the exchange rate, showing different slopes in a few places]

The gradient is almost the same apart from a few cases. What do you think?

mg64ve avatar Jan 15 '20 08:01 mg64ve

@mg64ve Take a look at this file. I've added three PnL scenarios. I think your result only looked good because it was looking backwards. I've added a second scenario (the program buys / sells short at time t+1 and closes the position at time t+2) and a third scenario (more in line with the paper). These additional scenarios do not have strong results. Let me know your thoughts.

PnL.xlsx

personal-coding avatar Jan 15 '20 17:01 personal-coding

Hello @ScrapeWithYuri and @mg64ve ,

I am very happy to see you sharing results and having fruitful discussions about it! I am sorry I haven't managed to respond as much as I wanted in the thread but as you correctly guessed, I am busy with work.

I just wanted to point out that the reason you sometimes see a different angle could be the parameter RESET_FROM_MARGIN. If this is true, then every time the agent is "out of the margin" its position will get reset to the next value. Quickly looking at your plot, it looks like this could be the case: at timestep 436 the agent's value seems to be "out of the margin" (further away than the 0.01 margin), so the action is ignored and its position is reset to a closer value.

A few words about this parameter: it was added later in development to help the agent get unstuck from "bad scenarios" (= the agent's value has diverged so far from the exchange rate that even correct actions might have negative rewards) during training. This can happen relatively often in the early stages of training, due to the high exploration rate and because the agent has yet to find a good policy. It can also be combated with shorter episodes / epochs. However, this parameter might add confusion if it kicks in very often, so be aware that it might not be helpful in every scenario.

[image: wrong trajectory]
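
In pseudocode the idea is roughly the following (a simplified sketch, not the exact implementation; the attribute name for the flag is only indicative):

    # if the agent's value has drifted outside the margin band around the current rate,
    # snap it back onto the next exchange-rate value instead of applying the action
    if self.reset_margin and np.abs(self.value - self.data[self.position]) > self.margin:
        self.value = self.data[self.position + 1]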

Kostis-S-Z avatar Jan 15 '20 18:01 Kostis-S-Z

@mg64ve I made the change below to your pnl_of_trades code (not the most elegant approach). It moves the trade returns forward by two time periods, the idea being that you close out at time t+2 a position you opened at time t+1.

    def pnl_of_trades(self, env_type, actions, values, slippage=0.0):
        """
        Calculate the PnL based on the trades taken.
        """
        # original calculation, kept for comparison:
        # prices_diff = np.concatenate([[0.0], np.diff(values)])
        # pnl = np.cumsum([actions * prices_diff])
        prices_diff = np.concatenate([[0.0], np.diff(values)])
        # shift the price changes forward by two periods (drop the first diff, pad with a
        # zero near the end), so the action at time t is paired with the move from t+1 to t+2
        prices_diff = np.delete(prices_diff, 0)
        prices_diff = np.insert(prices_diff, len(prices_diff) - 1, 0)
        prices_diff = np.delete(prices_diff, 0)
        prices_diff = np.insert(prices_diff, len(prices_diff) - 1, 0)
        pnl = np.cumsum(actions * prices_diff)
        plt.plot_profit(self.folder, pnl, np.cumsum(self.epoch_profit), values, actions)
        return pnl[-1]
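
For readability, a roughly equivalent way to express that two-period shift with slicing (a sketch; the handling of the last few elements differs slightly from the delete / insert version above, which keeps the final diff in place):

    diffs = np.diff(values)                # values[t+1] - values[t]
    prices_diff = np.zeros(len(values))
    prices_diff[:-2] = diffs[1:]           # action at t paired with values[t+2] - values[t+1]
    pnl = np.cumsum(actions * prices_diff)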

The resulting profit is positive, but not high enough to warrant a trading strategy, and well below the paper's results. Granted, the paper does not take this approach: it makes a trade at time t+1, then changes position only when the bot thinks there will be a shift in the trend. In any event, I wouldn't expect the outcome to be materially different.

[image: profit curve with the two-period-shift PnL calculation]

personal-coding avatar Jan 16 '20 01:01 personal-coding