
Help needed for reinforcement learning problem

Open dsimba opened this issue 5 years ago • 17 comments

Hi there,

I would like to use EOgmaNeo and apply it to a reinforcement learning problem. I have inputs as a vector at each time step, but I don't know how to pass these inputs to the learning agent, i.e. EOgmaNeo, and get the action back.

Any help is appreciated.

Thanks

dsimba avatar Aug 05 '18 11:08 dsimba

Hi!

RL is still an experimental feature, so I would recommend getting the latest from the "Pre-Release" repository over here: https://github.com/222464/EOgmaNeo

I have prepared an OpenAI Gym Cart-Pole demo. This should be helpful for figuring out how to convert to a columnar SDR. It does it manually by simply rescaling the input observations into the integer range of the input columns.

Quick rundown of the input format: The user defines how many input layers they want, and each input layer is a 2D grid of columns. Each column is a one-hot vector, so it is represented by a single integer index. This makes the tensor of the shape numInputLayers X width X height, where each element is an integer representing the index into a column.

How you transform into this representation is up to you. Preferably you will want to make it locally sensitive and preserve as much information as possible, but sometimes this isn't necessary and it will work well regardless. One of the simplest encodings is to simply take e.g. a scalar and rescale it into the discrete integer range [0, inputColumnSize). This is usually an inefficient encoding, but it also works quite well and is simple to implement. This is what the Cart-Pole demo uses.
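
For reference, that rescaling boils down to something like this (the function name, bounds and column size here are just placeholders, it is the same formula the demo below uses inline):

# Sketch: clamp a scalar to [low, high] and rescale it to a column index in [0, columnSize).
# encodeScalar is a hypothetical helper; the bounds and column size are placeholders.
def encodeScalar(value, low, high, columnSize):
    value = min(high, max(low, value))                                  # Clamp to the expected range
    return int((value - low) / (high - low) * (columnSize - 1) + 0.5)   # Round to the nearest bin

# e.g. encodeScalar(0.1, -0.25, 0.25, 32) -> 22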

For tasks with image input, you will probably want to use one of the pre-made pre-encoders, like the ImageEncoder. It is also possible to encode completely arbitrary data with the KMeansEncoder, but it often requires pre-training since it can forget.

Actions are treated the same way inputs are: you feed in the last action taken as an input each timestep. You then retrieve the next action as a prediction, using getPredictions(...).

Here is the Cart-Pole example:

# ----------------------------------------------------------------------------
#  EOgmaNeo
#  Copyright(c) 2017-2018 Ogma Intelligent Systems Corp. All rights reserved.
#
#  This copy of EOgmaNeo is licensed to you under the terms described
#  in the EOGMANEO_LICENSE.md file included in this distribution.
# ----------------------------------------------------------------------------

# -*- coding: utf-8 -*-

import eogmaneo
import numpy as np
import matplotlib.pyplot as plt
import gym
from gym import wrappers
from copy import copy
from copy import deepcopy

env = gym.make('CartPole-v0')

env = wrappers.Monitor(env, '/tmp/cartpole-experiment-1', force=True)

# Create hierarchy
cs = eogmaneo.ComputeSystem(4)

# Layer descriptors
lds = []

for l in range(6): # Layers
    ld = eogmaneo.LayerDesc()

    # Set some parameters
    ld._width = 5
    ld._height = 5
    ld._columnSize = 32
    ld._forwardRadius = ld._backwardRadius = 2
    ld._temporalHorizon = 2
    ld._ticksPerUpdate = 2

    lds.append(ld)

h = eogmaneo.Hierarchy()

inputRes = 32 # Number of bins to rescale inputs into. This is the resolution of the input columns (input column size)

# 4 input scalars, 1 binary action
h.create([ (2, 2), (1, 1) ], [ inputRes, 2 ], [ False, True ], lds, 123)

# Set parameters
for i in range(len(lds)):
    l = h.getLayer(i)
    l._alpha = 0.01
    l._beta = 0.01
    l._gamma = 0.9
    l._maxReplaySamples = 32 # Equivalent to a finite length eligibility trace
    l._codeIters = 4
    
# Bounds for rescaling inputs into discrete columns
obsMin = [ -0.25, -0.25, -0.25, -0.25 ]
obsMax = [ 0.25, 0.25, 0.25, 0.25 ]

useReward = 0.0

action = 0

for ep in range(5000):
    observation = env.reset()

    for t in range(1000):
        #env.render()

        # Rescale and discretize inputs into separate IO layers
        obsSDR = []

        for i in range(4):
            obsSDR.append(int((min(obsMax[i], max(obsMin[i], observation[i])) - obsMin[i]) / (obsMax[i] - obsMin[i]) * (inputRes - 1) + 0.5)) # Rescale from the [obsMin, obsMax] range to the discrete range [0, inputRes)

        h.step(cs, [ obsSDR, [ action ] ], useReward, True)

        # Retrieve action (2nd IO layer)
        action = h.getPredictions(1)[0] # First and only entry

        # Exploration
        if np.random.rand() < 0.01: # Epsilon-greedy
            action = np.random.randint(0, 2)

        observation, reward, done, info = env.step(action)

        # Reward is simply a negative one for when the episode ends (task is to maximize episode length, which balances the pole)
        useReward = -1.0 * float(done)

        if done:
            print("Episode {} finished after {} timesteps.".format(ep, t + 1))
            
            break

222464 avatar Aug 05 '18 13:08 222464

Thanks a lot for this explanation and example with cartpole problem. I think I got the idea of what I need to do from here.

But just to confirm: the action will always be retrieved as below, even if my problem has different input dimensions, input layers, and number of actions, correct?

action = h.getPredictions(1)[0]

Thanks

dsimba avatar Aug 06 '18 09:08 dsimba

No, the index in the getPredictions(index) call defines which input layer is being retrieved. You can treat any input layer as an action layer - so if you have, say, 3 input layers for an RGB image, and then 1 additional input layer for an action, there are 4 input layers in total. You would retrieve the action with getPredictions(3) in this case (may differ depending on the order of the layers you choose).
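
For example, that layout would be created and used along these lines (a rough sketch only - the grid sizes, column sizes, numActions and the pre-encoded rSDR/gSDR/bSDR/reward names are placeholders):

# Sketch: 3 input layers for pre-encoded R, G, B channels plus 1 action layer.
h = eogmaneo.Hierarchy()

h.create([ (16, 16), (16, 16), (16, 16), (1, 1) ], # Layer dimensions (R, G, B, action)
         [ 32, 32, 32, numActions ],               # Column sizes per layer
         [ False, False, False, True ],            # Only the action layer needs predictions
         lds, 123)

# Each timestep: feed in the observations along with the last action taken...
h.step(cs, [ rSDR, gSDR, bSDR, [ action ] ], reward, True)

# ...then read the next action off the 4th input layer (index 3)
action = h.getPredictions(3)[0]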

It can be thought of this way: every input column also has an action column (aka prediction) as long as the predict flag is set to True (the [ False, True ] list in h.create). If you wanted to, you could just set all of them to being predicted (all True), and then just ignore the actions in the layers you don't care about. The predict flag is just for optimization. Actions are encoded and decoded in the same way as regular inputs. So, if you want a continuous action, you need to encode and decode the columns as scalars (as we did for the observation in the Cart-Pole example above). In the Cart-Pole example, since the action is discrete, we simply make the action input layer a single column where the one-hot column represents the action we wish to take.

Oh, and also: if you treat something as an action input layer, make sure you pass in the last action taken to step(...). You cannot just read actions with getPredictions without telling it which action was actually taken previously in the step(...) call. This is done so that you can manually apply any sort of exploration that you want. So really, the only thing that sets an "action" input layer apart from an "observation" input layer is that in the action layer you give it the last action taken as input, while in an observation layer you give it the sensory input.

I know this is probably confusing. I think we could make some better diagrams to explain this!

222464 avatar Aug 06 '18 13:08 222464

OK, this again helps a lot to settle the dust in my mind :-). I initially tried to read the published paper by Ogma about this. Would you mind sharing a bit more explanation about the relation between LayerDescs, input layers and the exponential memory? Is there some rule of thumb relating the number of timesteps I want the agent to remember to the number of LayerDescs?

Thanks

dsimba avatar Aug 06 '18 14:08 dsimba

Sure. The paper is a bit old now, it didn't have exponential memory in it yet if I recall.

An input layer is the IO mechanism of the hierarchy. These input layers can be viewed as multiple parallel input/output channels, kind of like in a convnet. These are the layers the user directly interacts with.

LayerDescs are descriptions of "higher" layers - the layers the user doesn't directly interact with. These are kind of like "hidden" layers in a regular deep net, and LayerDescs describe these with e.g. dimensions, connective radii, clock information.

Each LayerDesc has two timing-based parameters in it: The temporalHorizon and the ticksPerUpdate. ticksPerUpdate is basically the exponential memory slowdown multiplier - it is typically set to 2 to give an exponential 2^N timesteps of memory (where N is the number of higher layers). temporalHorizon is the window each layer has on the layer(s) below, and must be at least ticksPerUpdate, but can be more (although generally it is also set to 2).

temporalHorizon and ticksPerUpdate both default to 2, so if you want e.g. 32 timesteps of memory, using these default settings you need 5 layers. This means there are also 5 LayerDescs (since there is one for each higher layer). More cannot hurt, due to the bidirectional nature of the hierarchy. Of course, since it is exponential, you generally will not need more than 12 layers or so.
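
In code, that just means the length of the lds list. A sketch for roughly 32 timesteps of memory with the default timing parameters (the layer dimensions are placeholders, copied from the Cart-Pole example above):

# Sketch: with _ticksPerUpdate = _temporalHorizon = 2, memory grows roughly as 2^N,
# so 5 higher layers (5 LayerDescs) give about 2^5 = 32 timesteps of memory.
lds = []

for l in range(5): # 5 LayerDescs = 5 higher layers -> ~32 timesteps
    ld = eogmaneo.LayerDesc()

    ld._width = 5            # Placeholder dimensions
    ld._height = 5
    ld._columnSize = 32
    ld._forwardRadius = ld._backwardRadius = 2
    ld._temporalHorizon = 2  # Window on the layer(s) below
    ld._ticksPerUpdate = 2   # Exponential memory slowdown multiplier

    lds.append(ld)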

222464 avatar Aug 06 '18 15:08 222464

Ok, thanks for this extra clarification. I will try this out.

dsimba avatar Aug 06 '18 23:08 dsimba

Hi,

I tried by just adapting the example above you provided with cartpole.

The challenge I am facing now is that the learning agent, though it learns the environment as long as the behavior does not change drastically, fails to adapt to sudden long-term changes.

Are there some hyperparameters (alpha? gamma? beta? codeIters? etc.) I should tweak to see if adaptation to changes in the environment state is faster?

Thanks

dsimba avatar Aug 18 '18 09:08 dsimba

I would indeed start by trying some different hyperparameters.

alpha - feed forward learning rate (encoder, typically between 0.001 and 0.5)
beta - Q learning rate (decoder, typically between 0.001 and 0.5)
gamma - Q discount factor (decoder, typically between 0.9 and 0.999)
codeIters - number of sparse coding solver iterations (more = slower but better result, typically between 2 and 8)

A higher learning rate (for both alpha and beta) will generally cause it to adapt faster. A lower gamma will generally be more stable, but will cause the layer to discount future rewards more (more layers can counter this as well).
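
For example, to push it toward faster adaptation you could raise the learning rates on every layer, same as the parameter-setting loop in the Cart-Pole example (the specific values below are just a starting point to experiment with, not a recommendation):

# Sketch: higher alpha/beta for faster adaptation, slightly lower gamma for stability.
for i in range(len(lds)):
    l = h.getLayer(i)
    l._alpha = 0.1    # Feed forward (encoder) learning rate
    l._beta = 0.1     # Q (decoder) learning rate
    l._gamma = 0.95   # Q discount factor
    l._codeIters = 4  # Sparse coding solver iterations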

Can you share some details on the new environment?

222464 avatar Aug 18 '18 14:08 222464

Sure. I am doing experimental work on an early warning system. Actually, it is the same environment. I am training the agent to recognize defective situations based on sensory data (monitoring of a solar power plant), but just looking at the performance of the agent over the whole dataset, I saw that it is slow to adapt to defective situations.

dsimba avatar Aug 20 '18 06:08 dsimba

OK, interesting! Let me know if you are able to share the environment, that way I could help tune it.

I just pushed a new version that seems to perform much better on most tasks (it is pushed to 222464/EOgmaNeo, the pre-release repository). I changed the weights to only decrease as in ART, which reduces forgetting to basically none at all, but can cause it to occasionally have difficulty separating similar inputs into different SDRs. Now this would sound like it would have even more trouble adapting to new situations, but I don't think this is the case. Only experimentation will tell though!

222464 avatar Aug 20 '18 15:08 222464

Hi,

I am able to share some data with you about the environment. It will consist of a timestamp id (an ascending list of ints) associated with an int in [0, 32) that is the result of the encoding I chose. How may I share that with you?

Thanks

dsimba avatar Sep 16 '18 15:09 dsimba

Hello,

You can send it by email if you like: [email protected]. If it's too large, you can put it in Dropbox/Google Drive etc. and link it in the email.

222464 avatar Sep 16 '18 17:09 222464

Thank you. I just sent you an email where the subject is the title of this github issue.

Also, I have one more question. In my case the sample has about 20k points, so I tried training over many epochs (each epoch uses the entire dataset). But one thing I could not find was a reset function which would clear "the data context" before starting a new epoch. Is that something that makes sense to you with regard to your library?

dsimba avatar Sep 16 '18 18:09 dsimba

Thanks, I have received the data.

I am not sure how exactly reinforcement learning applies to this; it seems more like a case for anomaly detection, which can be done with the regular master branch by learning to predict the data and flagging an anomaly if the difference between the prediction and the actual data becomes too large.
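
Roughly something like this (a sketch only - encode(...) stands for whatever encoding you chose, the 0.5 threshold is a placeholder, and the step(...) call uses the same signature as the example above):

# Sketch of prediction-based anomaly detection.
predictedSDR = h.getPredictions(0)      # What the hierarchy expected to see next

actualSDR = encode(newData)             # Encode the newly arrived data (hypothetical helper)

# Fraction of columns where the prediction missed
mismatch = sum(1 for p, a in zip(predictedSDR, actualSDR) if p != a) / len(actualSDR)

if mismatch > 0.5:                      # Threshold chosen by experimentation
    print("Possible anomaly")

h.step(cs, [ actualSDR ], 0.0, True)    # Then step with the actual data as usual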

In regards to your question, we can definitely add a "Zero-context" feature. I'll add it!

222464 avatar Sep 16 '18 18:09 222464

Thanks. My intent in applying reinforcement learning here is to provide feedback/reward in cases where the predicted anomaly turns out to be a true positive.

Anomaly detection is not enough in my use case, as we would like to rely on the prediction to start maintenance operations. Hope that makes sense as an explanation.

dsimba avatar Sep 16 '18 18:09 dsimba

OK, I don't know the details so I'll take your word for it :)

I just added an (untested) zeroContext function "h.zeroContext()" in 222464/EOgmaNeo:RL - let me know if it works!
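
Usage would look something like this (a sketch only, since the function is untested; numEpochs, dataset and reward are placeholders, with the hierarchy set up as before and each value an int in [0, 32) as you described):

# Sketch: clear the hierarchy's temporal context between training epochs.
for epoch in range(numEpochs):
    h.zeroContext()  # Start each epoch with no memory of the previous pass

    for value in dataset:
        h.step(cs, [ [ value ] ], reward, True)  # reward as defined by your task
        prediction = h.getPredictions(0)[0]      # Next-step prediction for that layer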

222464 avatar Sep 16 '18 18:09 222464

Ok thanks I will check that out.

About my final goal: we have a support team that needs to go to remote areas to investigate possible issues, so I am looking into cutting down the time delay between a defect and the arrival of the support team. The more time an intervention takes, the bigger the penalty for us. So the online learning ability of this framework + RL makes it very appealing as a solution to me.

dsimba avatar Sep 16 '18 18:09 dsimba