Working with RL
Hi!
My goal is to build a multi-agent reinforcement learning model for a fluid simulation (https://github.com/ExtremeFLOW/neko) in Fortran. For this I have 3 state vectors (s1, s2, s3) and a relative-error-based reward function.
How should I dimension my state arrays for compatibility with the TorchFort subroutines? Currently my state vectors have dimension = number_of_agents. Also, is it compulsory to have batch_size as one of the dimensions, e.g. (number_of_agents, batch_size)?
To get the big picture, could you please let me know how to design this framework (initialization, training, inference, etc.)? I want to use off-policy methods like soft actor-critic for this task.
Hello Sachin,
thanks for reaching out. Can you explain a bit more about your problem? Are the state vectors all of equal shape, or do they have different shapes? Concerning the number of agents, what do you mean by dimension = number of agents? Do you have 3 different agents or more? And are those agents sharing their weights, or is this a case of MARL? Please describe your setup a bit so that it is easier for me to help you. Also, please tell me what the dimensions of the tensors are, for states, actions, etc.
Best Thorsten
Thanks for your reply @azrael417 !
Basically, I am trying to validate the work on wall-modelled large-eddy simulation from https://www.nature.com/articles/s41467-022-28957-7.
S1(i), S2(i), S3(i), A(i)
S1, S2, S3 are the state vectors and A is the control action.
They are all one-dimensional vectors of the same size, with i = 1, ..., n_nodes. Here, n_nodes refers to the total number of wall-adjacent mesh nodes (these are the sampling points for the state vectors and also the points where the control action needs to be applied). I am treating each of these n_nodes as an agent, hence a MARL system. These agents should all modify the same network (shared weights). For this I am planning to use the available soft actor-critic model.
Pseudo code:
do t = 1, timesteps
   do i = 1, n_nodes
      ! populate the state vectors S1(i), S2(i), S3(i)
   end do
   ! train the network using the state vectors S1, S2, S3 as inputs
   do i = 1, n_nodes
      ! use the control action A(i) received to set the boundary condition
   end do
end do
Every time step, the sampled state vectors have to be given as input to the soft actor-critic model; the action returned as the model output is then used to set the boundary condition for the next time step of the simulation.
Hope this gives an overview of the problem.
P.S.: The do loop over n_nodes is domain-decomposed for parallel processing.
Hello Sachin,
this sounds like you can concatenate the three vectors into one tensor and take them apart again on the model side. Basically, create a new array with dim (3, n_nodes) and then do a torch.split on it on the model side. This also means you can treat it as a 1D convolution, where 3 is the number of channels. Does that make sense?
AFAIR, in that work they use a loop over sites and apply an MLP to each, but that can be represented by a 1D convolution with stride=1 and kernel size=1 on a tensor of size (B, 3, n_nodes), where B is the batch size.
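On the model side, a minimal sketch of the split (assuming the stacked state arrives as a [B, 3, n_nodes] tensor):
import torch

def split_states(x):
    # x: stacked state of shape [B, 3, n_nodes]
    s1, s2, s3 = torch.split(x, 1, dim=1)  # each chunk has shape [B, 1, n_nodes]
    return s1, s2, s3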
Best Thorsten
Thanks for your reply. Just to check if I got this right...
- If using an MLP, the state vector has to be of dim (3, n_nodes), with a torch.split on the model side.
- If using conv1d, the state vector has to be of dim (B, 3, n_nodes).
Are there any benefits in modelling it as a 1D convolution as opposed to an MLP?
Right now I'm doing one time step at a time, and each timestep has n_nodes agents. Does the batch size refer to batching over time in this context?
Hello Sachin,
as far as I understand, the paper uses 3 input features, one from each flattened state vector, right? Then they apply an MLP to these 3 input features for each of the n_nodes. The weights of the MLP are shared between all agents (if every agent had different weights, the problem could not be scaled out canonically). This is the same as doing a 1D convolution on a tensor of size (B, 3, n_nodes). So instead of doing something like this:
for i in range(n_nodes): a[:, i] = MLPModel(input[:, :, i])
you can do a = Conv1DModel(input)
where in both cases a is of shape (B, n_nodes) and input is of shape (B, 3, n_nodes). It is the same, just that the latter is vectorized and likely much more efficient (although speed is usually not the issue here). This is because a multi-D convolution with kernel size 1, n_o output channels, and n_i input channels is the same as applying a matrix multiplication with a matrix M of shape (n_o, n_i) at each grid point. The rest of the layers you need to structure accordingly, but since the most relevant ones, such as activations, are element-wise, there is not much to consider there.
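To make the equivalence concrete, here is a small sketch (layer sizes are just examples, not your actual model):
import torch

B, n_nodes = 4, 16
x = torch.randn(B, 3, n_nodes)                    # input of shape [B, 3, n_nodes]

mlp = torch.nn.Linear(3, 1)                       # per-node MLP: 3 features -> 1 output
conv = torch.nn.Conv1d(3, 1, kernel_size=1)       # the same map written as a 1D convolution
conv.weight.data.copy_(mlp.weight.data.view(1, 3, 1))
conv.bias.data.copy_(mlp.bias.data)

a_loop = torch.stack([mlp(x[:, :, i]) for i in range(n_nodes)], dim=-1).squeeze(1)
a_conv = conv(x).squeeze(1)                       # both results have shape [B, n_nodes]
print(torch.allclose(a_loop, a_conv, atol=1e-6))  # True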
One thing which is currently not easy to do, and I need to think about how to expose it to Fortran, is giving a different reward to each agent. For the moment, just give a global reward to the rollout/replay buffer to get started. The C interface has the functionality to deal with multiple agents, where every agent gets a different reward, but it is not exposed to Fortran yet.
I will think about a good solution for that.
Best Thorsten
Thanks for the explanation.
As you've pointed out, they are giving a reward for every agent at location $(x,z)$. By global reward, do you mean something like a Mean Absolute Error (MAE) or Mean Squared Error (MSE) over the n_nodes?
Just for my reference, can you please point out the C-interface function for setting up an individual per-agent reward? Also, if the individual agent rewards get implemented for Fortran, I would like to be the beta tester!
Also I faced the following error during compilation:
Code: res = torchfort_rl_off_policy_predict(tf_key, state, action)
Error: There is no specific function for the generic ‘torchfort_rl_off_policy_predict’
While the following call worked, returning res = 0:
res = torchfort_rl_off_policy_create_system(tf_key, yaml_path, model_device, rb_device)
print *, "result of torchfort_rl_off_policy_create_system: ", res
I suspect it may be some issue with the action argument, which depends on the config.yaml / ML model that I haven't properly designed yet. Is there any error-logging mechanism in TorchFort which can be used to find the root cause of this error?
Yes, you need to use some averaged reward over all the agents for the moment.
Regarding the error, the interface is only implemented for a handful of dimension combinations, because it is not really feasible to implement every possibility on our end. What you can do is check those specializations here:
https://github.com/NVIDIA/TorchFort/blob/40ae3d1b3aaed543433da66b47464e40889b0eed/src/fsrc/torchfort_m.F90#L2858
and the following lines, and implement the one you need.
Hello Sachin,
I missed your last updates. Does everything work for you now? Also, we should support MARL via the multi-environment mechanism. In that case, you can specify n_env > 1, and then you need to pass the replay buffer tensors in a certain format.
it is this line https://github.com/NVIDIA/TorchFort/blob/40ae3d1b3aaed543433da66b47464e40889b0eed/src/csrc/rl/off_policy/td3.cpp#L410
vs. the one below. Basically, the first dim (or, in Fortran, the last dim) has to be of size n_envs. n_envs is a parameter you can specify for the replay buffer. If you don't specify it, it automatically assumes n_env = 1; otherwise you have to pass vectors for the reward and the terminal-state indicator to the RB, where the shape of those vectors is (n_env,) in Python notation.
Also, please note that all user-interfacing buffers have to be allocated by the user. TorchFort never allocates buffers and transfers ownership to the user. All buffers allocated by TorchFort are just for internal use. We decided to implement it that way so that there is no situation in which TorchFort fiddles with the user's buffers when it is not supposed to, nor does it create a memory leak by producing new buffers the user is supposed to free.
Best regards Thorsten
Hi Thorsten!
Thanks for your message. I will definitely take a closer look at multi-env (n_env > 1) for my MARL use case & also memory allocation for buffers.
Regarding a previous issue I had posted and then removed—it turned out to be caused by shape mismatches in the action array. I’ve now implemented predict_float_2d_2d in TorchFort & recompiled. I'm passing a state array of shape (3, n_nodes) and an action array of shape (1, n_nodes). With this setup, the forward pass works, and I’m now getting actions clipped in the range [0.9, 1.1].
I am now looking into setting up state_old, action_old, state_new, reward, the replay buffer & the terminal flags.
Hi,
I have set up the RL training loop. I can see the forward and backward passes happening and the loss functions being calculated. It works well for the serial case. However, in parallel CFD simulations (e.g., with MPI), the number of agents (n_envs) equals the number of near-wall mesh nodes (n_nodes), which varies dynamically based on the domain decomposition.
In the current setup, n_envs in the replay_buffer config must be set statically in the YAML file:
replay_buffer:
type: uniform
parameters:
max_size: <int>
min_size: <int>
n_envs: <int>
Is it possible to set n_envs directly from within the simulation environment instead of hardcoding it in the YAML file?
Thanks!
Hello Sachin,
good to hear it works for you, and sorry for my silence, but we had a deadline which is now over. It is generally possible to override that and pass it as an argument, but what I currently do is just something like the following. Define a config_template.yaml file with:
replay_buffer:
type: uniform
parameters:
max_size: <int>
min_size: <int>
n_envs: NUM_ENVS
and then
num_agents=$(( ${nx} * ${ny} * ${nz} ))   # those you can grep from the initial file, or however many agents you have
n_envs=$(( ${num_agents} / ${num_ranks} ))
sed "s|NUM_ENVS|${n_envs}|g" config_template.yaml > config.yaml
I do that for stuff like the batch size etc. The issue with exposing that as an additional argument to the constructor is that one then needs to implement a lot of accessor functions for all the arguments, and the interface can become quite cluttered. Please let me know if that helps you.
Best Thorsten
Hi @azrael417 !
Thanks for your answer. I will look into the method suggested.
On a side note, if I have 2 wall boundaries, both may have a different number of agents depending on the discretization and the load balancing. For now, I have found a configuration with an equal number of agents allocated to each of the walls and ranks.
However, my state, action, reward, etc., which are used to update the replay buffer, already have n_agents = n_nodes embedded in their shape:
state: (features, n_nodes)
action: (1, n_nodes)
reward: (1, n_nodes)
Why is there a need to set this again under replay_buffer.n_envs?
In order to be able to specify a non-scalar reward, you need to have n_envs > 1. So in your case, you would have something like
state: (features, n_nodes)
action: (1, n_nodes)
reward: (n_nodes)
and n_nodes == n_envs.
Hi!
This question is regarding the frequency of calling the TorchFort subroutines.
Since there is a loop over timesteps and a loop over agents, which of the following is recommended: calling the subroutines for every agent in every timestep, or calling them once every timestep?
For reference following is the code I want to use TorchFort with: https://github.com/ExtremeFLOW/neko/blob/release/0.9/src/wall_models/spalding.f90
Thanks.
Hi!
I want to use the soft actor-critic algorithm ("sac") for my test case, but I wanted to change to the Softsign activation function between the layers. Hence I used the attached Python script and did a JIT compilation. But I am getting the following error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7f5ba7dcd2e2 in ???
#1 0x7f5ba7dcc475 in ???
#2 0x7f5ba7ab1dbf in ???
#3 0x7f5b9206bb98 in ???
#4 0x7f5ba8507322 in _ZN2at5clampERKNS_6TensorERKSt8optionalIN3c106ScalarEES8_ at /tmp/sachinbm/relexi310/env_relexi/lib64/python3.10/site-packages/torch/include/ATen/ops/clamp.h:27
#5 0x7f5ba8507322 in _ZN9torchfort2rl14GaussianPolicy16getDistribution_EN2at6TensorE at /tmp/sachinbm/TorchFort/src/csrc/rl/policy.cpp:68
#6 0x7f5ba85075cb in _ZN9torchfort2rl14GaussianPolicy12forwardNoiseEN2at6TensorE at /tmp/sachinbm/TorchFort/src/csrc/rl/policy.cpp:104
#7 0x7f5ba8541ec0 in _ZN9torchfort2rl10off_policy9train_sacIfEEvRKNS0_10PolicyPackERKSt6vectorINS_9ModelPackESaIS7_EESB_N2at6TensorESD_SD_SD_SD_RKSt10shared_ptrINS1_10AlphaModelEERKSE_IN5torch5optim9OptimizerEERKSE_INS_15BaseLRSchedulerEERKT_SV_SV_RST_SW_ at /tmp/sachinbm/TorchFort/src/csrc/include/internal/rl/off_policy/sac.h:150
#8 0x7f5ba853a534 in _ZN9torchfort2rl10off_policy9SACSystem9trainStepERfS3_ at /tmp/sachinbm/TorchFort/src/csrc/rl/off_policy/sac.cpp:621
#9 0x7f5ba850cc43 in torchfort_rl_off_policy_train_step at /tmp/sachinbm/TorchFort/src/csrc/rl/off_policy/interface.cpp:193
#10 0x7f5ba944bff0 in __torchfort_MOD_torchfort_rl_off_policy_train_step_float at /tmp/sachinbm/TorchFort/src/fsrc/torchfort_m.F90:2745
#11 0x49f90d in ???
#12 0x45b83c in ???
#13 0x45631e in ???
#14 0x5fc6c9 in ???
#15 0x4607ad in ???
#16 0x4076f2 in ???
#17 0x40773e in ???
#18 0x7f5ba7a9c24c in ???
#19 0x4047c9 in _start at ../sysdeps/x86_64/start.S:120
#20 0xffffffffffffffff in ???
Segmentation fault (core dumped)
I don't understand the root cause of this error as I am not sure of the backend code. Please let me know how I should modify my setup to mitigate this error. PFA the code and the yaml file.
By tracking down the fact that the SAC model has a common encoder and two output layers for mu and log-sigma, I was able to resolve this error.
I see that the action returned by the predict method is either close to 0 or close to infinity, even though I have set a_low = 0.9 and a_high = 1.1. Basically, the output action is not getting clipped to the range I have specified. What may be causing this behavior?
Currently the model trains when the action is close to 0, but when it becomes infinitely large it stops learning. I'm almost ready to start the training if this issue gets solved.
Hello Sachin,
the action should not be inf, but if it becomes inf in between, the clipping will not work anymore. With SAC, you need to make sure your action space is small because of the log-probability term. This term quickly grows out of bounds when the dimension of the action space is large; I only got it to work when the action tensor has a few elements, not more than 10. The error above suggests that there is a segfault somewhere when it tries to generate the random numbers for the action. I would expect that there is a shape mismatch somewhere. I would go through the original model and print all the shapes, and then in your model also print all the shapes. In particular, the Gaussian tensor has to have the same shape as the action space, as far as I know.
Best Thorsten
Concerning the loop over agents, you do not need it: just do a 1D convolution with kernel size = 1 and it will implicitly do the loop over agents, if every entry in the 1D array is a different wall point.
Also, if you change the squashing from tanh to Softsign, then you also need to implement the log probs for that in the backend. The process is called squashing, and for a different squashing function you need to implement its own log-probability correction. For the tanh case, there is an analytic formula for the log prob of a squashed Gaussian; for Softsign I am not sure. In that case the code will default to some approximate method. I need to revisit that part, but this is tricky to get to work.
For example, stable baselines has this here:
https://github.com/DLR-RM/stable-baselines3/blob/6af0601dc3c91d5e5477f10ca6976e77a24cccc8/stable_baselines3/sac/policies.py#L172
and that routine needs to be implemented for the different squashings. Torch has analytic formulas for some, but not all, squashings.
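For reference, the analytic tanh correction looks roughly like this (a sketch of the standard squashed-Gaussian formula, not the exact TorchFort code):
import torch
from torch.distributions import Normal

def tanh_squashed_sample(mu, log_sigma):
    dist = Normal(mu, log_sigma.exp())
    u = dist.rsample()                    # unsquashed Gaussian sample
    a = torch.tanh(u)                     # squashed action in (-1, 1)
    # log pi(a) = log N(u) - sum_i log(1 - tanh(u_i)^2)
    log_prob = dist.log_prob(u).sum(dim=-1)
    log_prob = log_prob - torch.log(1.0 - a.pow(2) + 1e-6).sum(dim=-1)
    return a, log_prob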
Hi Thorsten !
Thank you very much!
Shape mismatch -
The following warning was always appearing. Does this have something to do with the shape mismatch?
[W604 11:19:06.156644710 loss.h:109] Warning: Using a target size ([512, 512]) that is different to the input size ([512]). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size. (function mse_loss)
Basically, the critic network is supposed to predict a single Q-value for each state-action pair in the batch, but somewhere a [batch_size, batch_size] matrix appears where a [batch_size, 1] (or just [batch_size]) tensor is expected.
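For what it's worth, the warning can be reproduced with a plain shape mismatch like this (just an illustration; 512 is the batch size here):
import torch
import torch.nn.functional as F

pred = torch.randn(512)          # prediction of shape [batch_size]
target = torch.randn(512, 512)   # mis-shaped target of shape [batch_size, batch_size]
loss = F.mse_loss(pred, target)  # triggers the broadcasting warning quoted above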
Activation function & changing the squashing function-
In order to preserve the tanh squashing, I now plan the following architecture: input layer -> hidden layer (1) -> Softsign -> hidden layer (2) -> Tanh -> mu & log-sigma heads. I hope I can use this architecture to avoid changing the squashing.
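In PyTorch terms, the structure I have in mind is roughly the following (a rough sketch; layer sizes are placeholders, and the tanh squashing of the sampled action itself is still left to the SAC policy in the backend):
import torch

class ActorSketch(torch.nn.Module):
    def __init__(self, n_features=3, hidden=128):
        super().__init__()
        # shared encoder: Softsign between the hidden layers, Tanh after the last one
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(n_features, hidden),
            torch.nn.Softsign(),
            torch.nn.Linear(hidden, hidden),
            torch.nn.Tanh(),
        )
        # two output heads: mean and log standard deviation of the Gaussian
        self.mu = torch.nn.Linear(hidden, 1)
        self.log_sigma = torch.nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.encoder(x)
        return self.mu(h), self.log_sigma(h)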
As per your suggestions, I was checking the shapes of the tensors. Following are the results.
[DEBUG] SACMLPModel::forward
inputs: 1
x: [11664, 3]
x: [11664, 3]
y: [11664, 1]
z: [11664, 1]
[DEBUG] forwardDeterministic Tensor shapes:
action: [11664, 1]
[DEBUG] SACSystem::predict
action : [11664, 1]
a_min : -nan
[ CPUFloatType{} ]
a_max : -nan
[ CPUFloatType{} ]
result of predict_float_2d_2d: 0
[DEBUG] Tensor shapes:
s: [11664, 3]
a: [11664, 1]
sp: [11664, 3]
r: [11664, 1]
d: [11664, 1]
result of update_replay_buffer_multi: 0 result of policy_is_ready: 0
[DEBUG] SACMLPModel::forward
inputs: 1
x: [11664, 3]
x: [11664, 3]
y: [11664, 1]
z: [11664, 1]
[DEBUG] getDistribution_ Tensor shapes:
action_sigma: [11664, 1]
action_mu: [11664, 1]
[DEBUG] forwardNoise Tensor shapes:
action: [11664, 1]
log_prob: [11664]
[DEBUG] reward_tensor shape: [11664, 1]
[DEBUG] q_new_tensor shape: [11664]
[DEBUG] d_tensor shape: [11664, 1]
[DEBUG] q_old_tensor shape: [11664]
[DEBUG] y_tensor shape: [11664]
The action and log_prob shapes are different, and tensor::min(action) and tensor::max(action) are resulting in NaN values.
Also, I am seeing that my SACMLP network outputs, before being passed into the squashing functions, are extremely large:
y_min : -9.86553e+18
[ CPUFloatType{} ]
y_max : -9.86553e+18
[ CPUFloatType{} ]
z_min : 2.6944e+18
[ CPUFloatType{} ]
z_max : 2.6944e+18
And the input itself gives NaN values when printed from the predict function in off_policy.h:
[DEBUG] off_policy.h - predict
state_tensor : [11664, 3]
min : nan
[ CPUFloatType{} ]
max : nan
[ CPUFloatType{} ]
The shape of the state_tensor is right, but the values are getting corrupted. With reference to this, the following is the Fortran call I am using:
res = torchfort_rl_off_policy_predict(tf_key, transpose(this%state), transpose(this%action))
How can I trace what is happening? Looking forward to your inputs.
Yes, so your action space is 11664-dimensional? That means the log prob will likely explode. For methods which use the log prob (like PPO and SAC) you need to narrow down the action space somehow. In the original paper you were trying to reproduce, they treated the agents for each wall point independently, so the log prob is computed from a single number. Once you feed the whole tensor, it will blow up. What you need to do in that case is treat the setup as 11664 independent environments, i.e. set n_env to 11664 and then ensure that you pass the rewards correctly. You need to pass as many rewards as you have n_envs, and the network should now treat the n_env dim as the batch-size dim, so your input will be of shape [B * n_env, features].
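To see the scaling issue, compare the summed log probability of a 1-dimensional and an 11664-dimensional Gaussian sample (just an illustration):
import torch
from torch.distributions import Normal

for dim in (1, 11664):
    dist = Normal(torch.zeros(dim), torch.ones(dim))
    u = dist.sample()
    # the summed log prob grows linearly with the action dimension
    print(dim, dist.log_prob(u).sum().item())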
Thank you! Currently I am using n_envs = 11664. I started to trace what is happening using the debugger, and the following was the result.
I see that there are some huge values coming into the state array:
(gdb) x /100f (float*)state
0x11bd4ae0: -1.18063349e-15 1.78578556 -1.91713117e+20 -2.03346205
0x11bd4af0: 6.21707625e-18 1.12706995 4.98179441e+16 1.78507316
0x11bd4b00: 5.66422824e-14 -2.0317328 6.21707625e-18 1.12706995
0x11bd4b10: 9.52281712e-33 1.78256297 51440880 -2.02667618
0x11bd4b20: 6.21707625e-18 1.12706995 1.01713975e+29 1.78262413
This is where I am calling the predict function from my simulation-
this%state_transposed = transpose(this%state)
print *, ' Shape State : ', shape(this%state_transposed)
print *, ' Min value : ', minval(this%state_transposed)
print *, ' Max value : ', maxval(this%state_transposed)
res = torchfort_rl_off_policy_predict(tf_key, this%state_transposed, this%action_transposed)
These are the min and max values of my state array when printed from fortran:
Shape State : 3 11664
Min value : -0.13832665232620783
Max value : 0.67856877986930808
If I do not transpose my state array I was getting this error:
what(): mat1 and mat2 shapes cannot be multiplied (3x11664 and 3x128)
Any inputs on why these huge values are coming into my state array? I feel this is the reason for the NaN values obtained in action and losses.
All the shapes mentioned are the ones in PyTorch, but since Fortran is column major, you need to transpose what you are passing, so basically [B, C] would become [C, B] on the Fortran side, and then internally it will be handled correctly.
Large values in the state array should not come from TorchFort though; it means that something else went wrong before. Did you warm up the simulation before applying actuation to it? What I usually do is:
- run the simulation as-is (unactuated) for some number of steps
- run the simulation for another number of steps, computing mean, standard deviation of some of the input fields you want to use (for example wall shear for state or so)
- after the stats have been computed, switch on the actuation but instead of state, pass (state - mean) / std to the policy network for prediction.
- it makes sense to also normalize the reward to the unactuated baseline. So basically compute the mean and std of the reward in phase 2 and then use that to compute a normalized reward during training.
If you see large values popping up then, this is an indication that some of the agents are going haywire. But action clipping should help; the values should not go out of bounds quickly.
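As a sketch of the bookkeeping for the second and third steps above (plain Python for illustration only; your solver side is Fortran of course, and the names are placeholders):
import numpy as np

class RunningStats:
    # accumulate samples during the unactuated phase, then normalize with the result
    def __init__(self):
        self.samples = []

    def update(self, x):
        self.samples.append(np.asarray(x, dtype=np.float64))

    def finalize(self):
        data = np.stack(self.samples)
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0) + 1e-8   # avoid division by zero

    def normalize(self, x):
        return (np.asarray(x) - self.mean) / self.std

During the statistics phase you would call update() every step with the raw state (and likewise for the reward), call finalize() once, and from then on pass normalize(state) to the policy and use the normalized reward for training.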
Here I have not applied the actions to influence the environment. I just wanted to check the reason for getting large actions by only calling the torchfort_rl_off_policy_predict function and switching off all other functions. It seems that some garbage values or a wrong address is being accessed through the state... maybe some issue in how the Fortran array is passed to the interface.
I ran the simulation for sufficient time to get a fully developed turbulent flow, and from this checkpoint I started to call the TorchFort routines to train the model: first without using the output actions to influence the environment for a certain time, and then using them to influence the environment.
I have normalized my state array with respect to the friction velocity & viscosity. On top of this, do I need to normalize with respect to statistics like the mean & std?
Do I need to destroy the TorchFort system which is created initially? Does this have something to do with the garbage values?
I had created the above specialization and added it to the generic interface:
interface torchfort_rl_off_policy_predict
   module procedure torchfort_rl_off_policy_predict_float_2d_2d
end interface
In my implementation I have real(real64) :: state(:, :), act(:, :), while in the C binding torchfort_rl_off_policy_predict_c there is real(c_float) :: state(*), act(*). Does this have anything to do with it?
After changing to c_double it is working. The actions are within the clipped range as well. Thanks for your support!