Running simulations in parallel
Hi! Is it possible to run multiple simulations in parallel, all updating the same policy/critic network using TorchFort? If so, can you please share some insights into this? Currently I'm launching one simulation using one (binary, case file) pair. What if there are multiple of these (binary, case file) pairs, all using the same YAML file and TorchScript model?
Hi @SachinBM-CE, running separate binaries in parallel will not update the same model as you describe. Each process would separately load the TorchScript model into memory and train it within the process independently. To train in parallel, you need to use the distributed RL system creation routines (e.g., https://nvidia.github.io/TorchFort/api/c_api.html#torchfort-rl-on-policy-create-distributed-system) and possibly also the _multi variants of the RL training routines if you want to have multiple "environments" per process (e.g., https://nvidia.github.io/TorchFort/api/c_api.html#torchfort-rl-on-policy-update-rollout-buffer-multi).
@azrael417 might be better able to handle follow-up questions on this one, but he is out of the office until next week.
Hi @romerojosh!
Thanks for your reply. As a future step, I wanted to run multiple CFD simulations, each having multiple agents, to train the same RL model.
Otherwise, for training on a single CFD simulation:
- As per previous discussions, I have been using `off_policy_update_replay_buffer_multi` for training multiple agents in the same CFD simulation. I hope "multiple environments" and "multi-agent RL" mean the same thing in TorchFort's lingo?
- Is `off_policy_create_distributed_system` to be used when running with `mpirun -np 4 ./binary`? Can the other variant, `off_policy_create_system`, not be used in this scenario?
- I believe it is correct that "multi-env" and "multi-agent" are the same concept, but @azrael417 will have to follow up here as I might be missing some subtle difference between the two concepts.
- Depends on what you are attempting to do. You need to use `create_distributed_system` for TorchFort to create a data-parallel model and share gradients across the ranks in the MPI communicator passed to that API. If for some reason you wanted to run multiple independent models concurrently, you can use the non-distributed API even in an MPI-launched run with multiple processes.
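For illustration, here is a rough sketch of the two call patterns. The function names are from the C API; the argument lists and device constants below are paraphrased from the docs linked above, so please verify them there before using this.

```cpp
// Rough sketch only: when to pick the distributed vs. the non-distributed
// creation call. Argument lists and device constants are paraphrased from the
// TorchFort C API docs and should be double-checked there.
#include <mpi.h>
#include <torchfort.h>

void setup_rl(MPI_Comm comm, bool share_one_model) {
  if (share_one_model) {
    // One data-parallel model: gradients are averaged across all ranks in
    // `comm` during train_step, so every rank keeps identical weights.
    torchfort_rl_off_policy_create_distributed_system("my_model", "config.yaml", comm,
                                                      TORCHFORT_DEVICE_CPU, TORCHFORT_DEVICE_CPU);
  } else {
    // Independent models: each rank loads and trains its own copy,
    // with no gradient exchange between processes.
    torchfort_rl_off_policy_create_system("my_model", "config.yaml",
                                          TORCHFORT_DEVICE_CPU, TORCHFORT_DEVICE_CPU);
  }
}
```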
I will wait for a response from @azrael417 regarding the multi-env / multi-agent.
- By running multiple independent models concurrently, do you mean each MPI process has its own copy of the neural network?
- Is `update_replay_buffer` also affected by `create_distributed_system`? Does each MPI process keep its own local replay buffer?
- How does `train_step` vary between `create_distributed_system` and `create_system`?
`update_replay_buffer` checks whether the batch dimension is equal to the `n_envs` parameter. The batch dimension, which is equal to the number of agents, may vary depending on the number of MPI processes and the different zones of the domain. What remains fixed is the total number of agents across all processes/zones.
For example, if 2 MPI processes are launched, the following data is observed:

| Rank | Boundary zone | Number of agents |
|------|---------------|------------------|
| 0    | 1             | m                |
| 0    | 2             | n                |
| 1    | 1             | m                |
| 1    | 2             | n                |
Should I do an MPI_Gather operation to collect (s, a, r, s') from all the processes/zones, whose sum total is always fixed and could be used directly to set the `n_envs` parameter in the config.yaml? Or is there any functionality in TorchFort that can help in this scenario?
P.S.: Just for context, the dimensions of s, a, r, s' are, respectively, (number of agents, number of features), (number of agents, 1), (number of agents), and (number of agents, number of features). Here the number of agents can be equal to m or n depending on the boundary zone.
Hello Sachin,
`_multi` does not refer to multi-agent but to multi-environment. It allows you to run multiple environments in a single task, but those apply to the same agents. To achieve multiple agents, you would have to take care of the independent agents in the backend. What is the actual setup you are using? Do you, for example, have two boundary layers and want to apply a different agent to each boundary layer? What is the actual model for each agent? An MLP acting on a feature vector, where you want a different MLP for each wall point?
Concerning the gather, you do not need to do that; if you initialized the distributed system accordingly, all the gradient communication should be taken care of automatically. But before we go there, we need to understand what you are trying to achieve. For example, you might want to create its own distributed system for each agent, since inside a distributed system all gradients are shared (which is what you want to avoid when running independent agents).
Best Thorsten
I am trying to reproduce https://www.nature.com/articles/s41467-022-28957-7 using TorchFort and https://github.com/ExtremeFLOW/neko.
From our previous discussions (https://github.com/NVIDIA/TorchFort/issues/48), it was suggested to use MARL via the multi-environment mechanism.
The setup is for wall-modelled LES in a channel flow (two parallel walls). The near-wall grid points act as individual agents (number of agents = number of near-wall grid points). I want to train a single MLP / wall model using multi-agent RL. All the agents contribute to a single shared policy, i.e. a single MLP / wall model.
Basically, in this paper they launch multiple CFD runs (with different Reynolds numbers), and each CFD run has the same number of agents, together updating a single MLP / wall model.
I am currently using the following functions: torchfort_rl_off_policy_create_system, predict, update_replay_buffer, train_step. I am passing s, a, r, s' with their leading dimension equal to the number of agents and setting the `n_envs` parameter under `replay_buffer` in the YAML file. Is this not the way to work with MARL?
If you want to share model information, then you should use the distributed system variant. In this case, you can use number of environments == number of agents and pass the arrays/tensors as you suggested. If you run multi-GPU and your agents (which share the same weights) live on different GPUs, you need to use the distributed variant. Now, you say that you are in a situation where m != n for different ranks? That is a somewhat more complicated scenario. Is it possible to ensure that the number of environments is the same for each rank, for example (n + m)/2 or something like that?
In this case, you can just use MPI to re-distribute the data beforehand, or you can split up the volume evenly so that each rank/GPU has the same number of wall points.
Just to clarify: if I want to run on the HPC cluster with just CPU cores (say, 1 compute node per job and N CPU cores per node), I should be using `create_distributed_system`. I hope I am right...
Regarding the equal distribution of agents across the ranks, I need to try, but it is not so easy, as ParMETIS does the mesh partitioning and I do not know how I can change that...
Additionally, when does the need arise to change the `key` variable used in these functions? As per my understanding of the documentation, `key` is a unique identifier for the neural network model...
Hello Sachin,
if you cannot influence the partitioning, then you can use all-to-all or gather/scatter to re-distribute the data between the nodes. However, I would probably not invest too much time in finding a good algorithm for that. I assume that the simulation cost is much higher than the cost of training the NN, so you can gather all data on rank 0 and run a non-distributed version of the RL algorithm there.
By key, you mean the first parameter which goes into all the TorchFort functions? This should be unique to the system. For example, in the distributed case, only systems with the same key communicate with each other and share gradients. If the key is different on different ranks, it will not work.
In any case, I think it is easier to use a non-distributed setup in your case: simply MPI-gather all the data to rank 0 and train the model there. For the agent step, just do a local predict on rank 0 and broadcast/scatter the action predictions. I am almost certain it would not have a big negative impact on overall performance compared to the case where you balance it out, since that involves a lot of edge-case handling depending on how ParMETIS decides to split things up. Does that make sense?
Best and thanks Thorsten
It is also easier to debug in case something does not work as expected, since you then only have to print replay/rollout buffer entries from that one rank and can dump everything locally.
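To make that concrete, here is a rough MPI sketch of the gather part. All names and counts are illustrative; the TorchFort calls on rank 0 are only indicated in comments, since their exact argument lists are in the API docs.

```cpp
// Rough sketch (illustrative names): gather the per-rank state arrays onto
// rank 0, where the non-distributed RL system does the training. The same
// pattern applies to a, r and s'; actions go back out via MPI_Scatterv.
#include <mpi.h>
#include <vector>

void gather_states(MPI_Comm comm,
                   const std::vector<double>& state_local, // n_local * n_features, row-major
                   int n_local, int n_features) {
  int rank, nranks;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &nranks);

  // 1) Each rank reports how many values it contributes (n_local may differ per rank).
  int sendcount = n_local * n_features;
  std::vector<int> recvcounts(nranks), displs(nranks, 0);
  MPI_Gather(&sendcount, 1, MPI_INT, recvcounts.data(), 1, MPI_INT, 0, comm);
  if (rank == 0) {
    for (int i = 1; i < nranks; ++i) displs[i] = displs[i - 1] + recvcounts[i - 1];
  }

  // 2) Gather the variable-length state data onto rank 0.
  std::vector<double> state_global;
  if (rank == 0) state_global.resize(displs[nranks - 1] + recvcounts[nranks - 1]);
  MPI_Gatherv(state_local.data(), sendcount, MPI_DOUBLE,
              state_global.data(), recvcounts.data(), displs.data(), MPI_DOUBLE, 0, comm);

  // 3) On rank 0 only: hand the gathered (s, a, r, s') arrays to the
  //    non-distributed system (update_replay_buffer / train_step / predict,
  //    exact argument lists per the TorchFort API docs), then scatter the
  //    predicted actions back to the ranks with MPI_Scatterv.
}
```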
Hi Thorsten, the MPI_Gatherv implementation on rank 0 worked! By collecting global (s, a, r, s') arrays from all ranks, I've eliminated the uneven agent distribution problem entirely. The replay buffer updates now handle the complete dataset regardless of how agents are spread across processes. Thanks for the guidance on this solution!
I am collecting 100 timesteps of simulation data, (s_1, a_1, r_1, s'_1) ... (s_100, a_100, r_100, s'_100), which is equivalent to one episode (as the terminal state is reached), after which I want to train the MLP.
How do I set the number of epochs and the number of gradient updates for training? Currently I have placed the call to the train_step function inside a do loop. How is it ensured that the entire episode's data is trained on?
Which algorithm is that, an on-policy or an off-policy one? In both cases, a call to train_step only samples a single batch from the buffer (replay or rollout), with the difference that in the PPO case it goes through the episodes chronologically sorted, while in the off-policy case it samples randomly. If you want to make sure that you train on a full episode, store exactly one episode (adjust the buffer size) and then wrap the call in a do loop with 100 train steps. However, for off-policy that is not necessarily a good idea; I would definitely make the buffer big enough to store multiple episodes, so that the algorithm can also adjust on past trajectories. I am not an expert here; this is something you need to play around with. The number of gradient updates is gradient accumulation, which is used to mimic larger batch sizes. I would just increase the batch size instead.
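As a rough sketch of the do-loop idea above (the TorchFort call itself is left schematic on purpose, since its exact argument list, loss outputs, stream argument, and so on, should be taken from the API docs):

```cpp
// Sketch of the "one episode in the buffer, then N train steps" idea.
const int episode_length = 100;  // transitions stored in the replay buffer

for (int step = 0; step < episode_length; ++step) {
  // Each call samples ONE batch from the replay buffer and does one gradient
  // update. Looping episode_length times therefore performs roughly one sweep
  // over the stored episode (off-policy sampling is random, so it is not an
  // exact epoch).
  // torchfort_rl_off_policy_train_step("my_model", /* see API docs */);
}
```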
Sure, I'm using off-policy Soft Actor-Critic for my training. I'm having some problems relating to the normalization of the state and reward... We had previously discussed this (https://github.com/NVIDIA/TorchFort/issues/48#issuecomment-2969989589)... Suppose my states are (u^+, du^+/dy^+, y^+), all in wall units, and I have some relative-error-based reward function: do I have to take the running mean and standard deviation of these quantities over a certain time frame? How is it usually done?
Hello Sachin, this is part of the research, I guess. However, I would refrain from using a changing normalization, because especially with off-policy algorithms you want to be able to compare old state transitions to newer ones, and if you change the normalization in between (which you would do if you use running stats), then the comparison is not meaningful.
In that sense, I would thermalize the system without actuation (basically get it into a steady state), compute all stats there (mean/var of the fields and mean of the reward), then use those throughout the entire training.
If you would do some fine-tuning later, you could think about recomputing those on the actuated system, but I would start with the former.
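As a small sketch of that recipe (names and structure are my own, purely illustrative): compute the statistics once from the thermalized, un-actuated phase and keep them fixed for the whole training:

```cpp
// Sketch of the fixed-normalization recipe above (illustrative names).
// Statistics are computed ONCE from the thermalized, un-actuated run and then
// frozen for the whole training, so old and new transitions stay comparable.
#include <algorithm>
#include <cmath>
#include <vector>

struct FixedNorm {
  double mean = 0.0, std = 1.0;

  // Fit mean/std from samples gathered during the un-actuated phase.
  static FixedNorm fit(const std::vector<double>& samples) {
    FixedNorm n;
    double sum = 0.0, sum_sq = 0.0;
    for (double x : samples) { sum += x; sum_sq += x * x; }
    const double N = static_cast<double>(samples.size());
    n.mean = sum / N;
    n.std  = std::sqrt(std::max(sum_sq / N - n.mean * n.mean, 1e-12));
    return n;
  }

  // Applied to every state feature (and, if desired, to the reward) before it
  // is pushed to the replay buffer or passed to predict.
  double operator()(double x) const { return (x - mean) / std; }
};
```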
Hi! I was trying to run my application on the HPC cluster, but I get the following segmentation fault:
Thread 1 "neko" received signal SIGSEGV, Segmentation fault.
0x00001555430cfb33 in mkl_blas_avx512_dgemv_t_intrinsics () from /path/to/env/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-251.el8_10.25.x86_64
I did not face this issue on my local machine, but it occurs on the HPC.
I loaded the following modules:
```
module load python/3.12-conda gcc/11.2.0 openmpi/4.1.2-gcc11.2.0 cmake/3.24.3 hdf5/1.14.5-ompi-gcc mkl/2021.4.0
```
Regarding the previous error, all I could find was the following:
On the HPC cluster, the simulation binary is linked against mkl/2023.2.0 (libmkl_gf_lp64.so.2, libmkl_sequential.so.2, libmkl_core.so.2), while PyTorch is linked against MKL 2024.2.
On my local machine, the simulation binary is linked against libblas.so.3 / liblapack.so.3, and PyTorch uses the identical MKL 2024.2.
Is the libblas / libmkl mismatch causing this error? It runs without any errors on my local machine but gives the above error on the HPC cluster.
Hi!
AVX512:
My HPC cluster nodes support the AVX512 instruction set. However, due to a limitation I suspect is related to PyTorch's build, I am forced to downgrade to AVX2 by setting `export MKL_ENABLE_INSTRUCTIONS=AVX2` to run my training jobs. I am considering trying the Intel compilers instead of the default GNU toolchain to see if I can leverage the full AVX512 capabilities. Have you faced this issue?
Training:
The training started well, but after a certain number of timesteps, the policy loss began to diverge, causing the reward to decrease, as shown in the training curve below. I am currently tuning hyperparameters to achieve a stable policy where the reward plateaus, but I'm open to suggestions on what might be causing this instability.
Error while using PPO:
I also wanted to check if PPO does a better job, and hence I set up the on-policy system and its functions. But I get the following error:
```
terminate called after throwing an instance of 'c10::Error'
  what(): The size of tensor a (8) must match the size of tensor b (2) at non-singleton dimension 1
```
Also please find attached my config file.
Is this related to this line in the backend? Shouldn't this be the bias of the actor's input layer instead of the encoder's input layer? https://github.com/NVIDIA/TorchFort/blob/d5f6308b6aba73a784db1fd95a89d524f4f7e729/src/csrc/models/rl/actor_critic_model.cpp#L44
P.S.: What is the additional encoder layer here? As you know, I am using `n_envs` to exploit multi-agent RL. My state has 2 features and my action has 1 feature.
Can you please look into the last error (posted above) when working with PPO? Thank you.
So, I never got SAC to work really well. The issue I am having with it is that the log prob diverges easily and then you see NaNs. I am not sure how to remedy this, since I am not an RL expert. I compared my implementation with that of Stable Baselines and could not find obvious bugs, but that does not mean there are none. Also, in all the small tests for simple environments, SAC performs worse than the others. I am not sure why (it still passes the tests, though, but needs a lot of tuning). Since you have a working code there, you can try to play around with the clipping and squashing.
Concerning PPO, the line you are pointing at is just a bias parameter for the previous layer. I think there is a typo there: instead of

```cpp
actor_biases.push_back(register_parameter("actor_b_entry", torch::zeros(encoder_layer_sizes[0])));
```

can you try

```cpp
actor_biases.push_back(register_parameter("actor_b_entry", torch::zeros(actor_layer_sizes[0])));
```

and let me know if that runs?
As you can see from below, the second parameter in Linear is the number of output features, and the bias has to have the same shape:

```cpp
actor_layers.push_back(register_module("actor_fc_" + std::to_string(i),
                                       torch::nn::Linear(actor_layer_sizes[i], actor_layer_sizes[i + 1])));
actor_biases.push_back(register_parameter("actor_b_" + std::to_string(i), torch::zeros(actor_layer_sizes[i + 1])));
```

Here it looks correct.
Thank you! I changed the code as per your suggestion (`actor_layer_sizes[0]` instead of `encoder_layer_sizes[0]`), and it doesn't throw the previous error anymore.
However, the predict function throws the following error now:
```
terminate called after throwing an instance of 'c10::Error'
  what(): output with shape [12800, 1] doesn't match the broadcast shape [12800, 8]
```
I am using the following configuration for my multi-agent RL case:
```yaml
rollout_buffer:
  type: gae_lambda
  parameters:
    size: 12800
    n_envs: 12800

actor_critic_model:
  type: ActorCriticMLP
  parameters:
    dropout: 0.0
    encoder_layer_sizes: [2, 16, 8]
    actor_layer_sizes: [8, 1]
    value_layer_sizes: [8, 1]
    state_dependent_sigma: False
    log_sigma_init: 0.
```
The idea behind the above config was that, based on the config templates, I assumed the output of the encoder branches into the actor network and the critic network. Is there some discrepancy in the architecture I'm setting up? Also, does PPO support the multi-agent case?
For context, the state has 2 features and the action has 1 feature.
Can you please check if any of these is causing the above issue:
https://github.com/NVIDIA/TorchFort/blob/b6353c4594a1ca4816ea94ee55ef66e391bf0be3/src/csrc/models/rl/actor_critic_model.cpp#L113
https://github.com/NVIDIA/TorchFort/blob/b6353c4594a1ca4816ea94ee55ef66e391bf0be3/src/csrc/models/rl/actor_critic_model.cpp#L105
Hi! Can you please look into the above issue when you are free? Thanks.
Hello Sachin,
Good to know that it fixed your issue. Looking at the config, I am wondering about multiple things. Your `n_envs` is equal to the rollout buffer size, which means you need to train after every step. The rollout buffer should ideally hold a longer trajectory, so it should be of size `trajectory_length * n_envs`. Then, you need to make sure that the shapes are correct. In Fortran, what are the array shapes you push as states to the rollout buffer, and what shapes do you use when running predict?
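To make the sizing concrete, with the `n_envs = 12800` from your config and, say, a 100-step trajectory (the 100 is just an example value), the buffer section would look roughly like this:

```yaml
rollout_buffer:
  type: gae_lambda
  parameters:
    size: 1280000   # trajectory_length (here 100) * n_envs (12800)
    n_envs: 12800
```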
Also, what are the other settings: are you using squashing or rescaling? And can you please tell me where exactly it fails? I need to see the full error message.