
Training the network with small timestep

duytrangiale opened this issue 3 years ago • 31 comments

Hi team,

I have a question about the timestep used in the paper. According to the paper, you sampled the input data at a frequency of 50 Hz. Each ground-truth simulation runs for 1 s, so the timestep in this case is 1/50 = 0.02 s. Why do we need to sample the simulation data at this frequency? Why not use all the data, with the same timestep as the SPH simulation? In my case, I use the simulation's own timestep, which is on the scale of 10^(-5) s, and the behaviour of the network is very strange. Could you please explain the reason for this?

Thanks for your help. Cheers

duytrangiale avatar Mar 27 '22 12:03 duytrangiale

The timestep is a hyperparameter and can be changed in the constructor of the model class. The training data should be compatible with the timestep used for the advection step, as this minimizes the correction the network has to learn.

Since you use very small timesteps, more changes might be necessary, e.g. reducing the radius of the convolutions or increasing the resolution of the filters.
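As a rough sketch of how the timestep enters the model (class and method names below are illustrative, not the exact repo API):

```python
# Illustrative sketch: the timestep is a constructor argument and is used in
# the semi-implicit Euler advection step that precedes the learned correction.
class ParticleModel:
    def __init__(self, timestep=1.0 / 50):
        # dt should match the sampling interval of the training data so the
        # network only has to learn a small residual correction.
        self.timestep = timestep

    def advect(self, pos, vel, gravity=(0.0, -9.81, 0.0)):
        dt = self.timestep
        # semi-implicit Euler: update velocity first, then position
        vel = [v + dt * g for v, g in zip(vel, gravity)]
        pos = [p + dt * v for p, v in zip(pos, vel)]
        return pos, vel

# A much smaller dt for a custom dataset sampled at the SPH timestep:
model = ParticleModel(timestep=2e-5)
p, v = model.advect([0.0, 1.0, 0.0], [0.0, 0.0, 0.0])
```

With dt on the order of 10^(-5), the per-step displacement is tiny, which is why the convolution radius and output scale may need retuning.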

benjaminum avatar Apr 03 '22 14:04 benjaminum

Hi Benjamin, Thanks for your suggestion. I just figured out that there is a numerical issue with the calculation when using such a small timestep. By using the float64 datatype for the input data, I can solve this. However, there is still a problem with the training code when I try to convert the input data to float64. Also, I would like to know how you arrived at the number 128 in the loss function, because it doesn't look like a random number. Thanks a lot

duytrangiale avatar Apr 03 '22 15:04 duytrangiale

The 128 is just for scaling the network output to roughly match the scale of the training data. The network can also learn this, or the network initialization could be changed to better match the scale at the beginning of training.
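A minimal sketch of this scaling idea (the constant and the adaptive variant below are illustrative, not the repo's exact code):

```python
import torch

torch.manual_seed(0)
raw = torch.randn(1000, 3)        # stand-in for the last layer's raw output
correction = (1.0 / 128) * raw    # fixed scaling of the position correction

# An alternative (illustrative) way to pick the factor: match the standard
# deviation of the ground-truth corrections in the training data.
gt_corrections = 0.008 * torch.randn(1000, 3)  # fake ground truth for the demo
scale = gt_corrections.std() / raw.std()
adaptive = scale * raw
```

Either way, the goal is only that the predicted corrections start out on the right order of magnitude; the exact constant is not special.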

benjaminum avatar Apr 03 '22 15:04 benjaminum

I got it. Thanks a lot for your suggestion.

duytrangiale avatar Apr 03 '22 15:04 duytrangiale

Hi Benjamin,

I just trained the network with a smaller convolution radius (with my very small timestep, as mentioned previously). However, it seems that the model cannot detect the boundary at all. All the particles just fall under gravity and pass through the boundary. Do you have any suggestions for tuning the parameters that may affect boundary detection?

Cheers

duytrangiale avatar Apr 04 '22 12:04 duytrangiale

The radius is an important parameter for the collision with the boundary. Another thing that can help is to increase the number of frames for which losses are computed. For debugging, it can help to gradually make the timestep smaller to see when things break, and then tune the parameters for that configuration first.

benjaminum avatar Apr 05 '22 05:04 benjaminum

Hi Benjamin,

Thanks for the suggestion. I will try this method.

Cheers

duytrangiale avatar Apr 05 '22 06:04 duytrangiale

Hi Benjamin,

I have tried tuning different parameters such as the radius scale and the filter resolution, and adding more frames when computing the loss during training. However, none of these works. The particles still move chaotically (sometimes, in cases with a few particles, they get stuck in space without moving at all). The boundary collision is still poor. At the moment, I cannot find any clear bug in the code that would explain these results. Do you have any suggestions?

One thing I have realised is that, since I use a very small timestep, using float64 instead of float32 solves the problem when updating the intermediate positions. However, with this approach I hit the problem that the library op "open3d.ml.torch.ops.build_spatial_hash_table" doesn't support float64. Do you have any ideas on that?

Thanks for your help. Cheers

duytrangiale avatar Apr 15 '22 08:04 duytrangiale

Sorry for the late reply. Unfortunately, there are no plans for float64 support for that op in Open3D.

Is float64 needed for generating the data?
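One possible workaround, sketched below under the assumption that only the neighbor-search ops are float32-only (this is not code from the repo): keep the integration state in float64 and cast to float32 just for those calls.

```python
import torch

dt = 2e-5
pos = torch.zeros(10, 3, dtype=torch.float64)
vel = torch.randn(10, 3, dtype=torch.float64)

# float64 advection keeps the tiny dt*vel increments from being rounded away
pos_intermediate = pos + dt * vel

# float32 view only for the float32-only ops (neighbor search, network);
# corrections come back and are accumulated in float64
pos32 = pos_intermediate.to(torch.float32)
correction32 = 0.01 * torch.randn_like(pos32)  # stand-in for the network output
pos_new = pos_intermediate + correction32.to(torch.float64)
```

The neighbor search only needs relative positions at the scale of the convolution radius, so float32 precision there is usually acceptable even when the integration itself needs float64.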

benjaminum avatar Jun 10 '22 14:06 benjaminum

Hi Benjamin,

It's good to see you back. As for the float64 problem, I'm currently working on it. I'm trying to modify the Open3D library code you wrote to support the float64 datatype. The reason I need to do so is that in my case, with float32, the calculation in the integration step is not correct (due to the limited precision for small numbers).

duytrangiale avatar Jun 10 '22 14:06 duytrangiale

In addition, could you please explain how you set the normal vectors for the boundary particles in the case of a complex surface? What does the vector look like in that case? Thanks for your help.

duytrangiale avatar Jun 10 '22 14:06 duytrangiale

The normal is defined by the triangle the particle has been sampled from. https://github.com/isl-org/DeepLagrangianFluids/blob/d651c6fdf2aca3fac9abe3693b20981b191b4769/datasets/create_physics_scenes.py#L141
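As a minimal sketch of that idea (the helper below is illustrative, not the repo's code): each boundary particle inherits the unit face normal of the triangle it was sampled from.

```python
import numpy as np

def triangle_normal(v0, v1, v2):
    """Unit face normal of a triangle via the cross product of two edges."""
    n = np.cross(v1 - v0, v2 - v0)
    return n / np.linalg.norm(n)

# A triangle lying in the xy-plane (counter-clockwise) has normal +z
v0 = np.array([0.0, 0.0, 0.0])
v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
n = triangle_normal(v0, v1, v2)  # -> [0, 0, 1]
```

So on a complex surface the normals simply vary per particle, following the local orientation of the mesh triangles.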

benjaminum avatar Jun 10 '22 14:06 benjaminum

Thank you, I understand now. May I ask another question about the hyperparameters, such as the radius scale, the number of layers, or the spatial resolution? After running some experiments I can identify a good configuration, but I still cannot be sure it is the optimal one. Do you have any theory to base the choice of these hyperparameters on? Thanks

duytrangiale avatar Jun 10 '22 14:06 duytrangiale

The radius should be chosen such that there is a reasonable number of neighbors on average (30-40 for our data). If the timestep is smaller, the number of neighbors can probably be smaller too. For the number of layers or the scaling, we did not have a recipe for tuning these parameters.
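A possible way to check this criterion on a custom dataset (the helper below is illustrative and brute force, not from the repo; Open3D's neighbor-search ops would be faster for large point counts):

```python
import numpy as np

def mean_neighbor_count(points, radius):
    """Average number of neighbors within `radius`, excluding each point itself."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    within = (d < radius) & (d > 0)
    return within.sum(axis=1).mean()

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(500, 3))  # fake particle cloud for the demo

# Sweep candidate radii and pick one that yields roughly 30-40 neighbors
for r in (0.1, 0.2, 0.3):
    count = mean_neighbor_count(pts, r)
```

With denser sampling (as implied by a smaller timestep and particle spacing), the same target neighbor count corresponds to a smaller radius.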

benjaminum avatar Jun 11 '22 06:06 benjaminum

Hi Benjaminum,

Thanks for the suggestion.

duytrangiale avatar Jun 11 '22 07:06 duytrangiale

Hi Benjaminum,

I have a question related to the performance of the model. In your case, have you ever tried to predict the movement of just one or two particles? How did it perform? Can it perfectly detect collisions between particles and with the boundary?

In the case of many particles, like the fluid objects you described, are all particle collisions predicted perfectly? Do any particles pass through the boundary?

Thanks in advance. Cheers

duytrangiale avatar Jun 24 '22 03:06 duytrangiale

No, we did not do experiments with just two particles.

The network handles collisions well, and in most of our scenes there are no problems, but particles can tunnel through the boundary if they move very fast.

benjaminum avatar Jun 25 '22 07:06 benjaminum

Hi Benjaminum,

Yes, that is what I saw when running different cases: when the velocity of the particles is too high, they pass through the boundary.

duytrangiale avatar Jun 25 '22 07:06 duytrangiale

Hi Benjaminum,

Recently I realised that the correction produced by the network is too small to compensate for the error, and that makes my model predict collisions really badly. Do you have any suggestions for modifying parameters in the network to improve this? Thanks. Cheers

duytrangiale avatar Jul 01 '22 06:07 duytrangiale

To check whether the network can produce larger values, you can try to overfit to simulation frames with larger correction values. If that works, then increasing the importance of larger corrections in general may help.

benjaminum avatar Jul 03 '22 06:07 benjaminum

Hi Benjamin,

Thanks for your advice. However, I'm still not entirely clear on your idea. How can I overfit to simulation frames? The problem is that I don't know which parameters of the model I should tune to increase the correction produced by the network. Thank you.

duytrangiale avatar Jul 03 '22 07:07 duytrangiale

Just take 2 or 3 frames and see if the network can overfit to the large values in those frames. To make large corrections more important, you can try changing the loss function, e.g., replacing the Euclidean distance with the squared Euclidean distance.
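A minimal sketch of the suggested loss change, with made-up toy values: squaring the per-particle distance makes the particle with the large error dominate the loss.

```python
import torch

# Two particles: one with a small position error (0.1), one with a large one (1.0)
pred = torch.tensor([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
gt   = torch.tensor([[0.1, 0.0, 0.0], [1.0, 0.0, 0.0]])

dist = torch.linalg.norm(pred - gt, dim=-1)  # per-particle Euclidean distance
loss_l2 = dist.mean()                        # -> 0.55  (errors weighted linearly)
loss_sq = (dist ** 2).mean()                 # -> 0.505 (large error dominates)
```

Under the squared loss, the large-error particle contributes 1.0/1.01 ≈ 99% of the total instead of 1.0/1.1 ≈ 91%, so the gradient pushes the network harder toward producing large corrections where they are needed.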

benjaminum avatar Jul 04 '22 08:07 benjaminum

Thanks Benjamin, let me try these tips and see what happens. Cheers

duytrangiale avatar Jul 04 '22 10:07 duytrangiale

Hi Benjamin,

Could you please explain a bit about the network architecture? In particular, I wonder why the output of the network is a position correction and not a different quantity (like a velocity or an acceleration). What is the relation between the input and output? Currently, the inputs are velocities and normal vectors and the output is a position correction; why is that?

Sorry for asking too many questions. Thanks for your help. Cheers

duytrangiale avatar Jul 06 '22 02:07 duytrangiale

Position change, velocity, and acceleration are all related. You can have a look at this paper; they experimented with different update schemes: https://cgl.ethz.ch/disclaimer.php?dlurl=/Downloads/Publications/Papers/2015/Jeo15a/Jeo15a.pdf
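A small sketch of how the three parametrizations relate up to factors of dt (illustrative, not the repo's integrator): a position correction dx over one step can equivalently be expressed as a velocity change dx/dt or an acceleration dx/dt².

```python
import torch

dt = 0.02
dx = torch.tensor([0.001, 0.0, 0.0])  # network output read as a position correction

dv = dx / dt        # the same correction expressed as a velocity change
da = dx / dt ** 2   # ... or as an acceleration acting over the step

pos = torch.zeros(3)
vel = torch.zeros(3)
pos_a = pos + dx               # position-update scheme
pos_b = pos + dt * (vel + dv)  # velocity-update scheme reaches the same position
```

So the choice is mainly about conditioning: predicting the position correction directly keeps the output on the scale of the training-data displacements, while velocity or acceleration outputs get multiplied by dt or dt² during integration.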

benjaminum avatar Jul 08 '22 08:07 benjaminum

Hi Benjamin,

Thanks for your suggestion, I will have a look at that paper. I have also followed your advice about using the squared distance in the loss function. However, the result is still bad (even worse than before). To get larger corrections, I tried modifying the scale of the network output, and it seems to have improved a bit, but not as much as expected. At the boundary there are still no collisions at all. Do you think this is a limit of the current model that I should accept, or is there still something I have missed? Cheers

duytrangiale avatar Jul 08 '22 08:07 duytrangiale

I don't think this is impossible for the model; small timesteps usually make things easier for solvers. I would really just try to train on a dataset with only 2-3 frames to see if overfitting works.

benjaminum avatar Jul 12 '22 06:07 benjaminum

Hi Benjamin,

Thanks for the comment; let me try it again. I also don't think the model itself has a problem; it may just be something wrong in my implementation. Another thing: I see that in your training data each simulation scenario is divided into 16 msgpack files. Is this just for convenience, or is there a reason for choosing this number?

duytrangiale avatar Jul 12 '22 08:07 duytrangiale

It is just to optimize data I/O.

benjaminum avatar Jul 13 '22 06:07 benjaminum

Ah ok, thank you!

duytrangiale avatar Jul 13 '22 06:07 duytrangiale