[BUG]: DOMINO issues
Version
1.2.0
On which installation method(s) does this occur?
Docker
Describe the issue
Hello,
I compared DOMINO results using Version 1.2.0 and Version 1.1.1 on the same dataset, with both runs performed on a single GPU. The results are shown below. Only "surface" is used for training.
Although Version 1.2.0 demonstrates much better memory efficiency (enabling the use of a larger neural network), the outputs for Cp (pressure) and wall shear stress are considerably worse and diverge significantly from expectations.
Do you happen to have a successful benchmark case (with all settings specified) that I could replicate on my side? Thanks!
=================CONFIG.yaml=========
variables:
  surface:
    solution: # The following is for AWS DrivAer dataset.
      pressure_average: scalar
      wall_shear_stress_average: vector
  volume:
    solution: # The following is for AWS DrivAer dataset.
      velocity_average: vector
      pressure_average: scalar
  global_parameters:
    inlet_velocity:
      type: vector
      reference: [50.0] # vector [30, 0, 0] should be specified as [30], while [30, 30, 0] should be [30, 30].
    air_density:
      type: scalar
      reference: 1.225
model:
  model_type: surface # train which model? surface, volume, combined
  activation: "relu" # "relu" or "gelu"
  loss_function:
    loss_type: "mse" # mse or rmse
    area_weighing_factor: 800000 # Generally inverse of maximum area
  interp_res: [128, 64, 64] # resolution of latent space 128, 64, 48
  use_sdf_in_basis_func: true # SDF in basis function network
  positional_encoding: false # calculate positional encoding?
  volume_points_sample: 0 # Number of points to sample in volume per epoch
  surface_points_sample: 20_000 # Number of points to sample on surface per epoch
  surface_sampling_algorithm: area_weighted # random or area_weighted
  geom_points_sample: 1_000_000 # Number of points to sample on STL per epoch
  num_surface_neighbors: 21 # How many neighbors on surface?
  num_neighbors_surface: 21 # How many neighbors on surface?
  num_neighbors_volume: 0 # How many neighbors on volume?
  combine_volume_surface: false # combine volume and surface encodings
  use_surface_normals: true # Use surface normals and surface areas for surface computation?
  use_surface_area: true # Use only surface normals and not surface area
  integral_loss_scaling_factor: 100000 # Scale integral loss by this factor
  normalization: min_max_scaling # or mean_std_scaling or min_max_scaling
  encode_parameters: false # encode inlet velocity and air density in the model
  surf_loss_scaling: 10.0 # scale surface loss with this factor in combined mode
  vol_loss_scaling: 1.0 # scale volume loss with this factor in combined mode
  geometry_encoding_type: both # geometry encoder type, sdf, stl, both
  solution_calculation_mode: two-loop # one-loop is better for sharded, two-loop is lower memory but more overhead
  resampling_surface_mesh: # resampling of surface mesh before constructing kd tree
    resample: false # false or true
    points: 1_000_000 # number of points
  geometry_rep: # Hyperparameters for geometry representation network
    geo_conv:
      base_neurons: 256 # 256 or 64
      base_neurons_in: 16
      base_neurons_out: 16
      volume_radii: [0.1, 0.5, 1.0, 2.5] # radii for volume
      surface_radii: [0.01, 0.05, 1.0] # radii for surface
      surface_hops: 5 # Number of surface iterations
      volume_hops: 1 # Number of volume iterations
      volume_neighbors_in_radius: [10, 10, 10, 10] # Number of neighbors in radius for volume
      surface_neighbors_in_radius: [10, 10, 10] # Number of neighbors in radius for surface
      fourier_features: false
      num_modes: 5
      activation: ${model.activation}
    geo_processor:
      base_filters: 8
      activation: ${model.activation}
      processor_type: unet # conv or unet
      self_attention: true
      cross_attention: true
  nn_basis_functions: # Hyperparameters for basis function network
    base_layer: 512
    fourier_features: true
    num_modes: 5
    activation: ${model.activation}
  local_point_conv:
    activation: ${model.activation}
  aggregation_model: # Hyperparameters for aggregation network
    base_layer: 512
    activation: ${model.activation}
  position_encoder: # Hyperparameters for position encoding network
    base_neurons: 512
    activation: ${model.activation}
    fourier_features: true
    num_modes: 5
  geometry_local: # Hyperparameters for local geometry extraction
    volume_neighbors_in_radius: [64, 128] # Number of radius points
    surface_neighbors_in_radius: [64, 128] # Number of radius points
    volume_radii: [0.1, 0.25] # Volume radii
    surface_radii: [0.05, 0.25] # Surface radii
    base_layer: 512
  parameter_model:
    base_layer: 512
    fourier_features: true
    num_modes: 5
    activation: ${model.activation}
Minimum reproducible example
Relevant log output
Environment details
Here is a quick workaround to get a decent result.
- Use cache_data.py to convert zarr to npy
- In config.yaml, set surface_points_sample = 0
It should work on single GPU.
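For reference, the caching step boils down to dumping each array in a zarr store to a flat .npy file. The following is only a rough sketch of that idea, with an assumed directory layout and key names; it is not the actual cache_data.py script:

```python
from pathlib import Path

import numpy as np
import zarr

# Assumed locations; the real script is driven by config.yaml.
src = Path("processedDataset")
dst = Path("cache")
dst.mkdir(exist_ok=True)

for run in src.glob("*.zarr"):
    store = zarr.open(str(run), mode="r")
    # Load every array in the store and save it as a flat .npy file.
    for key in store.array_keys():
        np.save(dst / f"{run.stem}_{key}.npy", np.asarray(store[key]))
```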
Are you setting surface_points_sample = 0 or is that a typo?
Hi @RishikeshRanade,
Correct, only the surface_points_sample = 0 setting runs well on a single GPU (using .npy files), without triggering any error.
For instance, when I set surface_points_sample = 1_000, I encountered the following error.
(This was not the case in the previous Version 1.1.1.)
I hope this sheds some light.
===================================
Error executing job with overrides: []
Traceback (most recent call last):
  File "/workspace2508/physicsnemo/examples/cfd/external_aerodynamics/domino/src/train.py", line 1006, in main
    avg_loss = train_epoch(
               ^^^^^^^^^^^^
  File "/workspace2508/physicsnemo/examples/cfd/external_aerodynamics/domino/src/train.py", line 688, in train_epoch
    for i_batch, sample_batched in enumerate(dataloader):
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 733, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 789, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/workspace2508/physicsnemo/physicsnemo/datapipes/cae/domino_datapipe.py", line 1383, in __getitem__
    ii = result["neighbor_indices"]
         ~~~~~~^^^^^^^^^^^^^^^^^^^^
KeyError: 'neighbor_indices'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
It doesn't make sense to have surface_points_sample=0, because that means 0 points are sampled to calculate the loss. Are you sure it's surface_points_sample and not some other variable?
Hey @RishikeshRanade,
I agree, it's a surprise!
I've done a few single-GPU experiments for comparison; my current findings confirm that surface_points_sample=0 is what gives the biggest "improvement", though it's unexpected. I didn't observe any gains beyond that setup.
I'd be happy to run a quick test if you share your config, or to run any specific test you'd like; just let me know. Thanks!
Here are my config settings:
# ┌─────────────────────────────────────────┐
# │ Project Details                         │
# └─────────────────────────────────────────┘
project: # Project name
  name: TestREV20
exp_tag: cached # Experiment tag

# Main output directory.
project_dir: /workspace/data/${project.name}/outputs/
output: /workspace/data/${project.name}/outputs/${exp_tag}

hydra: # Hydra config
  run:
    dir: ${output}
  output_subdir: hydra # Default is .hydra which causes files not being uploaded in W&B.

# The directory to search for checkpoints to continue training.
resume_dir: ${output}/models
# ┌─────────────────────────────────────────┐
# │ Data Preprocessing                      │
# └─────────────────────────────────────────┘
data_processor: # Data processor configurable parameters
  kind: drivesim # must be either drivesim or drivaer_aws
  output_dir: /workspace/data/${project.name}/processedDataset
  input_dir: /workspace/data/${project.name}/rawDataset
  cached_dir: /workspace/data/${project.name}/cache/
  use_cache: true
  num_processors: 1
# ┌─────────────────────────────────────────┐
# │ Solution variables                      │
# └─────────────────────────────────────────┘
variables:
  surface:
    solution: # The following is for AWS DrivAer dataset.
      pressure_average: scalar
      wall_shear_stress_average: vector
  volume:
    solution: # The following is for AWS DrivAer dataset.
      velocity_average: vector
      pressure_average: scalar
  global_parameters:
    inlet_velocity:
      type: vector
      reference: [50.0] # vector [30, 0, 0] should be specified as [30], while [30, 30, 0] should be [30, 30].
    air_density:
      type: scalar
      reference: 1.225
# ┌─────────────────────────────────────────┐
# │ Training Data Configs                   │
# └─────────────────────────────────────────┘
data: # Input directory for training and validation data
  input_dir: /workspace/data/${project.name}/training_processed/
  input_dir_val: /workspace/data/${project.name}/validation_processed/
  bounding_box: # Bounding box dimensions for computational domain
    min: [-2.0, -1.0, -0.1]
    max: [5.0, 1.0, 1.0]
  bounding_box_surface: # Bounding box dimensions for car surface
    min: [-0.5, -0.3, -0.05]
    max: [1.1, 0.3, 0.5]
  gpu_preprocessing: true
  gpu_output: true
# ┌─────────────────────────────────────────┐
# │ Domain Parallelism Settings             │
# └─────────────────────────────────────────┘
domain_parallelism:
  domain_size: 1
  shard_grid: false
  shard_points: false
# ┌─────────────────────────────────────────┐
# │ Model Parameters                        │
# └─────────────────────────────────────────┘
model:
  model_type: surface # train which model? surface, volume, combined
  activation: "relu" # "relu" or "gelu"
  loss_function:
    loss_type: "mse" # mse or rmse
    area_weighing_factor: 800000 # Generally inverse of maximum area
  interp_res: [128, 64, 64] # resolution of latent space 128, 64, 48
  use_sdf_in_basis_func: true # SDF in basis function network
  positional_encoding: false # calculate positional encoding?
  volume_points_sample: 0 # Number of points to sample in volume per epoch
  surface_points_sample: 0 # Number of points to sample on surface per epoch
  surface_sampling_algorithm: random # random or area_weighted
  geom_points_sample: 600_000 # Number of points to sample on STL per epoch
  num_surface_neighbors: 2 # How many neighbors on surface?
  num_neighbors_surface: 2 # How many neighbors on surface?
  num_neighbors_volume: 0 # How many neighbors on volume?
  combine_volume_surface: false # combine volume and surface encodings
  use_surface_normals: true # Use surface normals and surface areas for surface computation?
  use_surface_area: true # Use only surface normals and not surface area
  integral_loss_scaling_factor: 100000 # Scale integral loss by this factor
  normalization: min_max_scaling # or mean_std_scaling or min_max_scaling
  encode_parameters: false # encode inlet velocity and air density in the model
  surf_loss_scaling: 10.0 # scale surface loss with this factor in combined mode
  vol_loss_scaling: 1.0 # scale volume loss with this factor in combined mode
  geometry_encoding_type: both # geometry encoder type, sdf, stl, both
  solution_calculation_mode: two-loop # one-loop is better for sharded, two-loop is lower memory but more overhead
  resampling_surface_mesh: # resampling of surface mesh before constructing kd tree
    resample: false # false or true
    points: 1_000_000 # number of points
  geometry_rep: # Hyperparameters for geometry representation network
    geo_conv:
      base_neurons: 64 # 256 or 64
      base_neurons_in: 2
      base_neurons_out: 2
      volume_radii: [0.1, 0.5, 1.0, 2.5] # radii for volume
      surface_radii: [0.1, 0.5, 1] # radii for surface
      surface_hops: 5 # Number of surface iterations
      volume_hops: 0 # Number of volume iterations
      volume_neighbors_in_radius: [0, 0, 0, 0] # Number of neighbors in radius for volume
      surface_neighbors_in_radius: [3, 3, 3] # Number of neighbors in radius for surface
      fourier_features: false
      num_modes: 5
      activation: ${model.activation}
    geo_processor:
      base_filters: 8
      activation: ${model.activation}
      processor_type: conv # conv or unet
      self_attention: false
      cross_attention: false
  nn_basis_functions: # Hyperparameters for basis function network
    base_layer: 64 # 512
    fourier_features: false
    num_modes: 5 # 5
    activation: ${model.activation}
  local_point_conv:
    activation: ${model.activation}
  aggregation_model: # Hyperparameters for aggregation network
    base_layer: 64 # 512
    activation: ${model.activation}
  position_encoder: # Hyperparameters for position encoding network
    base_neurons: 64 # 512
    activation: ${model.activation}
    fourier_features: false
    num_modes: 5 # 5
  geometry_local: # Hyperparameters for local geometry extraction
    volume_neighbors_in_radius: [64, 128] # Number of radius points
    surface_neighbors_in_radius: [32, 64] # Number of radius points
    volume_radii: [0.1, 0.25] # Volume radii
    surface_radii: [0.1, 0.25] # Surface radii
    base_layer: 64
  parameter_model:
    base_layer: 64 # 512
    fourier_features: false
    num_modes: 5 # 5
    activation: ${model.activation}
# ┌─────────────────────────────────────────┐
# │ Training Configs                        │
# └─────────────────────────────────────────┘
train: # Training configurable parameters
  epochs: 10
  checkpoint_interval: 1
  dataloader:
    batch_size: 1
    pin_memory: false # if the preprocessing is outputting GPU data, set this to false
  sampler:
    shuffle: true
    drop_last: false
  checkpoint_dir: /workspace/data/${project.name}/checkpoints # Use only for retraining
# ┌─────────────────────────────────────────┐
# │ Validation Configs                      │
# └─────────────────────────────────────────┘
val: # Validation configurable parameters
  dataloader:
    batch_size: 1
    pin_memory: false # if the preprocessing is outputting GPU data, set this to false
  sampler:
    shuffle: true
    drop_last: false
# ┌─────────────────────────────────────────┐
# │ Testing data Configs                    │
# └─────────────────────────────────────────┘
eval: # Testing configurable parameters
  test_path: /workspace/data/${project.name}/testing/ # Dir for testing data in raw format (vtp, vtu, stls)
  save_path: /workspace/data/${project.name}/prediction_results/ # Dir to save predicted results in raw format (vtp, vtu)
  checkpoint_name: DoMINO.0.8.pt # Name of checkpoint to select from saved checkpoints
  scaling_param_path: /workspace/data/${project.name}/outputs
  refine_stl: False # Automatically refine STL during inference
  stencil_size: 7 # Stencil size for evaluating surface and volume model
I'm experiencing the same issue while using version 1.3.0a0. However, when I set surface_points_sample = 0, I get the following errors:
============================================================
Traceback (most recent call last):
  File "/work/u33l96/domino_prj/domino/src/train.py", line 1004, in main
    avg_loss = train_epoch(
  File "/work/u33l96/domino_prj/domino/src/train.py", line 683, in train_epoch
    for i_batch, sample_batched in enumerate(dataloader):
  File "/work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
    data = self._next_data()
  File "/work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 788, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
Hi @fpan2232 , @kk98kk ,
@kk98kk - using non-cached data, it will fail if you set surface_points_sample to 0 (as you saw). The dataloader runs a kNN over the set of sampled points to find nearest neighbors, and this kNN breaks when the query set contains no points, which is exactly what happens with surface_points_sample=0.
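As a standalone illustration of that failure mode (this is not the actual domino_datapipe.py code; the tree library, array shapes, and k value are assumptions), an empty query set means the neighbor lookup never runs, so the "neighbor_indices" entry is never produced and the later access raises the KeyError shown in the tracebacks above:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
surface_points = rng.random((10_000, 3))  # stand-in surface point cloud
tree = cKDTree(surface_points)

def sample_with_neighbors(n_sample: int, k: int = 7) -> dict:
    result = {}
    sampled = surface_points[rng.choice(len(surface_points), n_sample, replace=False)]
    if sampled.shape[0] > 0:
        # The kNN lookup only happens when there are query points.
        _, result["neighbor_indices"] = tree.query(sampled, k=k)
    return result

ok = sample_with_neighbors(1_000)
print(ok["neighbor_indices"].shape)   # (1000, 7)

broken = sample_with_neighbors(0)     # mimics surface_points_sample=0
print(broken["neighbor_indices"])     # KeyError: 'neighbor_indices'
```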
We just merged a big update to DoMINO; alongside the performance updates, we also did some stability and accuracy testing.
Would you mind testing with the latest version of the code? Note that if you're using DrivAerML, there is an IO optimization that will help a lot. Essentially, it does an up-front sampling and saves the result to disk, rather than streaming all the volume data. Compared to the 25.03 release, end-to-end training is 30x faster or more on DrivAerML. Smaller datasets will see speed boosts too, but not as dramatic.
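To make the IO idea concrete, here is a minimal sketch; the array sizes, file names, and sampling budget are illustrative assumptions, not the PhysicsNeMo implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a large volume field that would otherwise be
# streamed from disk on every epoch (size is illustrative).
volume_points = rng.random((4_000_000, 3), dtype=np.float32)

# Sample the per-epoch point budget once, up front...
n_keep = 600_000
idx = rng.choice(volume_points.shape[0], size=n_keep, replace=False)

# ...and persist only the sampled subset; later epochs read this
# much smaller file instead of the full field.
np.save("cache/volume_points_sampled.npy", volume_points[idx])
```

Reading the small pre-sampled file each epoch replaces repeated streaming of the full volume field, which is where the speedup comes from.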
One thing we noticed during testing is the importance of using the same number of sampling points in training and testing; otherwise the validation accuracy is incorrect.
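One way to guard against that mismatch is a quick consistency check before launching evaluation; the file names below are hypothetical, and the sketch assumes both runs are driven by plain YAML configs:

```python
import yaml

# Hypothetical file names: the config used for training and the
# one used for evaluation/validation.
with open("config_train.yaml") as f:
    train_cfg = yaml.safe_load(f)
with open("config_eval.yaml") as f:
    eval_cfg = yaml.safe_load(f)

for key in ("surface_points_sample", "volume_points_sample"):
    t, e = train_cfg["model"][key], eval_cfg["model"][key]
    # Mismatched sampling sizes silently skew validation metrics.
    assert t == e, f"{key} differs: train={t} vs eval={e}"
```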
We've added metric printouts in our training scripts / examples to help ensure your model is converging properly.
Please continue to report any and all issues.
Hi all, any news or updates from your side? Were you able to try the latest code? If there are still issues, we'd like to help.
Hi,
It looks very promising so far! However, since volume data (for volume_factors) now also needs to be processed even for surface-only training, it will take a bit longer before I can give more conclusive feedback.
I also have a few general questions about the DoMINO architecture and how you handle the DrivAerML dataset. Would it be possible to discuss this in more detail via e.g. email or another channel that you prefer?
@kk98kk It would be great if you could share your work and how PhysicsNeMo has been helpful in the Discussion section. It would also be useful to get your input on https://github.com/NVIDIA/physicsnemo/discussions/1205 regarding what functionality in PhysicsNeMo has been most helpful to you.
@ram-cherukuri Of course. My results will be public sooner or later anyway, but I will check this with my professor to be sure.
@ram-cherukuri I think you can close the case. The predictions from the latest version (25.08) look correct again. Here is an example of one: