
🐛[BUG]: DOMINO issues

Open fpan2232 opened this issue 3 months ago • 13 comments

Version

1.2.0

On which installation method(s) does this occur?

Docker

Describe the issue

Hello,

I compared DOMINO results using Version 1.2.0 and Version 1.1.1 on the same dataset, with both runs performed on a single GPU. The results are shown below. Only "surface" is used for training.

Although Version 1.2.0 demonstrates much better memory efficiency, enabling the use of a larger neural network, the outputs for Cp (pressure) and wall shear stress are considerably worse and diverge significantly from expectations.

Do you happen to have a successful benchmark case (with all settings specified) that I could replicate on my side? Thanks!

Image Image Image

=================CONFIG.yaml=========
variables:
  surface:
    solution:
      # The following is for AWS DrivAer dataset.
      pressure_average: scalar
      wall_shear_stress_average: vector
  volume:
    solution:
      # The following is for AWS DrivAer dataset.
      velocity_average: vector
      pressure_average: scalar
  global_parameters:
    inlet_velocity:
      type: vector
      reference: [50.0] # vector [30, 0, 0] should be specified as [30], while [30, 30, 0] should be [30, 30].
    air_density:
      type: scalar
      reference: 1.225

model:
  model_type: surface # train which model? surface, volume, combined
  activation: "relu" # "relu" or "gelu"
  loss_function:
    loss_type: "mse" # mse or rmse
    area_weighing_factor: 800000 # Generally inverse of maximum area
  interp_res: [128, 64, 64] # resolution of latent space 128, 64, 48
  use_sdf_in_basis_func: true # SDF in basis function network
  positional_encoding: false # calculate positional encoding?
  volume_points_sample: 0 # Number of points to sample in volume per epoch
  surface_points_sample: 20_000 # Number of points to sample on surface per epoch
  surface_sampling_algorithm: area_weighted # random or area_weighted
  geom_points_sample: 1_000_000 # Number of points to sample on STL per epoch
  num_surface_neighbors: 21 # How many neighbors on surface?
  num_neighbors_surface: 21 # How many neighbors on surface?
  num_neighbors_volume: 0 # How many neighbors on volume?
  combine_volume_surface: false # combine volume and surface encodings
  use_surface_normals: true # Use surface normals and surface areas for surface computation?
  use_surface_area: true # Use only surface normals and not surface area
  integral_loss_scaling_factor: 100000 # Scale integral loss by this factor
  normalization: min_max_scaling # or mean_std_scaling or min_max_scaling
  encode_parameters: false # encode inlet velocity and air density in the model
  surf_loss_scaling: 10.0 # scale surface loss with this factor in combined mode
  vol_loss_scaling: 1.0 # scale volume loss with this factor in combined mode
  geometry_encoding_type: both # geometry encoder type, sdf, stl, both
  solution_calculation_mode: two-loop # one-loop is better for sharded, two-loop is lower memory but more overhead
  resampling_surface_mesh: # resampling of surface mesh before constructing kd tree
    resample: false #false or true
    points: 1_000_000 # number of points
  geometry_rep: # Hyperparameters for geometry representation network
    geo_conv:
      base_neurons: 256 # 256 or 64
      base_neurons_in: 16
      base_neurons_out: 16
      volume_radii: [0.1, 0.5, 1.0, 2.5] # radii for volume
      surface_radii: [0.01, 0.05, 1.0] # radii for surface
      surface_hops: 5 # Number of surface iterations
      volume_hops: 1 # Number of volume iterations
      volume_neighbors_in_radius: [10, 10, 10, 10] # Number of neighbors in radius for volume
      surface_neighbors_in_radius: [10, 10, 10] # Number of neighbors in radius for surface
      fourier_features: false
      num_modes: 5
      activation: ${model.activation}
    geo_processor:
      base_filters: 8
      activation: ${model.activation}
      processor_type: unet # conv or unet
      self_attention: true
      cross_attention: true
  nn_basis_functions: # Hyperparameters for basis function network
    base_layer: 512
    fourier_features: true
    num_modes: 5
    activation: ${model.activation}
  local_point_conv:
    activation: ${model.activation}
  aggregation_model: # Hyperparameters for aggregation network
    base_layer: 512
    activation: ${model.activation}
  position_encoder: # Hyperparameters for position encoding network
    base_neurons: 512
    activation: ${model.activation}
    fourier_features: true
    num_modes: 5
  geometry_local: # Hyperparameters for local geometry extraction
    volume_neighbors_in_radius: [64, 128] # Number of radius points
    surface_neighbors_in_radius: [64, 128] # Number of radius points
    volume_radii: [0.1, 0.25] # Volume radii
    surface_radii: [0.05, 0.25] # Surface radii
    base_layer: 512
  parameter_model:
    base_layer: 512
    fourier_features: true
    num_modes: 5
    activation: ${model.activation}
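The comment on area_weighing_factor in the config describes it as "generally inverse of maximum area". Below is a minimal sketch of deriving such a value, assuming the surface cell areas are available as a NumPy array. The function name and the example areas are made up for illustration and are not part of PhysicsNeMo.

```python
import numpy as np


def inverse_max_area(cell_areas: np.ndarray) -> float:
    """Return 1 / max(cell_areas), the heuristic named in the config comment."""
    areas = np.asarray(cell_areas, dtype=np.float64)
    if areas.size == 0:
        raise ValueError("cell_areas is empty")
    max_area = float(areas.max())
    if max_area <= 0.0:
        raise ValueError("cell areas must be positive")
    return 1.0 / max_area


# Made-up cell areas in m^2 (hypothetical data, not from the issue).
example_areas = np.array([4.0e-7, 1.25e-6, 9.0e-7])
factor = inverse_max_area(example_areas)
print(f"area_weighing_factor ~ {factor:.0f}")
```

If the mesh resolution differs between datasets, recomputing this value per dataset may be worth trying before comparing losses across versions.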

Minimum reproducible example


Relevant log output


Environment details


fpan2232 avatar Sep 17 '25 19:09 fpan2232

Here is a quick workaround to get a decent result.

  1. Use cache_data.py to convert zarr to npy
  2. In config.yaml, set surface_points_sample = 0

It should work on single GPU.

Image

fpan2232 avatar Sep 19 '25 18:09 fpan2232

Are you setting surface_points_sample = 0 or is that a typo?

RishikeshRanade avatar Oct 02 '25 13:10 RishikeshRanade

Hi @RishikeshRanade,

Correct, only the surface_points_sample = 0 setting runs well on a single GPU (using .npy files) without triggering any error.

For instance, when I set surface_points_sample = 1_000, I encountered the following error.

(This was not the case in the previous Version 1.1.1.)

I hope this sheds some light.

===================================

Error executing job with overrides: []
Traceback (most recent call last):
  File "/workspace2508/physicsnemo/examples/cfd/external_aerodynamics/domino/src/train.py", line 1006, in main
    avg_loss = train_epoch(
               ^^^^^^^^^^^^
  File "/workspace2508/physicsnemo/examples/cfd/external_aerodynamics/domino/src/train.py", line 688, in train_epoch
    for i_batch, sample_batched in enumerate(dataloader):
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 733, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 789, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/workspace2508/physicsnemo/physicsnemo/datapipes/cae/domino_datapipe.py", line 1383, in __getitem__
    ii = result["neighbor_indices"]
         ~~~~~~^^^^^^^^^^^^^^^^^^^^
KeyError: 'neighbor_indices'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
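One way to narrow down a KeyError like this is to check whether a cached sample contains the expected key before starting a training run. The sketch below assumes each cached .npy file stores a pickled Python dict; that layout is an assumption, not something confirmed in this thread. The helper name, file name, and the stand-in sample are hypothetical.

```python
import os
import tempfile

import numpy as np


def missing_keys(npy_path, required=("neighbor_indices",)):
    """List the required keys that are absent from a dict stored in a .npy file."""
    # Assumption: the file holds a pickled dict, hence allow_pickle=True and .item().
    sample = np.load(npy_path, allow_pickle=True).item()
    return [key for key in required if key not in sample]


# Self-contained demo with a stand-in file, not real DoMINO cache output.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.npy")
    np.save(path, {"surface_fields": np.zeros((4, 4))}, allow_pickle=True)
    absent = missing_keys(path)
    print("missing keys:", absent)
```

A non-empty result would suggest the cached files were written without neighbor information, which would be consistent with the traceback above but does not by itself prove the cause.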

fpan2232 avatar Oct 02 '25 16:10 fpan2232

It doesn't make sense to have surface_points_sample=0, because that means 0 points are sampled to calculate the loss. Are you sure it's surface_points_sample and not some other variable?
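The expectation stated here can be written as a small pre-flight check on the model section of the config. This is only a sketch of that expectation: the function is hypothetical, it is not a check that exists in PhysicsNeMo, and it does not establish whether a given version treats 0 as a special value.

```python
def sampling_problems(model_cfg: dict) -> list:
    """Return human-readable problems with the point-sampling settings."""
    problems = []
    model_type = model_cfg.get("model_type")
    # A surface (or combined) model needs surface points to compute a surface loss.
    if model_type in ("surface", "combined"):
        if model_cfg.get("surface_points_sample", 0) <= 0:
            problems.append("surface_points_sample should be > 0 for surface training")
    # Likewise for the volume branch.
    if model_type in ("volume", "combined"):
        if model_cfg.get("volume_points_sample", 0) <= 0:
            problems.append("volume_points_sample should be > 0 for volume training")
    return problems


# Values taken from the workaround discussed in this thread.
cfg = {"model_type": "surface", "surface_points_sample": 0, "volume_points_sample": 0}
issues = sampling_problems(cfg)
for issue in issues:
    print("warning:", issue)
```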

RishikeshRanade avatar Oct 07 '25 18:10 RishikeshRanade

Hey @RishikeshRanade,

I agree, it's a surprise!

I've done a few single-GPU experiments for comparison. My current findings confirm that surface_points_sample=0 is what gives the biggest "improvement", though it's unexpected. I didn't observe any gains beyond that setup.

I'd be happy to run a quick test if you share your config.py, or run any specific test you'd like me to. Just LMK. Thanks!

fpan2232 avatar Oct 07 '25 18:10 fpan2232

Here is my config setting

# ┌───────────────────────────────────────────┐
# │              Project Details              │
# └───────────────────────────────────────────┘
project: # Project name
  name: TestREV20

exp_tag: cached # Experiment tag
# Main output directory.
project_dir: /workspace/data/${project.name}/outputs/
output: /workspace/data/${project.name}/outputs/${exp_tag}

hydra: # Hydra config
  run:
    dir: ${output}
  output_subdir: hydra # Default is .hydra which causes files not being uploaded in W&B.

# The directory to search for checkpoints to continue training.
resume_dir: ${output}/models

# ┌───────────────────────────────────────────┐
# │            Data Preprocessing             │
# └───────────────────────────────────────────┘
data_processor: # Data processor configurable parameters
  kind: drivesim # must be either drivesim or drivaer_aws
  output_dir: /workspace/data/${project.name}/processedDataset
  input_dir: /workspace/data/${project.name}/rawDataset
  cached_dir: /workspace/data/${project.name}/cache/
  use_cache: true
  num_processors: 1

# ┌───────────────────────────────────────────┐
# │            Solution variables             │
# └───────────────────────────────────────────┘
variables:
  surface:
    solution:
      # The following is for AWS DrivAer dataset.
      pressure_average: scalar
      wall_shear_stress_average: vector
  volume:
    solution:
      # The following is for AWS DrivAer dataset.
      velocity_average: vector
      pressure_average: scalar
  global_parameters:
    inlet_velocity:
      type: vector
      reference: [50.0] # vector [30, 0, 0] should be specified as [30], while [30, 30, 0] should be [30, 30].
    air_density:
      type: scalar
      reference: 1.225

# ┌───────────────────────────────────────────┐
# │           Training Data Configs           │
# └───────────────────────────────────────────┘
data: # Input directory for training and validation data
  input_dir: /workspace/data/${project.name}/training_processed/
  input_dir_val: /workspace/data/${project.name}/validation_processed/
  bounding_box: # Bounding box dimensions for computational domain
    min: [-2.0, -1.0 , -0.1]
    max: [5 , 1.0 , 1.0]
  bounding_box_surface: # Bounding box dimensions for car surface
    min: [-0.5, -0.3 , -0.05]
    max: [1.1 , 0.3 , 0.5]
  gpu_preprocessing: true
  gpu_output: true

# ┌───────────────────────────────────────────┐
# │        Domain Parallelism Settings        │
# └───────────────────────────────────────────┘
domain_parallelism:
  domain_size: 1
  shard_grid: false
  shard_points: false

# ┌───────────────────────────────────────────┐
# │             Model Parameters              │
# └───────────────────────────────────────────┘
model:
  model_type: surface # train which model? surface, volume, combined
  activation: "relu" # "relu" or "gelu"
  loss_function:
    loss_type: "mse" # mse or rmse
    area_weighing_factor: 800000 # Generally inverse of maximum area
  interp_res: [128, 64, 64] # resolution of latent space 128, 64, 48
  use_sdf_in_basis_func: true # SDF in basis function network
  positional_encoding: false # calculate positional encoding?
  volume_points_sample: 0 # Number of points to sample in volume per epoch
  surface_points_sample: 0 # Number of points to sample on surface per epoch
  surface_sampling_algorithm: random # random or area_weighted
  geom_points_sample: 600_000 # Number of points to sample on STL per epoch
  num_surface_neighbors: 2 # How many neighbors on surface?
  num_neighbors_surface: 2 # How many neighbors on surface?
  num_neighbors_volume: 0 # How many neighbors on volume?
  combine_volume_surface: false # combine volume and surface encodings
  use_surface_normals: true # Use surface normals and surface areas for surface computation?
  use_surface_area: true # Use only surface normals and not surface area
  integral_loss_scaling_factor: 100000 # Scale integral loss by this factor
  normalization: min_max_scaling # or mean_std_scaling or min_max_scaling
  encode_parameters: false # encode inlet velocity and air density in the model
  surf_loss_scaling: 10.0 # scale surface loss with this factor in combined mode
  vol_loss_scaling: 1.0 # scale volume loss with this factor in combined mode
  geometry_encoding_type: both # geometry encoder type, sdf, stl, both
  solution_calculation_mode: two-loop # one-loop is better for sharded, two-loop is lower memory but more overhead
  resampling_surface_mesh: # resampling of surface mesh before constructing kd tree
    resample: false #false or true
    points: 1_000_000 # number of points
  geometry_rep: # Hyperparameters for geometry representation network
    geo_conv:
      base_neurons: 64 # 256 or 64
      base_neurons_in: 2
      base_neurons_out: 2
      volume_radii: [0.1, 0.5, 1.0, 2.5] # radii for volume
      surface_radii: [0.1, 0.5, 1] # radii for surface
      surface_hops: 5 # Number of surface iterations
      volume_hops: 0 # Number of volume iterations
      volume_neighbors_in_radius: [0, 0, 0, 0] # Number of neighbors in radius for volume
      surface_neighbors_in_radius: [3, 3, 3] # Number of neighbors in radius for surface
      fourier_features: false
      num_modes: 5
      activation: ${model.activation}
    geo_processor:
      base_filters: 8
      activation: ${model.activation}
      processor_type: conv # conv or unet
      self_attention: false
      cross_attention: false
  nn_basis_functions: # Hyperparameters for basis function network
    base_layer: 64 #512
    fourier_features: false
    num_modes: 5 #5
    activation: ${model.activation}
  local_point_conv:
    activation: ${model.activation}
  aggregation_model: # Hyperparameters for aggregation network
    base_layer: 64 #512
    activation: ${model.activation}
  position_encoder: # Hyperparameters for position encoding network
    base_neurons: 64 #512
    activation: ${model.activation}
    fourier_features: false
    num_modes: 5 #5
  geometry_local: # Hyperparameters for local geometry extraction
    volume_neighbors_in_radius: [64, 128] # Number of radius points
    surface_neighbors_in_radius: [32, 64] # Number of radius points
    volume_radii: [0.1, 0.25] # Volume radii
    surface_radii: [0.1, 0.25] # Surface radii
    base_layer: 64
  parameter_model:
    base_layer: 64 #512
    fourier_features: false
    num_modes: 5 #5
    activation: ${model.activation}

# ┌───────────────────────────────────────────┐
# │             Training Configs              │
# └───────────────────────────────────────────┘
train: # Training configurable parameters
  epochs: 10
  checkpoint_interval: 1
  dataloader:
    batch_size: 1
    pin_memory: false # if the preprocessing is outputing GPU data, set this to false
  sampler:
    shuffle: true
    drop_last: false
  checkpoint_dir: /workspace/data/${project.name}/checkpoints # Use only for retrainin

# ┌───────────────────────────────────────────┐
# │            Validation Configs             │
# └───────────────────────────────────────────┘
val: # Validation configurable parameters
  dataloader:
    batch_size: 1
    pin_memory: false # if the preprocessing is outputing GPU data, set this to false
  sampler:
    shuffle: true
    drop_last: false

# ┌───────────────────────────────────────────┐
# │           Testing data Configs            │
# └───────────────────────────────────────────┘
eval: # Testing configurable parameters
  test_path: /workspace/data/${project.name}/testing/ # Dir for testing data in raw format (vtp, vtu ,stls)
  save_path: /workspace/data/${project.name}/prediction_results/ # Dir to save predicted results in raw format (vtp, vtu)
  checkpoint_name: DoMINO.0.8.pt # Name of checkpoint to select from saved checkpoints
  scaling_param_path: /workspace/data/${project.name}/outputs
  refine_stl: False # Automatically refine STL during inference
  stencil_size: 7 # Stencil size for evaluating surface and volume model
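The surface_sampling_algorithm option in this config accepts random or area_weighted. To illustrate the difference between the two, here is a minimal NumPy sketch that draws cell indices either uniformly or with probability proportional to cell area. The function is hypothetical and the sampling inside PhysicsNeMo may differ, for example in whether it samples with replacement.

```python
import numpy as np


def sample_surface_indices(cell_areas, n_points, algorithm="area_weighted", seed=0):
    """Pick surface cell indices uniformly or proportionally to cell area."""
    rng = np.random.default_rng(seed)
    areas = np.asarray(cell_areas, dtype=np.float64)
    n_cells = areas.shape[0]
    n_points = min(n_points, n_cells)  # cannot draw more unique cells than exist
    if algorithm == "random":
        return rng.choice(n_cells, size=n_points, replace=False)
    if algorithm == "area_weighted":
        probs = areas / areas.sum()  # larger cells are more likely to be drawn
        return rng.choice(n_cells, size=n_points, replace=False, p=probs)
    raise ValueError(f"unknown algorithm: {algorithm!r}")


# Made-up positive cell areas for 200 cells.
areas = np.random.default_rng(42).random(200) + 0.01
picked = sample_surface_indices(areas, 50, algorithm="area_weighted")
print(picked.shape)
```

With meshes whose cell sizes vary widely, the two options can select quite different subsets, so the choice may matter when comparing runs.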

fpan2232 avatar Oct 07 '25 18:10 fpan2232

I'm experiencing the same issue while using version 1.3.0a0. However, when I set surface_points_sample = 0, I get the following errors:

============================================================

Traceback (most recent call last):
  File "/work/u33l96/domino_prj/domino/src/train.py", line 1004, in main
    avg_loss = train_epoch(
  File "/work/u33l96/domino_prj/domino/src/train.py", line 683, in train_epoch
    for i_batch, sample_batched in enumerate(dataloader):
  File "/work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
    data = self._next_data()
  File "/work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 788, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/work/u33l96/project/physicsnemo/physicsnemo/datapipes/cae/domino_datapipe.py", line 1008, in __getitem__
    return_dict = self.preprocess_data(data_dict)
  File "/work/u33l96/project/physicsnemo/physicsnemo/datapipes/cae/domino_datapipe.py", line 972, in preprocess_data
    surface_dict = self.preprocess_surface(
  File "/work/u33l96/project/physicsnemo/physicsnemo/datapipes/cae/domino_datapipe.py", line 714, in preprocess_surface
    ii = knn.kneighbors(
  File "/work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/cuml/internals/api_decorators.py", line 211, in wrapper
    ret = func(*args, **kwargs)
  File "cuml/neighbors/nearest_neighbors.pyx", line 768, in cuml.neighbors.nearest_neighbors.NearestNeighbors.kneighbors
  File "cuml/neighbors/nearest_neighbors.pyx", line 855, in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors_internal
  File "cuml/neighbors/nearest_neighbors.pyx", line 973, in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors_dense
  File "cuml/neighbors/nearest_neighbors.pyx", line 275, in cuml.neighbors.nearest_neighbors.RBCIndex.kneighbors
RuntimeError: exception occurred! file=/__w/cuml/cuml/python/libcuml/build/py3-none-linux_x86_64/_deps/cuvs-src/cpp/src/neighbors/./detail/./fused_l2_knn.cuh line=967: l2Knn: n_query_rows must be > 0
Obtained 63 stack frames
#1 in /work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/libcuml/lib64/libcuml++.so: void cuvs::neighbors::detail::fusedL2Knn<long, float, false, float>(unsigned long, long, float, float const, float const*, unsigned long, unsigned long, int, bool, bool, CUstream_st*, cuvsDistanceType, float const*, float const*) +0x553 [0x146db46956c3]
#2 in /work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/libcuml/lib64/libcuml++.so: void cuvs::neighbors::detail::brute_force_knn_impl<long, long, float, float>(raft::resources const&, std::vector<float*, std::allocator<float*> >&, std::vector<long, std::allocator >&, long, float*, long, long*, float*, long, bool, bool, std::vector<long, std::allocator >, cuvsDistanceType, float, std::vector<float, std::allocator<float*> >, float const) +0x147d [0x146db46abbbd]
#3 in /work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/libcuml/lib64/libcuml++.so: void cuvs::neighbors::detail::search<float, long, float, std::experimental::layout_right>(raft::resources const&, cuvs::neighbors::brute_force::index<float, float> const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor, (raft::memory_type)2> >, std::experimental::mdspan<long, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor, (raft::memory_type)2> >, std::experimental::mdspan<float, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor, (raft::memory_type)2> >, cuvs::neighbors::filtering::base_filter const&) +0x18c [0x146db46e349c]
#4 in /work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/libcuml/lib64/libcuml++.so: void cuvs::neighbors::ball_cover::detail::k_closest_landmarks<long, float>(raft::resources const&, cuvs::neighbors::ball_cover::index<long, float> const&, float const*, long, long, long*, float*) +0x14a [0x146db46191aa]
#5 in /work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/libcuml/lib64/libcuml++.so: void cuvs::neighbors::ball_cover::detail::rbc_knn_query<long, float>(raft::resources const&, cuvs::neighbors::ball_cover::index<long, float> const&, long, float const*, long, long*, float*, bool, float) +0x545 [0x146db461bb25]
#6 in /work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/libcuml/lib64/libcuml++.so: ML::rbc_knn_query(raft::handle_t const&, unsigned long const&, unsigned int, float const*, unsigned int, long, long*, float*) +0xb2 [0x146db3c91f12]
#7 in /work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/cuml/neighbors/nearest_neighbors.cpython-310-x86_64-linux-gnu.so(+0x2b6da) [0x146d2af8a6da]
#8 in python: PyObject_VectorcallMethod +0x83 [0x561bb767b5b3]
#9 in /work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/cuml/neighbors/nearest_neighbors.cpython-310-x86_64-linux-gnu.so(+0x2f14c) [0x146d2af8e14c]
#10 in python: PyObject_VectorcallMethod +0x83 [0x561bb767b5b3]
#11 in /work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/cuml/neighbors/nearest_neighbors.cpython-310-x86_64-linux-gnu.so(+0x33be8) [0x146d2af92be8]
#12 in /work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/cuml/neighbors/nearest_neighbors.cpython-310-x86_64-linux-gnu.so(+0x3c49a) [0x146d2af9b49a]
#13 in python: PyObject_VectorcallMethod +0x83 [0x561bb767b5b3]
#14 in /work/u33l96/miniforge3/envs/domino/lib/python3.10/site-packages/cuml/neighbors/nearest_neighbors.cpython-310-x86_64-linux-gnu.so(+0x1a757) [0x146d2af79757]
#15 in python: PyObject_Call +0xba [0x561bb766dbfa]
#16 in python: _PyEval_EvalFrameDefault +0x2d09 [0x561bb76530b9]
#17 in python(+0x148277) [0x561bb766d277]
#18 in python: _PyEval_EvalFrameDefault +0x130f [0x561bb76516bf]
#19 in python: _PyFunction_Vectorcall +0x6c [0x561bb766055c]
#20 in python: _PyEval_EvalFrameDefault +0x6fb [0x561bb7650aab]
#21 in python: _PyFunction_Vectorcall +0x6c [0x561bb766055c]
#22 in python: _PyEval_EvalFrameDefault +0x6fb [0x561bb7650aab]
#23 in python(+0x180d6d) [0x561bb76a5d6d]
#24 in python(+0x180bda) [0x561bb76a5bda]
#25 in python: _PyEval_EvalFrameDefault +0xb41 [0x561bb7650ef1]
#26 in python: _PyFunction_Vectorcall +0x6c [0x561bb766055c]
#27 in python: _PyEval_EvalFrameDefault +0x313 [0x561bb76506c3]
#28 in python: _PyFunction_Vectorcall +0x6c [0x561bb766055c]
#29 in python: _PyEval_EvalFrameDefault +0x6fb [0x561bb7650aab]
#30 in python: _PyFunction_Vectorcall +0x6c [0x561bb766055c]
#31 in python: _PyEval_EvalFrameDefault +0x6fb [0x561bb7650aab]
#32 in python(+0x180d6d) [0x561bb76a5d6d]
#33 in python(+0x1bc82c) [0x561bb76e182c]
#34 in python(+0x19904d) [0x561bb76be04d]
#35 in python: _PyEval_EvalFrameDefault +0x9b9 [0x561bb7650d69]
#36 in python: _PyFunction_Vectorcall +0x6c [0x561bb766055c]
#37 in python: PyObject_Call +0xba [0x561bb766dbfa]
#38 in python: _PyEval_EvalFrameDefault +0x2d09 [0x561bb76530b9]
#39 in python: _PyFunction_Vectorcall +0x6c [0x561bb766055c]
#40 in python: _PyEval_EvalFrameDefault +0x313 [0x561bb76506c3]
#41 in python: _PyFunction_Vectorcall +0x6c [0x561bb766055c]
#42 in python: _PyEval_EvalFrameDefault +0x130f [0x561bb76516bf]
#43 in python(+0x148277) [0x561bb766d277]
#44 in python: _PyEval_EvalFrameDefault +0x130f [0x561bb76516bf]
#45 in python: _PyFunction_Vectorcall +0x6c [0x561bb766055c]
#46 in python: _PyEval_EvalFrameDefault +0x313 [0x561bb76506c3]
#47 in python: _PyFunction_Vectorcall +0x6c [0x561bb766055c]
#48 in python: _PyEval_EvalFrameDefault +0x313 [0x561bb76506c3]
#49 in python: _PyFunction_Vectorcall +0x6c [0x561bb766055c]
#50 in python: _PyEval_EvalFrameDefault +0x130f [0x561bb76516bf]
#51 in python: _PyFunction_Vectorcall +0x6c [0x561bb766055c]
#52 in python: _PyEval_EvalFrameDefault +0x130f [0x561bb76516bf]
#53 in python: _PyFunction_Vectorcall +0x6c [0x561bb766055c]
#54 in python: _PyEval_EvalFrameDefault +0x313 [0x561bb76506c3]
#55 in python(+0x1d11cc) [0x561bb76f61cc]
#56 in python: PyEval_EvalCode +0x85 [0x561bb76f6115]
#57 in python(+0x202bea) [0x561bb7727bea]
#58 in python(+0x1fd5d3) [0x561bb77225d3]
#59 in python(+0x972d0) [0x561bb75bc2d0]
#60 in python: _PyRun_SimpleFileObject +0x1bb [0x561bb771cc9b]
#61 in python: _PyRun_AnyFileObject +0x44 [0x561bb771c834]
#62 in python: Py_RunMain +0x371 [0x561bb7719ca1]
#63 in python: Py_BytesMain +0x37 [0x561bb76e9447]

kk98kk avatar Oct 19 '25 08:10 kk98kk

Hi @fpan2232 , @kk98kk ,

@kk98kk - using non-cached data, it will fail if you set surface_points_sample to 0 (as you see). It does a kNN over the set of sampled points to find nearest neighbors, in the dataloader. This kNN breaks if you have no points in the query set, which happens with surface_points_sample=0.
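To make the failure mode concrete, here is a small CPU-only sketch using brute-force NumPy in place of the cuML kNN from the traceback. It samples surface points, builds the query set, and raises a readable error when that set is empty. The function name and the random points are illustrative only and do not reproduce the datapipe code.

```python
import numpy as np


def knn_indices(points, queries, k):
    """Brute-force k nearest neighbors of each query among points."""
    points = np.asarray(points, dtype=np.float64)
    queries = np.asarray(queries, dtype=np.float64)
    if queries.shape[0] == 0:
        # Mirrors the "n_query_rows must be > 0" condition with a clearer message.
        raise ValueError("kNN query set is empty; check surface_points_sample")
    k = min(k, points.shape[0])
    # Squared distances, shape (n_queries, n_points).
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, :k]


rng = np.random.default_rng(0)
surface = rng.random((100, 3))  # stand-in surface points
n_sample = 0                    # plays the role of surface_points_sample
sampled = surface[rng.choice(len(surface), size=n_sample, replace=False)]

try:
    knn_indices(surface, sampled, k=7)
except ValueError as err:
    message = str(err)
    print(message)
```

A guard like this placed before the kNN call would turn the low-level CUDA exception into a configuration hint, though whether that belongs in the datapipe is a maintainer decision.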

We just merged a big update to DoMINO. In addition to performance updates, we also did some stability and accuracy testing.

Would you mind testing with the latest version of the code? Note that if you're using DrivaerML, there is an IO optimization that will help a lot. Essentially, this does an up-front sampling and saves to disk, rather than streaming all the volume data. Compared to 25.03 release, end-to-end is 30x faster or more on DrivaerML. Smaller datasets will see speed boosts but not as dramatic.
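The general idea of sampling once up front and saving the subset to disk, instead of reading the full volume data every time, can be sketched as follows. This illustrates the concept only and is not the actual PhysicsNeMo IO optimization; the function name, array shapes, and file name are hypothetical.

```python
import os
import tempfile

import numpy as np


def presample_to_disk(volume_fields, n_keep, out_path, seed=0):
    """Draw a fixed random subset of rows once and store it as a .npy file."""
    rng = np.random.default_rng(seed)
    n_total = volume_fields.shape[0]
    keep = rng.choice(n_total, size=min(n_keep, n_total), replace=False)
    keep.sort()  # keep the original row ordering within the subset
    np.save(out_path, volume_fields[keep])
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a large volume array: 10,000 points with 4 fields each.
    full = np.random.default_rng(1).random((10_000, 4))
    path = presample_to_disk(full, 512, os.path.join(tmp, "volume_subset.npy"))
    subset = np.load(path)  # later epochs read only the small file
    subset_shape = subset.shape
    print(subset_shape)
```

The trade-off is that every epoch then sees the same subset, so the number of pre-sampled points probably needs to be generous relative to the per-step sample size.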

One thing we noticed during testing is the importance of using the same number of sampling points in training and testing; otherwise the validation accuracy is incorrect.
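A quick way to catch such a mismatch is to compare the sampling settings used for training with those used for testing before running inference. The sketch below works on plain dicts; the helper is hypothetical, and where these values actually live in a given setup is an assumption.

```python
def sampling_mismatches(train_cfg, test_cfg,
                        keys=("surface_points_sample", "volume_points_sample")):
    """Return {key: (train_value, test_value)} for settings that differ."""
    return {
        key: (train_cfg.get(key), test_cfg.get(key))
        for key in keys
        if train_cfg.get(key) != test_cfg.get(key)
    }


# Hypothetical settings for a training run and a later test run.
train_settings = {"surface_points_sample": 20_000, "volume_points_sample": 0}
test_settings = {"surface_points_sample": 8_192, "volume_points_sample": 0}

mismatches = sampling_mismatches(train_settings, test_settings)
for key, (train_value, test_value) in mismatches.items():
    print(f"{key}: train={train_value}, test={test_value}")
```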

We've added metric printouts in our training scripts / examples to help ensure your model is converging properly.

Please, continue to report any and all issues.

coreyjadams avatar Oct 20 '25 14:10 coreyjadams

Hi All - any news / updates from your side? Are you able to try with the latest code? If there are still issues we'd like to help.

coreyjadams avatar Oct 28 '25 18:10 coreyjadams

Hi, It looks very promising so far! However, since volume data (for volume_factors) now also needs to be processed even for surface-only trainings, it will take a bit longer before I can give more conclusive feedback.

I also have a few general questions about the DoMINO architecture and how you handle the DrivAerML dataset. Would it be possible to discuss this in more detail via e.g. email or another channel that you prefer?

kk98kk avatar Oct 31 '25 09:10 kk98kk

@kk98kk It would be great if you could share your work, and how PhysicsNeMo has been helpful, in the Discussion section. Also, it would be useful to get your feedback at https://github.com/NVIDIA/physicsnemo/discussions/1205 on what functionality in PhysicsNeMo has been most helpful to you.

ram-cherukuri avatar Nov 07 '25 17:11 ram-cherukuri

@ram-cherukuri Of course. My results will be public sooner or later anyway, but I will check this with my professor to be sure.

kk98kk avatar Nov 12 '25 11:11 kk98kk

@ram-cherukuri I think you can close the case. The predictions from the latest version (25.08) look correct again. Here is an example of one:

Image

kk98kk avatar Nov 17 '25 11:11 kk98kk