DynaMight running issue "Index tensor must have the same number of dimensions as self tensor"

I have recently installed RELION 5 on our computing clusters. 3D classification and auto-refinement jobs with Blush regularization run successfully. However, DynaMight flexibility jobs consistently fail with the error message pasted at the end. Could you help me resolve this?

Environment:

OS: CentOS7
MPI runtime: openmpi 4.1.1
RELION version: RELION-5.0-beta-0-commit-70875e
Memory: 200GB requested with sbatch
GPU: A40 with 46GB memory

Dataset:

Box size: 320 px
Pixel size: 0.996 Å
Number of particles: 58890
Description: 250 kDa transmembrane protein with multiple fragments bound

Job options:

Type of job: DynaMight Flexibility
Number of MPI processes: 1
Number of threads: 8
Full command: relion_python_dynamight optimize-deformations --refinement-star-file Refine3D/job025/run_data.star --output-directory DynaMight/job027/ --initial-model PostProcess/job026/postprocess_masked.mrc --n-gaussians 21000 --initial-threshold 0.0025 --regularization-factor 1 --n-threads 8 --preload-images --gpu-id 0 --pipeline-control DynaMight/job027/

Error message:

Initializing the particle dataset
Assigning a diameter of 299 angstrom
Number of particles: 58890
Initialized data loaders for half sets of size 26500  and  26501
consensus updates are done every  1  epochs.
box size: 320 pixel_size: 0.996 virtual pixel_size: 0.003115264797507788  dimension of latent space:  5
Number of used gaussians: 21000
Optimizing scale only
Initializing gaussian positions from reference
100%|##########| 50/50 [00:31<00:00,  1.60it/s]
Final error: 2.1116328241532756e-07
Optimizing scale only
Initializing gaussian positions from reference
100%|##########| 50/50 [00:31<00:00,  1.61it/s]
Final error: 2.1116328241532756e-07
consensus gaussian models initialized
consensus model  initialization finished
mean distance in graph for half 1: 1.8233627080917358 Angstrom ;This distance is also used to construct the initial graph 
mean distance in graph for half 2: 1.8233627080917358 Angstrom ;This distance is also used to construct the initial graph 
Computing half-set indices
100%|##########| 47/47 [00:03<00:00, 12.25it/s]
 98%|#########7| 46/47 [00:14<00:00,  3.27it/s]
Index tensor must have the same number of dimensions as self tensor

Dec 13 '23 22:12 sunchang1990

Possibly related: https://github.com/3dem/DynaMight/issues/8

Dec 13 '23 23:12 biochem-fan

Hi Sunchang,

This is probably a bug that is triggered when the batch size is such that the last batch contains only a single image. I will fix that today. As a workaround, force the batch size to a value for which the number of images in each data loader, divided by the batch size, does not leave a remainder of 1. I think --batch-size 120 would work in your case.

I hope this helps,

Johannes

Dec 14 '23 10:12 schwabjohannes
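
The workaround above amounts to a modulo check: with batch size b, a data loader over n images ends with a batch of n % b images, and the failure reportedly appears when that remainder is 1. The short Python sketch below checks candidate batch sizes against the half-set sizes printed in the log (26500 and 26501). The helper names are hypothetical and not part of DynaMight. The sketch assumes the loaders keep the final partial batch and that the two half sets are the only loaders that matter; if DynaMight builds other loaders in your run, add their sizes too.

```python
def last_batch_sizes(subset_sizes, batch_size):
    """Map each subset size to the size of its final batch (0 means no partial batch)."""
    return {n: n % batch_size for n in subset_sizes}


def safe_batch_sizes(subset_sizes, candidates):
    """Keep only batch sizes whose final batch never holds exactly one image."""
    return [b for b in candidates if all(n % b != 1 for n in subset_sizes)]


# Half-set sizes from the DynaMight log ("half sets of size 26500 and 26501").
half_sets = [26500, 26501]

print(last_batch_sizes(half_sets, 120))              # {26500: 100, 26501: 101}
print(safe_batch_sizes(half_sets, [100, 120, 121]))  # [120]
```

With these sizes, 120 leaves final batches of 100 and 101 images, whereas 100 (26501 % 100 == 1) and 121 (26500 % 121 == 1) would each leave a single-image batch.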

Thank you, Johannes, for your quick response!

I can confirm that by adding the --batch-size 120 option, DynaMight training appears to run correctly.

For your information, one new warning message does pop up, but it doesn't stop the training.

/relion5.0/lib/python3.10/site-packages/dynamight/deformations/optimize_deformations.py:1295: 
UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() 
or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor). 
'indices_half1': torch.tensor(half1_indices)

Dec 14 '23 18:12 sunchang1990
