Issue in multiprocessing for large datasets
Reporting an issue with dataset loading in cryodrgn v0.3.3, which added multiprocessing to the initial image preprocessing steps.
Submitted command:
sbatch -n 1 -N 1 --mem 999G -t 24:00:00 --gres="gpu:4" -J test --wrap "cryodrgn train_vae data/v2/256/particles.256.txt --ctf data/v2/256/ctf.pkl --poses data/v2/256/poses.pkl --zdim 8 --amp --enc-dim 1024 --dec-dim 1024 -o v2/00_init --multigpu >> v2/00.log"
This is a ~550k-particle, D=256 dataset. The compute node should have sufficient RAM (1 TB) to process it, but there appears to be an issue with multiprocessing trying to chunk the dataset for the worker pool. The current workaround is to restore the previous single-threaded behavior by turning off multiprocessing with --max-threads 1.
Traceback (most recent call last):
File "/nobackup/users/zhonge/anaconda3/envs/cryodrgn4/bin/cryodrgn", line 33, in <module>
sys.exit(load_entry_point('cryodrgn', 'console_scripts', 'cryodrgn')())
File "/nobackup/users/zhonge/dev/cryodrgn/master/cryodrgn/__main__.py", line 54, in main
args.func(args)
File "/nobackup/users/zhonge/dev/cryodrgn/master/cryodrgn/commands/train_vae.py", line 327, in main
data = dataset.MRCData(args.particles, norm=args.norm, invert_data=args.invert_data, ind=ind, keepreal=args.use_real, window=args.window, datadir=args.datadir, relion31=args.relion31, max_threads=args.max_threads, window_r=args.window_r)
File "/nobackup/users/zhonge/dev/cryodrgn/master/cryodrgn/dataset.py", line 125, in __init__
particles = np.asarray(p.map(fft.ht2_center, particles), dtype=np.float32)
File "/nobackup/users/zhonge/anaconda3/envs/cryodrgn4/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/nobackup/users/zhonge/anaconda3/envs/cryodrgn4/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/nobackup/users/zhonge/anaconda3/envs/cryodrgn4/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/nobackup/users/zhonge/anaconda3/envs/cryodrgn4/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/nobackup/users/zhonge/anaconda3/envs/cryodrgn4/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
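For context, the struct.error comes from the 32-bit message-length header (struct.pack("!i", n)) that multiprocessing connections use in Python < 3.8, so any single pickled task larger than 2 GiB fails; Python 3.8 lifted this limit. Pool.map's default chunksize can easily produce multi-GiB tasks on a stack this size. A possible mitigation is to pass an explicit chunksize so each pickled task stays well under the limit. This is only a sketch, not cryodrgn's actual code, and transform_image below is a hypothetical stand-in for fft.ht2_center:

```python
import numpy as np
from multiprocessing import Pool


def transform_image(img):
    # Hypothetical stand-in for cryodrgn's fft.ht2_center:
    # any picklable per-image transform illustrates the point.
    return (img * 2.0).astype(np.float32)


def map_in_safe_chunks(func, particles, processes=8, header_limit=2**31 - 1):
    """Map func over a particle stack while keeping each pickled task
    well under the 2 GiB cap imposed by the 32-bit length header
    (struct.pack('!i', n)) in Python < 3.8 multiprocessing."""
    bytes_per_image = particles[0].nbytes
    # aim for ~1/4 of the limit per task, leaving room for pickle overhead
    chunksize = max(1, (header_limit // 4) // bytes_per_image)
    with Pool(processes) as p:
        return np.asarray(p.map(func, particles, chunksize=chunksize),
                          dtype=np.float32)
```

With D=256 float32 images (256 KiB each), this caps each task near 2047 images (~512 MiB), whereas Pool.map's default of roughly len(particles) / (4 * processes) can exceed the 2 GiB header limit for a 542k-image stack and a small pool.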
Including some timing information from a run with --max-threads 1:
2021-11-26 07:55:12 Loaded 542311 256x256 images
2021-11-26 07:55:12 Windowing images with radius 0.85
2021-11-26 07:55:48 Computing FFT
2021-11-26 08:16:28 Converted to FFT
2021-11-26 08:17:07 Symmetrizing image data
2021-11-26 08:22:31 Normalized HT by 0 +/- 377.5135192871094
Hello, I ran into the same problem today and tried your fix (1.55M particles, box size 128). Even with --max-threads 1, the run repeatedly dies with a "Killed" message and no additional error output, whether run from the command line or submitted to a node. This always seems to happen sometime after the "Computing FFT" message appears.
2021-12-10 16:21:02 /fs/pool/pool-bmapps/hpcl8/app/soft/CRYODRGN/0.3.3/conda3/envs/cryodrgn/bin/cryodrgn train_vae particles_j102_box128.txt --ctf ctf_j102_box256.pkl --poses poses_j102_box256.pkl --zdim 8 -n 50 --multigpu --max-threads 1 --amp -o vae128
2021-12-10 16:21:02 Namespace(activation='relu', amp=True, batch_size=8, beta=None, beta_control=None, checkpoint=1, ctf='path/cryodrgn_allParts_1chamber/ctf_j102_box256.pkl', datadir=None, do_pose_sgd=False, domain='fourier', emb_type='quat', enc_mask=None, encode_mode='resid', func=<function main at 0x146ea1cb1050>, ind=None, invert_data=True, lazy=False, lazy_single=False, load=None, log_interval=1000, lr=0.0001, max_threads=1, multigpu=True, norm=None, num_epochs=50, outdir='path/cryodrgn_allParts_1chamber/vae128', particles='path/particles_j102_box128.txt', pdim=256, pe_dim=None, pe_type='geom_lowf', players=3, pose_lr=0.0003, poses='path/poses_j102_box256.pkl', preprocessed=False, pretrain=1, qdim=256, qlayers=3, relion31=False, seed=56059, tilt=None, tilt_deg=45, use_real=False, verbose=False, wd=0, window=True, window_r=0.85, zdim=8)
2021-12-10 16:21:02 Use cuda True
2021-12-10 16:21:02 Loading dataset from path/cryodrgn_allParts_1chamber/particles_j102_box128.txt
2021-12-10 16:37:39 Loaded 1562002 128x128 images
2021-12-10 16:37:39 Windowing images with radius 0.85
2021-12-10 16:38:49 Computing FFT
Best Jonathan
Hi, I got the same error and was glad to see I'm not the first one encountering it. I hit it while going through the tutorial: running the high-resolution train_vae (126990 256x256 images) on a machine with 1 GPU, 32 CPUs, and 64 GB RAM gave the connection error, and a machine with 1 GPU, 32 CPUs, and 128 GB RAM still failed. Fortunately, scaling up to 1 GPU, 64 CPUs, and 256 GB RAM resolved the error. Hopefully my experience will be helpful to other users.
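A rough memory estimate is consistent with this: Pool.map holds the original float32 stack, the pickled task copies in flight, and the full result list simultaneously, so peak RAM is several times the stack size. A quick back-of-the-envelope calculation (my own arithmetic, not from cryodrgn):

```python
def stack_gib(n_images, box):
    """Approximate size of one float32 particle stack copy in GiB."""
    return n_images * box * box * 4 / 2**30


# datasets mentioned in this thread
print(f"tutorial high-res: {stack_gib(126990, 256):.1f} GiB per copy")
print(f"original report:   {stack_gib(542311, 256):.1f} GiB per copy")
print(f"1.55M box-128 run: {stack_gib(1562002, 128):.1f} GiB per copy")
```

The tutorial dataset works out to roughly 31 GiB per copy, so with two or three copies resident during preprocessing, 64 GB and even 128 GB machines are marginal, while 256 GB leaves headroom.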