particleflow

PyTorch evaluation sometimes segfaults

Open jpata opened this issue 1 year ago • 0 comments

When running evaluation of a trained model:

export CUDA_VISIBLE_DEVICES=MIG-632c6fca-30c4-5fce-97be-8dab51b1f2f6
singularity exec -B /scratch/persistent --nv \
     --env PYTHONPATH=hep_tfds \
     --env KERAS_BACKEND=torch \
     $IMG python3.10 mlpf/pyg_pipeline.py --dataset cms --gpus 1 \
     --data-dir /scratch/persistent/joosep/tensorflow_datasets --config parameters/pytorch/pyg-cms.yaml \
     --test --make-plots --conv-type attention --gpu-batch-multiplier 10 --num-workers 1 --prefetch-factor 10 --load $WEIGHTS --test-datasets cms_pf_ttbar --ntest 50000 &> logs/eval_cms_pf_ttbar.txt

It randomly segfaults at some point:

 76%|███████▋  | 3822/5000 [52:51<29:31,  1.50s/it]
./scripts/tallinn/a100/pytorch-small.sh: line 34: 147149 Segmentation fault      (core dumped) singularity exec -B /scratch/persistent --nv --env PYTHONPATH=hep_tfds --env KERAS_BACKEND=torch $IMG python3.10 mlpf/pyg_pipeline.py --dataset cms --gpus 1 --data-dir /scratch/persistent/joosep/tensorflow_datasets --config parameters/pytorch/pyg-cms.yaml --test --make-plots --conv-type attention --gpu-batch-multiplier 10 --num-workers 1 --prefetch-factor 10 --load $WEIGHTS --test-datasets cms_pf_ttbar --ntest 50000 &> logs/eval_cms_pf_ttbar.txt

If you try again, it gets past that batch but segfaults at a later one. It's not a showstopper, but it's annoying, and it would be great to understand why this is happening.
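A possible first step toward narrowing this down, as a sketch only: Python's standard-library `faulthandler` module can dump the Python traceback of every thread when the process receives a fatal signal such as SIGSEGV. The helper name `enable_crash_tracebacks` and its `log_path` argument are hypothetical and not part of `mlpf/pyg_pipeline.py`. This would only show which Python call was active at the time of the crash; it does not identify the cause by itself.

```python
import faulthandler
import sys


def enable_crash_tracebacks(log_path=None):
    """Dump Python tracebacks of all threads on a fatal signal (e.g. SIGSEGV).

    Hypothetical helper for illustration. If log_path is given, the dump goes
    to that file, otherwise to stderr. The returned stream must stay open for
    as long as the handler is enabled, so the caller should keep a reference.
    """
    stream = open(log_path, "w") if log_path else sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return stream
```

Calling the helper once near the top of the entry script would be enough. The same behaviour is available without code changes by setting the `PYTHONFAULTHANDLER=1` environment variable, which could presumably be passed with an extra `--env` option alongside the existing ones in the command above.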

jpata, May 15 '24 05:05