PyTorch evaluation sometimes segfaults
When running evaluation of a trained model:
```bash
export CUDA_VISIBLE_DEVICES=MIG-632c6fca-30c4-5fce-97be-8dab51b1f2f6
singularity exec -B /scratch/persistent --nv \
    --env PYTHONPATH=hep_tfds \
    --env KERAS_BACKEND=torch \
    $IMG python3.10 mlpf/pyg_pipeline.py --dataset cms --gpus 1 \
    --data-dir /scratch/persistent/joosep/tensorflow_datasets --config parameters/pytorch/pyg-cms.yaml \
    --test --make-plots --conv-type attention --gpu-batch-multiplier 10 --num-workers 1 --prefetch-factor 10 \
    --load $WEIGHTS --test-datasets cms_pf_ttbar --ntest 50000 &> logs/eval_cms_pf_ttbar.txt
```
it randomly segfaults at some point during the test loop:
```
76%|███████▋ | 3822/5000 [52:51<29:31, 1.50s/it]
./scripts/tallinn/a100/pytorch-small.sh: line 34: 147149 Segmentation fault (core dumped) singularity exec -B /scratch/persistent --nv --env PYTHONPATH=hep_tfds --env KERAS_BACKEND=torch $IMG python3.10 mlpf/pyg_pipeline.py --dataset cms --gpus 1 --data-dir /scratch/persistent/joosep/tensorflow_datasets --config parameters/pytorch/pyg-cms.yaml --test --make-plots --conv-type attention --gpu-batch-multiplier 10 --num-workers 1 --prefetch-factor 10 --load $WEIGHTS --test-datasets cms_pf_ttbar --ntest 50000 &> logs/eval_cms_pf_ttbar.txt
```
If you rerun the evaluation, it gets past that batch but segfaults again later at a different point. It's not a showstopper, but it's annoying, and it would be great to understand why this is happening.
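
One way to narrow this down (a debugging suggestion, not something that has been tried here) is to enable Python's built-in faulthandler, which prints the Python traceback of every thread when the process receives a fatal signal such as SIGSEGV. A minimal sketch, reusing the same invocation as above:

```bash
# Same evaluation command, with the standard PYTHONFAULTHANDLER variable set so
# that the interpreter dumps Python tracebacks on SIGSEGV; the output ends up in
# the same log file via &>.
singularity exec -B /scratch/persistent --nv \
    --env PYTHONPATH=hep_tfds \
    --env KERAS_BACKEND=torch \
    --env PYTHONFAULTHANDLER=1 \
    $IMG python3.10 mlpf/pyg_pipeline.py --dataset cms --gpus 1 \
    --data-dir /scratch/persistent/joosep/tensorflow_datasets --config parameters/pytorch/pyg-cms.yaml \
    --test --make-plots --conv-type attention --gpu-batch-multiplier 10 --num-workers 1 --prefetch-factor 10 \
    --load $WEIGHTS --test-datasets cms_pf_ttbar --ntest 50000 &> logs/eval_cms_pf_ttbar.txt
```

The traceback would at least show whether the crash happens in the data-loading workers, in the model forward pass, or in the plotting code.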
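
If the fault is inside native code (CUDA kernels, the TFDS/Arrow data path, etc.), a native backtrace is more informative than a Python one. A sketch of running the same job under gdb, assuming gdb is available inside $IMG:

```bash
# Run the evaluation under gdb in batch mode; when the process segfaults, gdb
# stops and prints a native backtrace of all threads instead of just dumping core.
singularity exec -B /scratch/persistent --nv \
    --env PYTHONPATH=hep_tfds \
    --env KERAS_BACKEND=torch \
    $IMG gdb --batch -ex run -ex "thread apply all bt" --args \
    python3.10 mlpf/pyg_pipeline.py --dataset cms --gpus 1 \
    --data-dir /scratch/persistent/joosep/tensorflow_datasets --config parameters/pytorch/pyg-cms.yaml \
    --test --make-plots --conv-type attention --gpu-batch-multiplier 10 --num-workers 1 --prefetch-factor 10 \
    --load $WEIGHTS --test-datasets cms_pf_ttbar --ntest 50000 &> logs/eval_cms_pf_ttbar_gdb.txt
```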