
Interpolating Issue

Open ACDylan opened this issue 3 years ago • 14 comments

Hi - I have my high-resolution "mother simulation"; however, when I ran a snapshot with pdd, it was still running after 3 days. After canceling it, the job output showed:

[two screenshots of the job output]

The second image shows where the simulation stopped.

Is it because of a parameter?

ACDylan avatar Sep 27 '21 13:09 ACDylan

can you run powderday on this snapshot interactively? does it hang at some point if you do?

dnarayanan avatar Sep 27 '21 13:09 dnarayanan

By 'interactively', do you mean running in the terminal console rather than as a job? If so, it blocked my terminal after the first line, 'Interpolating (scatter) SPH field PartType0: 0it [00:00, ?it/s]', which then ran indefinitely.

ACDylan avatar Sep 27 '21 13:09 ACDylan

hmm interesting. How many particles are in the snapshot? This seems to be hanging in yt (though I've never seen it take 3 days to deposit the octree before).

In a terminal, how long does this take to finish running (i.e., does it ever finish)?

import yt

snapshotname = "/path/to/snapshot.hdf5"  # the snapshot in question
ds = yt.load(snapshotname)
ad = ds.derived_field_list  # building this list forces the particle index to be read

dnarayanan avatar Sep 27 '21 13:09 dnarayanan

PartType0: 13,870,234
PartType1: 10,000,000
PartType2: 10,000,000
PartType3: 1,250,000
PartType4: 1,584,425

>>> ad = ds.derived_field_list
yt : [INFO     ] 2021-09-27 22:18:46,988 Allocating for 3.670e+07 particles
yt : [INFO     ] 2021-09-27 22:18:46,988 Bounding box cannot be inferred from metadata, reading particle positions to infer bounding box
yt : [INFO     ] 2021-09-27 22:18:50,997 Load this dataset with bounding_box=[[-610.44433594 -612.21533203 -614.03771973], [616.07244873 612.08428955 614.15777588]] to avoid I/O overhead from inferring bounding_box.
Loading particle index: 100%|██████████| 53/53 [00:00<00:00, 371.52it/s]

It takes around a second to load. I can try running a simulation in a terminal again.

Edit: Maybe this is coming from this log line: 'yt : [INFO] 2021-09-20 22:32:40,241 Octree bound 31193650 particles'

I don't know why there are so many particles. GIZMO snapshot simulations usually have around 1 million octree particles; here it is 31M.
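
(Aside: the bounding-box hint in the yt log above can be applied directly at load time. A minimal sketch with the values copied from that INFO line; note this only skips the bbox-inference I/O and does not address the octree hang itself:)

import yt

snapshotname = "/path/to/snapshot.hdf5"  # placeholder path
# values copied verbatim from the yt INFO message above, with commas added;
# passing bounding_box skips the pass over particle positions that infers it
bbox = [[-610.44433594, -612.21533203, -614.03771973],
        [616.07244873, 612.08428955, 614.15777588]]
ds = yt.load(snapshotname, bounding_box=bbox)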

ACDylan avatar Sep 27 '21 13:09 ACDylan

My lab gave me a zoom-in simulation (the previous simulation is still processing; I have increased the number of cores), and as you can see, the interpolation also takes a lot of time.

[screenshot of the interpolation progress]

I'll keep you informed!

ACDylan avatar Sep 30 '21 15:09 ACDylan

are there any updates for this, or shall I close the issue?

dnarayanan avatar Feb 18 '22 16:02 dnarayanan

Hi @ACDylan and @dnarayanan, I'm trying to run Powderday on Gadget-4 HDF5 snapshots and I've got the same issue. Was there a solution for this?

aussing avatar Sep 05 '22 05:09 aussing

Hi - hmm, no, I never heard from @ACDylan again, so I'm not sure what the issue is.

@aussing do you have a snapshot that you can easily share so that I can play with it and see if I can get to the bottom of this? Also, please let me know which powderday and yt hashes you're on.

thanks!

dnarayanan avatar Sep 06 '22 13:09 dnarayanan

Here is a dropbox link to the snapshot file: https://www.dropbox.com/s/54d8hlu54ojf16d/snapshot_026.hdf5?dl=0 It's 5.7GB, but I can find a smaller snapshot file if need be.

The Powderday hash is 2395ae703e9952111bc99542f0cd14a18590fd50. I installed yt through conda; I'm using version 4.0.5 and build py38h47df419_0. To get a hash I used conda list --explicit --md5, which returned df416a6d0cabb9cc483212f16467e516.

aussing avatar Sep 07 '22 07:09 aussing

Hi @dnarayanan, I've discovered something that may or may not be related: running Powderday on our HPC system with Slurm only uses 1 CPU, even when I requested 16 and specified 16 in the Parameters_master file.

aussing avatar Sep 20 '22 05:09 aussing

Hi - I'm guessing that this actually has to do with how this is being called on your specific system.

Are you setting 16 as n_processes or n_MPI_processes? It looks like it's getting stuck in a pool.map stage, which would correspond to the former.
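
(For reference, both settings live in parameters_master. A minimal sketch; the pool.map connection for n_processes is stated above, while the role of n_MPI_processes as the MPI task count for the radiative transfer stage is my assumption:)

n_processes = 16       # size of the multiprocessing pool used in the pool.map stages
n_MPI_processes = 16   # number of MPI tasks (assumed: used by the radiative transfer step)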

dnarayanan avatar Sep 20 '22 11:09 dnarayanan

Both were set to 16

aussing avatar Sep 20 '22 23:09 aussing

Hi,

I wonder if the issue is actually in how you're calling the Slurm job. Here's an example of a job where I'm requesting a 32-process pool and 32 MPI tasks:

#! /bin/bash
#SBATCH --account narayanan
#SBATCH --qos narayanan-b
#SBATCH --job-name=smc
#SBATCH --output=pd.o
#SBATCH --error=pd.e
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --time=96:00:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=32
#SBATCH --mem-per-cpu=7500
#SBATCH --partition=hpg-default
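# The actual launch line would follow the header above. A sketch, assuming the
# standard pd_front_end.py invocation; the directory and parameter-file names
# below are placeholders:
cd /path/to/powderday_run_dir
python pd_front_end.py . parameters_master parameters_model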

You may want to contact your sysadmin to find the best Slurm configuration and see if this can be resolved on your HPC's side.

dnarayanan avatar Sep 21 '22 11:09 dnarayanan

Hi @dnarayanan, I'm still not sure why the code only runs on one CPU, but as for the original interpolating issue, I solved it by setting n_ref to 256 instead of the default 32.
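
For anyone hitting the same hang, a minimal sketch of the change in parameters_master (the comment is my reading of why raising it helps, per the octree discussion above):

# octree refinement criterion handed to yt: the maximum number of particles
# allowed in a leaf cell before it refines. Raising it from the default 32
# coarsens the deposited octree, trading grid resolution for a faster build.
n_ref = 256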

I ran into a separate issue where I got 'WARNING: photon exceeded maximum number of interactions - killing [do_lucy]' in the pd.o file, but I'm able to get around it by setting SED = False.

Edit: the photon-interaction warning seems to come up with several different parameters turned on while keeping SED = False; I'm trying to track that down at the moment. I'm also setting Imaging = False.
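
(In parameters_master terms, the workaround amounts to the sketch below; the flag names are as written above, and note that this disables output stages rather than fixing the underlying warning:)

SED = False      # skip the SED stage where the do_lucy warning appeared
Imaging = False  # the warning also surfaced with imaging on, so disable it too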

aussing avatar Oct 10 '22 01:10 aussing