plasma-python

Clarifying where to preprocess

Open rkube opened this issue 3 years ago • 7 comments

Preprocessing puts too much compute load on the Traverse head node.

rkube avatar Apr 25 '22 14:04 rkube

Can one of the admins verify this patch?

buildbot-princeton avatar Apr 25 '22 14:04 buildbot-princeton

To preprocess the dataset on Traverse, I need to limit the number of threads used for preprocessing: https://github.com/PPPLDeepLearning/plasma-python/issues/82

rkube avatar Apr 25 '22 18:04 rkube

There are 44 cores on a node of Traverse, right? Any reason why we can only spawn 32 threads?

felker avatar Apr 25 '22 18:04 felker

Also I am in favor of not changing the default conf.yaml to make it specific to Princeton-based systems. So:

fs_path: '/Users/'
...
max_cpus: -1

(/Users/ isn't an ideal default, but it is generic enough. Maybe it should be set to $HOME; I would need to check the parsing logic.)
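
One way these two defaults could be resolved at load time, as a minimal sketch only: `resolve_fs_path` and `resolve_max_cpus` are hypothetical helper names, not functions from plasma-python, and reading `max_cpus: -1` as "use every CPU the OS reports" is an assumption about the intended meaning.

```python
import os


def resolve_fs_path(raw_path):
    """Expand $HOME-style variables and a leading ~ in a configured path."""
    return os.path.expanduser(os.path.expandvars(raw_path))


def resolve_max_cpus(max_cpus, available=None):
    """Map a configured max_cpus value to a concrete worker count.

    Assumption: None or a non-positive value means "no limit".
    """
    if available is None:
        available = os.cpu_count() or 1
    if max_cpus is None or max_cpus <= 0:
        return available
    # Never request more workers than the machine reports.
    return min(max_cpus, available)
```

With this approach a generic `fs_path: '$HOME'` would work on any system, and a site-specific limit such as `max_cpus: 32` would only need to be set where it matters.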

felker avatar Apr 25 '22 18:04 felker

Each Traverse node has 2 processors, 16 cores per processor, and 4 threads per core. When I run preprocessing with 126 threads, it starts off well but throws errors after a while. Maybe it is running into memory limits?
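
The topology arithmetic can be written out as a small sketch. The numbers are the ones quoted above; the function name `worker_cap` and the policy of one worker per physical core are illustrative assumptions, not what plasma-python does.

```python
def worker_cap(sockets, cores_per_socket, threads_per_core, use_smt=False):
    """Return a worker count derived from the node topology.

    Assumption: without SMT we want one worker per physical core;
    with SMT we allow one worker per hardware thread.
    """
    physical_cores = sockets * cores_per_socket
    logical_cpus = physical_cores * threads_per_core
    return logical_cpus if use_smt else physical_cores


# Traverse node: 2 sockets x 16 cores x 4 threads per core
print(worker_cap(2, 16, 4))                # 32
print(worker_cap(2, 16, 4, use_smt=True))  # 128
```

So 32 corresponds to the physical cores on a node, while 126 threads is close to the 128 hardware threads.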

rkube avatar Apr 25 '22 18:04 rkube

Ah, I had assumed that the CPU model was the same as on Summit. What do you get when you run lscpu and cat /proc/cpuinfo on a Traverse compute node (just curious)?

But this problem is likely due to the 4-way SMT, which wasn't present on the Tiger cluster that the code was originally written for.

felker avatar Apr 25 '22 20:04 felker

Summit and Traverse are very similar, but not 100% identical.

(frnn) [rkube@traverse examples]$ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        6
Model:               2.3 (pvr 004e 1203)
Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2300.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127
NUMA node252 CPU(s): 
NUMA node253 CPU(s): 
NUMA node254 CPU(s): 
NUMA node255 CPU(s): 
(frnn) [rkube@traverse examples]$ cat /proc/cpuinfo 
processor       : 0
cpu             : POWER9, altivec supported
clock           : 3683.000000MHz
revision        : 2.3 (pvr 004e 1203)

processor       : 1
cpu             : POWER9, altivec supported
clock           : 3683.000000MHz
revision        : 2.3 (pvr 004e 1203)

processor       : 2
cpu             : POWER9, altivec supported
clock           : 3683.000000MHz
revision        : 2.3 (pvr 004e 1203)
...
processor       : 127
cpu             : POWER9, altivec supported
clock           : 3533.000000MHz
revision        : 2.3 (pvr 004e 1203)

timebase        : 512000000
platform        : PowerNV
model           : 8335-GTH
machine         : PowerNV 8335-GTH
firmware        : OPAL
MMU             : Radix
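
The counts in the `lscpu` output above can also be pulled out programmatically. A minimal sketch, assuming the `Key: value` layout shown here; `parse_lscpu` is a hypothetical helper, and the sample text is copied from the output above.

```python
def parse_lscpu(text):
    """Parse 'Key: value' lines from lscpu output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


# Excerpt of the lscpu output from the Traverse node
SAMPLE = """\
CPU(s):              128
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
"""

info = parse_lscpu(SAMPLE)
physical_cores = int(info["Socket(s)"]) * int(info["Core(s) per socket"])
logical_cpus = physical_cores * int(info["Thread(s) per core"])
print(physical_cores, logical_cpus)  # 32 128
```

On a live node the text could be obtained with `subprocess.run(["lscpu"], capture_output=True, text=True).stdout` instead of the hard-coded sample.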

rkube avatar Apr 25 '22 21:04 rkube