
simple test for functionality?

Open asedova opened this issue 5 years ago • 48 comments

Hi, I am trying to troubleshoot performance of hhblits on cluster and supercomputer resources at the ORNL OLCF. We are finding that the newest version is running really slowly despite moving databases onto compute node NVME. I'm wondering if the user's sequence is just not getting any hits? Is there a very basic, short test I can run that is guaranteed to exit relatively quickly so I can optimize the thread count, run config options, etc.? I have not found anything online. I'm using the 2017_10 uniclust30 so that's already smaller than other options, and I think I have the input args set to run quicker, but it's still really bogged down.

Thanks!

asedova avatar Feb 20 '20 06:02 asedova

Could you provide a short summary of specs of the compute nodes and upload the command-line call and output log of the slow HHblits run?

Which version are you comparing to? What database did you previously use with the older version?

milot-mirdita avatar Feb 20 '20 12:02 milot-mirdita

So I believe the previous version used was 3.1.0, and the current one is 3.2. I am trying on the Summit computer, an IBM AC922; it has 42 Power9 CPUs per node with 4 hardware threads each (and 6 GPUs, which I am not using here of course). I finally tried the example on the HMMER server last night to make sure the sequence was ok. It ran jackhmmer in a few minutes, and then I ran the same sequence through HHblits on your server, which also finished in a few minutes. I then used the same command-line args that the output from your server gave:

export OMP_NUM_THREADS=10; jsrun -n 1 -a 1 -c 42 -r 1 -l CPU-CPU -d packed -b rs --smpiargs="none" ~/HHSuite/hh-suite-3.2/build/bin/hhblits -cpu 10 -i abl.fasta -n 1 -e 1e-3 -p 20 -Z 250 -z 1 -b 1 -B 250 -d /mnt/bb/adaa/uniclust30_2017/uniclust30_2017_10/uniclust30_2017_10

The database has been moved onto the compute node's NVMe (that is the /mnt/bb... path) and verified to be there. HHblits starts and gives these messages to stdout:

  • 03:00:55.047 INFO: Search results will be written to abl.hhr

  • 03:01:06.674 INFO: Searching 13605227 column state sequences.

  • 03:01:06.775 INFO: abl.fasta is in A2M, A3M or FASTA format

  • 03:01:06.775 INFO: Iteration 1

  • 03:01:06.823 INFO: Prefiltering database

  • 03:01:24.130 INFO: HMMs passed 1st prefilter (gapless profile-profile alignment) : 13468324

Then it sits for half an hour or more until I kill it. I have tried several different thread counts, 10-42. SSHing onto the compute node and running top shows hhblits running at num_threads*100% CPU.

Here I tried with both the 2020 uniclust30 and the 2017_10 one which is smaller.

I built with gcc 6.4 which is the default gcc on Summit but we do have up to 9 installed.

Oh, and the user was running the previous version on our commodity cluster (Intel Haswell CPUs, Red Hat) using the previous hhsuite database (I think it's 2019). The new version is also doing strange things on that cluster for them, only different: there the data can be copied to /dev/shm, but this time top gives close to 0% CPU usage, and hhblits also runs very slowly. We very recently moved from PBS to Slurm there and everything is not quite right yet, which is why I tried on Summit, which is stable and tested in terms of scheduler, job launcher, etc.

asedova avatar Feb 20 '20 13:02 asedova

Could you request an interactive session on your cluster and then run:

gdb --args  ~/HHSuite/hh-suite-3.2/build/bin/hhblits -cpu 10 -i abl.fasta -n 1 -e 1e-3 -p 20 -Z 250 -z 1 -b 1 -B 250 -d /mnt/bb/adaa/uniclust30_2017/uniclust30_2017_10/uniclust30_2017_10

Then press r and enter. Once it reaches the stage where nothing seems to be happening, could you interrupt the process with CTRL+C and then print a backtrace with bt and enter. Then please post the backtrace here.

After generating a backtrace you can continue the process by executing c and enter, then repeat the steps starting at CTRL+C. If anything changes a lot between backtraces, that would also be interesting for finding the cause of the slowdown.

For this to yield useful information, please make sure the project was compiled with the correct CMake build type. The call should have been:

cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .. (+other flags you might have set)

milot-mirdita avatar Feb 20 '20 14:02 milot-mirdita

Ok, I'm trying to get you a stack trace. GDB seems to be installed incompletely and all I'm getting right now is:

Program received signal SIGINT, Interrupt.
0x00002000006378b8 in poll () from /usr/lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install ibm_smpi-jsm-10.03.00.01rtm3-rh7_20190611.ppc64le
(gdb) bt
#0  0x00002000006378b8 in poll () from /usr/lib64/libc.so.6
#1  0x0000000010006084 in main ()

I also tried with ARM FORGE MAP and it seems to be having trouble with the fact that I am running a non-MPI job and with the jsrun input args.

I am going to get help from sys admin about this and get back to you with a stack trace, may take a few days.

Thanks for the help!

asedova avatar Feb 20 '20 16:02 asedova

Are you running the cluster wrapper (jsrun) inside gdb? That shouldn't really work.

In our SLURM I would call something like srun -N 1 --pty /bin/bash and then run GDB from inside this interactive session.

You could also try to completely delete and recreate the cmake build folder to make sure that there are no previous compilation steps without debug symbols left.

milot-mirdita avatar Feb 20 '20 16:02 milot-mirdita

Yes, in order to actually run on the compute nodes, even in an interactive job on Summit, you have to use jsrun. For instance, even hostname returns a different thing with/without jsrun inside an interactive job:

bash-4.2$ jsrun -n1 hostname
h50n18
bash-4.2$ hostname
batch5

Forge map is also run on top of jsrun: map --profile jsrun -n 128 ./myapp a b c

I can try it on the batch node, but it won't be able to use the NVMe then, and it won't mirror the usual compute environment.

The debug build was from a fresh git clone in a new dir.

asedova avatar Feb 20 '20 16:02 asedova

Here's the trace from running on the batch node:

#0  v_movemask (x=...) at /ccs/home/adaa/HHSuite/hhsuite-3.2-debug/hh-suite/src/simd.h:460
#1  Prefilter::swStripedByte(unsigned char*, int, unsigned char*, int, unsigned short, unsigned short, int __vector(4), int __vector(4), int __vector(4), unsigned short) (this=0x107e21e0, querySeq=0x2c8c8d20 "22)+0-10))/&+2,+5+,,,1=0-/2))2<3.)--.--0.0.-2/)))/1/.20110--21,->-0./=/<1./20.'+/,101;-40122+-//++,+/+>//,2.+/.1..))/).,,/201=2=..%/,''10200/=0,=.((&/022*=-/=0///.-*-=2220+0..//10./2//200..F0.0"..., queryLength=163, dbSeq=0x2000bb23c9a3 "\025\016\f\a\n\021\b\n\f\002\001\023\006\006\022\t\005\023\017\022\020", dbLength=, gapOpen=24, gapExtend=4, pvHLoad=0x2c8d2760, pvHStore=0x2c8d2810, pvE=, bias=) at /ccs/home/adaa/HHSuite/hhsuite-3.2-debug/hh-suite/src/hhprefilter.cpp:177
#2  0x000000001009f6f0 in Prefilter::prefilter_db () at /ccs/home/adaa/HHSuite/hhsuite-3.2-debug/hh-suite/src/hhprefilter.cpp:502
#3  0x0000200000581074 in GOMP_parallel (fn=0x1009f460 <ZN9Prefilter12prefilter_dbEP3HMMP4HashI3HitEiiiiiddiiiPA20_KfRSt6vectorISt4pairIiNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEESaISH_EESK._omp_fn.1(void)>, data=0x7ffffff650c0, num_threads=10, flags=) at /sw/summit/gcc/6.4.0/src/gcc-6.4.0/libgomp/parallel.c:168
#4  0x00000000100a0928 in Prefilter::prefilter_db (this=0x107e21e0, q_tmp=, previous_hits=0x2b153060, threads=, prefilter_gap_open=, prefilter_gap_extend=, prefilter_score_offset=, prefilter_bit_factor=, prefilter_evalue_thresh=1000, prefilter_evalue_coarse_thresh=100000, preprefilter_smax_thresh=, min_prefilter_hits=, maxnumdb=, R=, new_prefilter_hits=..., old_prefilter_hits=...) at /ccs/home/adaa/HHSuite/hhsuite-3.2-debug/hh-suite/src/hhprefilter.cpp:488
#5  0x000000001007cb38 in HHblitsDatabase::prefilter_db (this=0x107e2410, q_tmp=, previous_hits=, threads=, prefilter_gap_open=, prefilter_gap_extend=, prefilter_score_offset=, prefilter_bit_factor=, prefilter_evalue_thresh=, prefilter_evalue_coarse_thresh=, preprefilter_smax_thresh=10, min_prefilter_hits=100, maxnumbdb=20000, R=0x7ffffff6a840, new_entries=std::vector of length 0, capacity 0, old_entries=std::vector of length 0, capacity 0) at /ccs/home/adaa/HHSuite/hhsuite-3.2-debug/hh-suite/src/hhdatabase.cpp:189
#6  0x000000001001de4c in HHblits::run (this=0x7ffffff6a1e0, query_fh=, query_path=) at /ccs/home/adaa/HHSuite/hhsuite-3.2-debug/hh-suite/src/hhblits.cpp:1160
#7  0x0000000010011d54 in main (argc=, argv=) at /ccs/home/adaa/HHSuite/hhsuite-3.2-debug/hh-suite/src/hhblits_app.cpp:57

asedova avatar Feb 20 '20 16:02 asedova

If you repeatedly print a stacktrace, does it always begin with #0 v_movemask?

milot-mirdita avatar Feb 20 '20 16:02 milot-mirdita

Actually I left out this part:

Program received signal SIGINT, Interrupt.
v_movemask (x=...) at /ccs/home/adaa/HHSuite/hhsuite-3.2-debug/hh-suite/src/simd.h:460
460         result |= (t.as_char[0] & 0x80) >> 7;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.ppc64le

Let me rerun and print repeated... sec...

asedova avatar Feb 20 '20 16:02 asedova

Yes, always seems to start with #0 v_movemask (x=...)

asedova avatar Feb 20 '20 16:02 asedova

Hm okay interesting. So the implementation is either just inefficient or it’s wrong and results in an infinite loop.

It does eventually finish, right?

Nothing changed in the ppc64le code since the original implementation. So it shouldn’t have a performance difference.

milot-mirdita avatar Feb 20 '20 17:02 milot-mirdita

I have not had it finish yet but I have only waited for about 1.5 hours. There's a 2 hour limit for small jobs on Summit.

asedova avatar Feb 20 '20 17:02 asedova

That's why I was wondering if there was a small test to run... like on a tiny database.

asedova avatar Feb 20 '20 17:02 asedova

Look at the script in data/test.sh. It creates a single-sequence database. You could also download the scop db from our FTP. It contains about 6k sequences.

If it’s just slow, then we can probably replace the instruction with a popcount on ppc and regain most of the speed.

milot-mirdita avatar Feb 20 '20 17:02 milot-mirdita

ok will do

asedova avatar Feb 20 '20 17:02 asedova

Looks like there's a good chance it's an infinite loop. With scop40 I let it run for almost 2 hours with no progress. With the single-sequence db generated following data/test.sh, I have been running for an hour and still see only this:

  • 16:07:43.141 INFO: Prefiltering database

  • 16:07:43.186 INFO: HMMs passed 1st prefilter (gapless profile-profile alignment) : 1

asedova avatar Feb 20 '20 21:02 asedova

@dmiller423 would you have time to look at this issue? I've tried to reproduce it inside qemu-ppc64le, without success. I tried replacing the movemask function with a different implementation I found in OpenCV (https://github.com/opencv/opencv/pull/15235):

inline int v_movemask(simd_int value) {
    static const simd_u8 qperm = {120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 0};
    return vec_extract((simd_int)vec_vbpermq((simd_u8)value, qperm), 2);
}

That didn't yield any success either. I also tried rerunning without this change and I am getting a wrong alignment (e.g. running hhalign -i data/query.a3m -t data/query.a3m results in different e-values/scores compared to SSE/AVX).

milot-mirdita avatar Feb 20 '20 22:02 milot-mirdita
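For cross-checking candidate implementations like the one above, a scalar reference for the x86 movemask semantics can help; this is my own sketch, not code from the hh-suite sources:

```c
#include <stdint.h>

/* Scalar reference for _mm_movemask_epi8 semantics: gather the high bit
 * of each of the 16 bytes into a 16-bit mask, byte 0 mapping to bit 0.
 * A vector v_movemask can be validated against this on random inputs. */
static int movemask_ref(const uint8_t v[16]) {
    int mask = 0;
    for (int i = 0; i < 16; ++i)
        mask |= ((v[i] >> 7) & 1) << i;
    return mask;
}
```

Feeding the same 16 bytes through the vector version and this reference, byte order included, localizes whether the bug is in the movemask or upstream of it.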

I can run any tests; I really want this to be fixed because we have great hopes for running very large datasets on Summit, especially with the burst buffer and large node memory capabilities. I'm not sure I know enough about the code to do anything useful, though. Another option is to get you guys access to the machine to run tests.

asedova avatar Feb 20 '20 22:02 asedova

@milot-mirdita I don't have a lot of free time currently, but yes I will look into it as soon as I have a few minutes free.

@asedova: this changed in a recent version? The older version was working / faster?

I know it passed the tests when I added it / was working on it; replacing v_movemask() is indeed a good idea. I believe there were some bugs in the intrinsics at the time, which I worked around by using a 'C' version. Or maybe the workarounds I remember were for OpenCV. I'm not sure; it's been ~3 years since the commit.

Either way, when I get a chance I'll look. Is there a simple repro for this stall?

dmiller423 avatar Feb 21 '20 14:02 dmiller423

@dmiller423 this is the first time I have tried to build hhsuite on Summit, our IBM AC922. The previous release (3.1) was not compatible with ppc64 (I tried; it fails with SSE SIMD errors).

I get the same problem with any run of hhblits using any sequence, database, or input args.

[[email protected] hhsuite-3.2-debug]$ git clone https://github.com/soedinglab/hh-suite.git
[[email protected] build]$ module load gcc
[[email protected] build]$ module load cmake

Currently Loaded Modules:

  1) hsi/5.0.2.p5            4) DefApps        7) spectrum-mpi/10.3.0.1-20190611
  2) lsf-tools/2.0           5) gcc/6.4.0      8) forge/20.0.1
  3) darshan-runtime/3.1.7   6) cmake/3.15.2

[[email protected] build]$ cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_C_COMPILER=mpicc -DCMAKE_BUILD_TYPE=RelWithDebInfo ..

(make, make install) (interactive job)

export OMP_NUM_THREADS=10; jsrun -n 1 -a 1 -c 42 -r 1 -l CPU-CPU -d packed -b rs --smpiargs="none" ~/HHSuite/hh-suite-3.2/build/bin/hhblits -cpu 10 -i abl.fasta -n 1 -e 1e-3 -p 20 -Z 250 -z 1 -b 1 -B 250 -d /mnt/bb/adaa/uniclust30_2017/uniclust30_2017_10/uniclust30_2017_10

asedova avatar Feb 21 '20 14:02 asedova

@milot-mirdita @dmiller423 I added a print statement in the v_max function on line 430 and it prints an endless stream of ints, all with value 64511. So we definitely have an infinite loop.

The function itself works fine as a standalone; we tested it on two Power9 systems. So the error is propagating from somewhere else in the code.

asedova avatar Feb 22 '20 03:02 asedova

@milot-mirdita @dmiller423 One of our SIMD experts has been looking at the code, and it seems part of the problem is the shift-left vector instruction: in AltiVec it shifts bits, while in x86 it shifts bytes. He will probably follow up on this issue and with a PR. I just wanted to let you know ASAP, because there's a good chance that anyone using the ppc version is getting wrong answers.

asedova avatar Feb 25 '20 21:02 asedova
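To make the bit-vs-byte mismatch above concrete, here is a scalar sketch of my own (not hh-suite code), using GCC/Clang's unsigned __int128 as a stand-in for a 128-bit register:

```c
#include <stdint.h>

typedef unsigned __int128 u128;  /* stand-in for a 128-bit SIMD register */

/* x86 _mm_slli_si128 semantics: the shift count is in *bytes*. */
static u128 shl_bytes(u128 x, int n) { return x << (8 * n); }

/* A bit-counting shift, as an AltiVec port that confuses the two
 * conventions would effectively perform. */
static u128 shl_bits(u128 x, int n) { return x << n; }
```

Passing the count from the x86 code straight into a bit-counting shift moves the data only 1/8 as far, which would silently corrupt every value built from such shifts.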

I tried to replace the shifts by 22 and 23 bits in simdf32_flog2, simdf32_fpow2, and log2f4 with two separate shifts, one by two bytes and another by either 6 or 7 bits, and also to fix the other shifts in Prefilter::swStripedByte to shift by 1 byte.

Running hhalign with the example query against itself seems to still produce a wrong score.

hhalign -i data/query.a3m -t data/query.a3m

Either I applied the shifts wrong or we have further problems somewhere. I am out of my depth with AltiVec, and we will have to rely on your team's expertise to fix the issue.

milot-mirdita avatar Feb 25 '20 23:02 milot-mirdita
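For reference, the 23-bit shift discussed above is the standard trick for extracting the IEEE-754 single-precision exponent, which is what log2-style vector code does per 32-bit lane. A scalar sketch of mine, not taken from the hh-suite sources:

```c
#include <stdint.h>
#include <string.h>

/* floor(log2(x)) for positive normal floats via the IEEE-754 layout:
 * bits 30..23 hold the biased exponent, so shifting right by 23 and
 * subtracting the bias (127) yields the exponent. Vectorized log2
 * implementations perform this same shift on each 32-bit lane, so a
 * shift with the wrong width or count corrupts the exponent field. */
static int floor_log2f(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  /* type-pun without UB */
    return (int)((bits >> 23) & 0xFF) - 127;
}
```

A lane-wise shift implemented with a whole-register (or bit-miscounted) AltiVec shift would break exactly this extraction, consistent with the wrong scores observed.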

I've gone through part of the code and fixed the errors caused by the incorrect bit-shifting instructions in the src/simd.h file for AltiVec. This fixed the infinite loop in the swStripedByte function and the error in the log2f4 function. However, I'm still getting differing output on the test cases between x86 and PPC. For example, I get the following output when I run the example query

hhalign -i data/query.a3m -t data/query.a3m

On Summit with the changes

No Hit                             Prob  E-value  P-value  Score  SS   Cols  Query HMM  Template HMM
 1 sp|Q5VUD6|FA69B_HUMAN Protein  100.0   7E-80    7E-80   598.0  0.0  431   1-431      1-431 (431)

And on my laptop

No Hit                             Prob  E-value  P-value  Score   SS   Cols  Query HMM  Template HMM
 1 sp|Q5VUD6|FA69B_HUMAN Protein  100.0  9E-168   9E-168   1246.9  0.0  431   1-431      1-431 (431)

Looking at some logs, I've found a few values that are off by about 0.00001 in the calculations between the x86 and PPC versions. I'm not sure precisely what's causing it, but I've noticed some differences in how the compiler orders the instructions between the builds on the two architectures. I've noticed the errors occur at ./src/hhviterbialgorithm.cpp:295 and ./src/hhutil-inl.h:530.

For now, I've submitted a pull request that simply changes the shifting instructions. Let me know if you have any ideas.

jhuber6 avatar Feb 27 '20 21:02 jhuber6

I found another Altivec bug. The simdi8_set(x) macro wasn't calling the byte version of vec_splats because the cast to unsigned char wasn't being applied to the full expansion of the x term and would create a vector of 4 integer values rather than 16 bytes. I am still getting incorrect answers however.

jhuber6 avatar Feb 28 '20 20:02 jhuber6
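A scalar model of the splat-width bug described above, my own sketch rather than hh-suite code: vec_splats chooses its element width from the argument's type, so a value splatted as a 32-bit int instead of a byte yields a very different byte pattern:

```c
#include <stdint.h>
#include <string.h>

/* Splat one byte across all 16 lanes -- the intended simdi8_set. */
static void splat_u8(uint8_t out[16], uint8_t x) {
    for (int i = 0; i < 16; ++i)
        out[i] = x;
}

/* Splat a 32-bit int across 4 lanes -- what the macro accidentally did
 * when the (unsigned char) cast failed to cover the whole expansion.
 * Bytes are written in native byte order. */
static void splat_u32(uint8_t out[16], uint32_t x) {
    for (int i = 0; i < 16; i += 4)
        memcpy(out + i, &x, 4);
}
```

For a small x, the byte splat fills all 16 bytes while the int splat leaves three zero bytes per lane, which is exactly the kind of silent pattern change that keeps results wrong after the shifts are fixed.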

@martin-steinegger @milot-mirdita @dmiller423 would it be possible to set up a conference call to discuss what to do next about the ppc/Summit hhsuite bugs? Joseph has done an outstanding job finding bugs, but he feels he has reached the limits of his knowledge of the code, and we are still getting different answers on Summit compared to x86. It would be great to get some input from the developers. We may be getting really close to a functional ppc version.

asedova avatar Mar 02 '20 19:03 asedova

@asedova let's discuss the topic in a conference call. Could you please send an e-mail to set up a date? themartinsteinegger gmail com

martin-steinegger avatar Mar 03 '20 23:03 martin-steinegger

@jhuber6 has started taking some of this work to get MMseqs2 to run on PowerPC.

I think

#define simdi32_set4(x,y,z,t) (simd_int){(x),(y),(z),(t)}

Might be still wrong. The equivalent _mm_set_epi32 intrinsic (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_set_epi32&expand=4347,4372,4905) does essentially:

#define simdi32_set4(x,y,z,t) (simd_int){(t),(z),(y),(x)}

Any idea if this is right or wrong?

milot-mirdita avatar Mar 15 '20 01:03 milot-mirdita

That should be correct. I checked the PPC disassembly just to make sure; it basically places an array made from the brace list in static memory and does a vector load on it. Since the x86 instruction places its last argument at the beginning of the vector register, the list is reversed on PowerPC.

jhuber6 avatar Mar 15 '20 02:03 jhuber6
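A scalar model of the lane ordering being discussed, my own sketch: _mm_set_epi32(x, y, z, t) places its last argument in lane 0, so the brace-list equivalent must list the arguments in reverse:

```c
#include <stdint.h>

typedef struct { int32_t lane[4]; } simd4;  /* stand-in for simd_int */

/* Mirrors _mm_set_epi32(x, y, z, t): the *last* argument lands in the
 * lowest lane, so the brace list is written reversed. */
static simd4 set4(int32_t x, int32_t y, int32_t z, int32_t t) {
    return (simd4){ { t, z, y, x } };
}
```

With this convention, set4(3, 2, 1, 0) puts 0 in lane 0 and 3 in lane 3, matching the Intel intrinsic's documented behavior.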

Sorry, I've been away for over a week, out of town for a funeral. Trying to catch up quickly: it seems like you are working out the issues here. Is the machine you're using to test running in big-endian mode or little? I will try to put aside some time this week to look into this further if you don't have it solved.


dmiller423 avatar Mar 15 '20 18:03 dmiller423