JuQueen I/O test conclusions
This issue will collect conclusions from the I/O test.
@urbach, @deuzeman
I ran the tests on the pra073 allocation, using about 0.5 RD in total (probably a little less than that). My tests haven't fully completed, but after reading and writing about a PB of data I haven't had a single failure with either LEMON or LIME. Note that I used the unmodified LEMON version, which still has the potential integer overflow bug! (I did this on purpose to have a baseline.)
The tests involve 9 test configurations per volume; each configuration is first read 5 times in a row, then read again, written to disk, read back, and compared.
I tested:
L48T96
- 512 ranks, hybrid, LEMON
- 32768 ranks, pure MPI, LEMON
- 512 ranks, hybrid, LIME
- 32768 ranks, pure MPI, LIME (aborted: this was just too slow, but no problems in the 40 minutes that it ran for)

L64T128
- 1024 ranks, hybrid, LEMON
- 65536 ranks, pure MPI, LEMON
- 1024 ranks, hybrid, LIME
I have to say that GPFS is really impressive. It might use some sort of very large RAM cache, because even with 3 jobs reading and writing concurrently I get speeds of several GB/s! This, however, might also be a bit of a limitation of this test: maybe it would be worthwhile to introduce random waiting periods to make sure the cached data has actually been flushed to disk before rereading.
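Something along these lines is what I have in mind for the random waiting period. This is just a rough sketch: the helper name, the 120 s upper bound and the plain rand()/sleep() calls are all arbitrary choices of mine, and in the actual test the delay would go between the write and the verification reread (e.g. on rank 0, followed by a barrier).

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Sleep for a random number of seconds (0..max_seconds) between the write
   and the verification reread, so the reread is less likely to be served
   straight from the GPFS cache. */
static void random_settle_delay(unsigned max_seconds) {
  unsigned wait = (unsigned) rand() % (max_seconds + 1);
  printf("# Sleeping %u s before the reread to let the cache drain.\n", wait);
  sleep(wait);
}

int main(void) {
  srand((unsigned) time(NULL) ^ (unsigned) getpid());
  random_settle_delay(120);  /* the 120 s bound is arbitrary */
  return 0;
}
```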
@urbach For the LIME run it takes a very long time (~180 s or more for 48^3x96) to write configurations. Could the lock-up maybe have been a write that was just so slow it appeared to be a lock-up?
hmm, I don't know actually. I tried now with the new lemon version and so far it seems to be working.
That's sounding very promising! Any indication of the performance difference between the hybrid and pure MPI codes?
As an aside, I just got an email from David about their issues. It seems he sent it just to me, so I think it would be good to share this.
From his description:
The point is that we have got I/O errors when using LEMON and a 24^3x48 lattice on Fermi (BG/Q) with 512 MPI processes and 64 OpenMP threads. I don't know if that is expected or not, but I thought the bug you were trying to catch happens only when the local lattice is too large... Indeed, that was what we had seen up to now: this type of error occurred when running a 48^3x96 lattice on the same (or double) partition.
And the associated error message:
WARNING, writeout of .conf.tmp returned no error, but verification discovered errors. For gauge file .conf.tmp, calculated and stored values for SciDAC checksum A do not match. Calculated : A = 0xbc7d4996 B = 0x00f2cc1a. Read from LIME headers: A = 0xefb54a75 B = 0xcd6124fc. Potential disk or MPI I/O error. Aborting...
So it seems the old "lemon" bug is back. Have either of you seen this?
Ah, and do we know if there was a firmware upgrade during the last maintenance cycle? Could it be that this still has to be done at Fermi? Wishful thinking here...
That's a very large partition to run a 24^3x48 on..
- I guess I'll try a smaller volume too just to be on the safe side.
- It would also be wise to repeat the test at CINECA.
- I will also try the test on one rack but with the L48T96 volume.
- Finally, these were all runs with 4D parallelization. Maybe the fact that they are using 3D is the culprit?
That's sounding very promising! Any indication of the performance difference between the hybrid and pure MPI codes?
The performance difference is roughly a factor of two with the hybrid code being faster. Hybrid LIME performance is actually not that terrible (about a factor of 4-6 I guess)
So it seems the old "lemon" bug is back. Have either of you seen this?
Hmm... one of the test configurations that I used (the hot start test of Dxx in gregorio's ARCH, configuration 1192) had a mismatch, but this was probably caused during the writing of that configuration.
Ah, and do we know if there was a firmware upgrade during the last maintenance cycle? Could it be that this still has to be done at Fermi? Wishful thinking here...
I'm sure good records are kept of what exactly changed during the various upgrades the machine has undergone so far.
Hybrid LIME performance is actually not that terrible (about a factor of 4-6 I guess)
Actually, no, the writing performance is absolutely abysmal! (a factor of 80 or so...)
An interesting measurement of how performance scales from a midplane to a rack:
midplane
Reading gauge field conf.1199 for reread test. Iteration 1, reread 4
# Constructing LEMON reader for file conf.1199 ...
# Time spent reading 6.12 Gb was 2.52 s.
# Reading speed: 2.42 Gb/s (4.73 Mb/s per MPI process).
# Scidac checksums for gaugefield conf.1199:
# Calculated : A = 0x87166557 B = 0xabe0e0e9.
# Read from LIME headers: A = 0x87166557 B = 0xabe0e0e9.
# Reading ildg-format record:
# Precision = 64 bits (double).
# Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.
# Input parameters:
# Precision = 64 bits (double).
# Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.
# Writing gauge field to conf.1199.copy. Iteration 1, reread 4
# Constructing LEMON writer for file conf.1199.copy for append = 0
# Time spent writing 6.12 Gb was 1.70 s.
# Writing speed: 3.60 Gb/s (7.03 Mb/s per MPI process).
# Scidac checksums for gaugefield conf.1199.copy:
# Calculated : A = 0x87166557 B = 0xabe0e0e9.
# Write completed, verifying write...
# Constructing LEMON reader for file conf.1199.copy ...
# Time spent reading 6.12 Gb was 1.83 s.
# Reading speed: 3.34 Gb/s (6.52 Mb/s per MPI process).
# Scidac checksums for gaugefield conf.1199.copy:
# Calculated : A = 0x87166557 B = 0xabe0e0e9.
# Read from LIME headers: A = 0x87166557 B = 0xabe0e0e9.
# Reading ildg-format record:
# Precision = 64 bits (double).
# Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.
# Input parameters:
# Precision = 64 bits (double).
# Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.
# Write successfully verified.
rack
Reading gauge field conf.1199 for reread test. Iteration 1, reread 4
# Constructing LEMON reader for file conf.1199 ...
# Time spent reading 19.3 Gb was 3.61 s.
# Reading speed: 5.35 Gb/s (5.22 Mb/s per MPI process).
# Scidac checksums for gaugefield conf.1199:
# Calculated : A = 0xce02a1a2 B = 0x96879c6f.
# Read from LIME headers: A = 0xce02a1a2 B = 0x96879c6f.
# Reading ildg-format record:
# Precision = 64 bits (double).
# Lattice size: LX = 64, LY = 64, LZ = 64, LT = 128.
# Input parameters:
# Precision = 64 bits (double).
# Lattice size: LX = 64, LY = 64, LZ = 64, LT = 128.
# Writing gauge field to conf.1199.copy. Iteration 1, reread 4
# Constructing LEMON writer for file conf.1199.copy for append = 0
# Time spent writing 19.3 Gb was 2.88 s.
# Writing speed: 6.70 Gb/s (6.55 Mb/s per MPI process).
# Scidac checksums for gaugefield conf.1199.copy:
# Calculated : A = 0xce02a1a2 B = 0x96879c6f.
# Write completed, verifying write...
# Constructing LEMON reader for file conf.1199.copy ...
# Time spent reading 19.3 Gb was 1.33 s.
# Reading speed: 14.5 Gb/s (14.2 Mb/s per MPI process).
# Scidac checksums for gaugefield conf.1199.copy:
# Calculated : A = 0xce02a1a2 B = 0x96879c6f.
# Read from LIME headers: A = 0xce02a1a2 B = 0x96879c6f.
# Reading ildg-format record:
# Precision = 64 bits (double).
# Lattice size: LX = 64, LY = 64, LZ = 64, LT = 128.
# Input parameters:
# Precision = 64 bits (double).
# Lattice size: LX = 64, LY = 64, LZ = 64, LT = 128.
# Write successfully verified.
Hmm... I think I really need to add a delay there or something, because the reread is so much faster... I think there's a strong possibility we're reading from some cache... What's on disk might be completely borked.
Any indication of the performance difference between the hybrid and pure MPI codes?
The performance difference is roughly a factor of two with the hybrid code being faster.
In terms of total runtime it's more like a factor of 4-5 though!
hybrid: 00:38:20
pure MPI: 02:37:39
I've now been running successfully with LEMON since yesterday. I/O checks are enabled and no problems have occurred so far. The time for a write of a 48^3x96 lattice is 2.5 seconds (the LIME write, when it worked, took 195 seconds); all of this on a 512-node partition.
I'm not sure which firmware/software version FERMI runs. JUQUEEN is on V1R2M0. Judging from the uptime and kernel version of the login node on FERMI, I'd guess that FERMI is on a stone-age version...
To make sure: did we test what happens when reading a LEMON-written ensemble with LIME and vice versa? Or at least when reading an older ensemble with the new version?
The reason I'm asking is that I had to do some manual manipulation of the block sizes. I'm a little worried that the I/O round trip might be self-consistent (since writing and reading use the same data layout), while the data layout within the file is actually permuted with respect to the ILDG definition. It would be rather bad to discover this a few months from now...
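To make the concern concrete, this is roughly the kind of independent check I have in mind: read the ildg-binary-data record back in plain file order with c-lime and recompute the SciDAC checksum from scratch. If the on-disk layout were permuted with respect to the ILDG site ordering, the recomputed A/B would disagree with the stored scidacChecksum record even though a LEMON write/read round trip looks perfectly consistent. This is only an untested sketch: it assumes zlib's crc32() and the plain c-lime reader API, a gauge site size of 576 bytes, and that the checksum is accumulated over the raw big-endian site data exactly as it sits in the file (if it is accumulated after byte swapping instead, a swap step would need to be added).

```c
#include <stdio.h>
#include <string.h>
#include <inttypes.h>
#include <zlib.h>   /* crc32() */
#include <lime.h>   /* c-lime reader API */

#define SITE_BYTES 576  /* 4 links x 9 complex x 16 bytes per gauge site */

/* rotate a 32-bit word left by r bits (r < 32) */
static uint32_t rotl32(uint32_t x, unsigned r) {
  return r ? (x << r) | (x >> (32 - r)) : x;
}

int main(int argc, char *argv[]) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s <gauge configuration>\n", argv[0]);
    return 1;
  }
  FILE *fp = fopen(argv[1], "r");
  if (fp == NULL) { perror("fopen"); return 1; }
  LimeReader *reader = limeCreateReader(fp);

  uint32_t suma = 0, sumb = 0;
  unsigned char site[SITE_BYTES];

  while (limeReaderNextRecord(reader) == LIME_SUCCESS) {
    if (strcmp(limeReaderType(reader), "ildg-binary-data") != 0)
      continue;  /* skip xlf-info, ildg-format, ... */
    n_uint64_t nsites = limeReaderBytes(reader) / SITE_BYTES;
    for (n_uint64_t s = 0; s < nsites; s++) {
      n_uint64_t nbytes = SITE_BYTES;
      limeReaderReadData(site, &nbytes, reader);
      /* SciDAC checksum: CRC32 of each site's data, rotated left by the
         global (here: file-order) site index mod 29 / mod 31, then XORed */
      uint32_t crc = (uint32_t) crc32(0L, site, SITE_BYTES);
      suma ^= rotl32(crc, (unsigned) (s % 29));
      sumb ^= rotl32(crc, (unsigned) (s % 31));
    }
    break;
  }
  printf("# Recomputed: A = 0x%08" PRIx32 " B = 0x%08" PRIx32 "\n", suma, sumb);
  limeDestroyReader(reader);
  fclose(fp);
  return 0;
}
```

Comparing the printed values by hand against the suma/sumb in the scidacChecksum record (or against the values in the job log) would already answer the question.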
To make sure: did we test what happens when reading a LEMON-written ensemble with LIME and vice versa? Or at least when reading an older ensemble with the new version?
The ensembles here were written with LEMON and I read them with both LIME and LEMON. I didn't test the converse.
As for the new version, I will test this later today, thanks for the heads up.
The ensembles here were written with LEMON and I read them with both LIME and LEMON. I didn't test the converse.
Thanks! I think that already covers the problem, actually. But it can't hurt to be thorough :).
Hmm. I just managed to crash a midplane on reading with LEMON... and another one just now... it didn't even manage to begin reading...
I rewrote this on top of the smearing branch and I'm now using the buffers framework.
Really? That's worrying and weird. The only point of interference would be the definition of g_gf, rather than the previous underlying buffer. But I fail to see how that could cause the problem. Any error messages?
No, it just locked up in two different places... I'll investigate some more. The scheduler was subsequently unable to kill the job and I guess the midplane was rebooted.
Oh, looky here in the MOTD:
*******************************************************************************************
* Friday 15.3.13 13:36 GPFS read(!) access to /work hangs, write access is o.k.
* - BG/Q jobs may abort due to IO errors
* - BG/Q jobs may get stuck in REMOVE PENDING, when being cancelled
* - front-end processes hang when reading files
* The situation is expected to be solved by Saturday morning 16.3.13.
*******************************************************************************************
Not the fault of the smearing codebase then! Maybe my pummelling of the I/O subsystem yesterday was problematic after all.
Ahuh... Doesn't seem to make a huge amount of sense, but I guess we shouldn't be too worried then?
Ahuh... Doesn't seem to make a huge amount of sense, but I guess we shouldn't be too worried then?
Well, sure. This means though that unless they fix it we can't really run without risking wasting computing time...
True, but it should only be half a day then.
By the way, this is what I've come up with as a filesystem test for now:
https://github.com/kostrzewa/tmLQCD/blob/IO_test/test_io.c
Neat! I've already pointed David at this discussion and I think this would be a good tool to diagnose the issues at Fermi. It's a very handy piece of code to have lying around going forward, too.
On Friday, March 15, 2013, 06:53:08, Albert Deuzeman wrote:
Neat! I've already pointed David at this discussion and I think this would be a good tool to diagnose the issues at Fermi. It's a very handy piece of code to have lying around going forward, too.
Hi, Thanks for the code! I'll run it on Fermi and let you know. Best,
David
Dear David,
Thanks for running the test on FERMI. Please note that I've just updated the code and added a little bit of documentation (README.test_io), which should get you up and running. Just fetch my "IO_test" branch from GitHub.
Cheers!
For the record: on SuperMUC, LEMON appears to work and is 8 times or so faster than LIME on a 512-node partition.
True, but it should only be half a day then.
still broken... :) They actually even disabled logins now...
So I ran the test with very small local volumes (24^3x48 on a whole rack) and very large local volumes (96^3x192 on one midplane), and there were no failures on JuQueen. I think we can trust LEMON to do the right thing! I will update my LEMON branch now and run the standard test again, just to make sure that there is no problem due to the update that was done.
Excellent! I'll wait for that final test, then we can officially push the new version of the library.
Sorry folks, bad news. For me the following occurred on 1024 nodes on JuQueen with @deuzeman's LEMON branch:
in conf.0132 the following checksum is stored:
<scidacChecksum>
<version>1.0</version>
<suma>8d8fd1f6</suma>
<sumb>59dc0fbb</sumb>
</scidacChecksum>
in the log file I find the following:
# Scidac checksums for gaugefield .conf.tmp:
# Calculated : A = 0x8d8fd1f6 B = 0x59dc0fbb.
# Write completed, verifying write...
# Constructing LEMON reader for file .conf.tmp ...
found header xlf-info, will now read the message
found header ildg-format, will now read the message
found header ildg-binary-data, will now read the message
# Time spent reading 6.12 Gb was 465 ms.
# Reading speed: 13.2 Gb/s (12.8 Mb/s per MPI process).
found header scidac-checksum, will now read the message
# Scidac checksums for gaugefield .conf.tmp:
# Calculated : A = 0x8d8fd1f6 B = 0x59dc0fbb.
# Read from LIME headers: A = 0x8d8fd1f6 B = 0x59dc0fbb.
# Reading ildg-format record:
# Precision = 64 bits (double).
# Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.
# Input parameters:
# Precision = 64 bits (double).
# Lattice size: LX = 48, LY = 48, LZ = 48, LT = 96.
# Write successfully verified.
# Renaming .conf.tmp to conf.0132.
so far, so good. Now the newly started job:
# Calculated : A = 0x67f40d51 B = 0x34000ed4.
# Read from LIME headers: A = 0x8d8fd1f6 B = 0x59dc0fbb.
To me this seems to be a problem with the filesystem?! But, and that is actually really bad, it means LEMON-written configurations would not be safe?!