Additional testing using pthreads and BBAPI
@kathrynmohror and @adammoody have identified some additional things we should test/verify with SCR:
- Test using AXL with pthreads to transfer checkpoints. Does it work correctly? Do we need to add logic to throttle the transfer so it doesn't interfere with application performance? (possibly not, but just want to check to see what the performance is like)
- Does the initiation and completion check for transfers work correctly? (at small and large scales?)
- Need to be able to cancel transfer of files on restart. Requires tracking set of transfer handles used (with redundancy to account for failures). Ensure everything was cancelled successfully in SCR_Init during restart. Ensure final transfers are cancelled or complete in the case of scavenge/post-run transfers.
- Check that SCR manages any final BBAPI transfer that is completed at the end of an allocation. In particular, SCR normally runs some finalization code to register that a transfer completed successfully, so that it knows the corresponding checkpoint is valid to be used during a restart in the next job allocation. I don't think we're accounting for that right now.
You can assign this to me for now
An early observation:
Below is the script I'm using to test the BBAPI transfer. To test, do srun -n4 -N4 <this script> from within the BB mount point:
#!/bin/bash
# Get BBAPI mount point
mnt=$(mount | grep -Eo /mnt/bb_[a-z0-9]+)
rm -fr /tmp/ssd
mkdir -p /tmp/ssd
cat << EOF > ~/myscr.conf
SCR_COPY_TYPE=FILE
SCR_CLUSTER_NAME=butte
SCR_FLUSH=1
STORE=/tmp GROUP=NODE COUNT=1
STORE=$mnt GROUP=NODE COUNT=1 TYPE=bbapi
STORE=/tmp/persist GROUP=WORLD COUNT=100
CKPT=0 INTERVAL=1 GROUP=NODE STORE=/tmp TYPE=XOR SET_SIZE=4
CKPT=1 INTERVAL=4 GROUP=NODE STORE=$mnt TYPE=XOR SET_SIZE=2 OUTPUT=1
CNTLDIR=/tmp BYTES=1GB
CACHEDIR=/tmp BYTES=1GB
CACHEDIR=/tmp/ssd BYTES=50GB
SCR_CACHE_BASE=/dev/shm
SCR_DEBUG=10
EOF
SCR_CONF_FILE=~/myscr.conf ~/scr/build/examples/test_api
What I'm seeing is that when AXL tries to copy the file from /tmp/[blah]/ckpt to /mnt/[BB mount]/ckptdir/ckpt using BBAPI, it's falling back to using pthreads. This happens because AXL's BBAPI path first checks whether both the source and destination support FIEMAP (file extents). If so, it continues to use BBAPI to transfer the files; if not, it falls back to pthreads. In cases where the destination file doesn't exist yet, it calls FIEMAP on the destination directory instead. This is problematic, as the /tmp and BB file systems don't support FIEMAP on directories, while ext4 does (which is what I originally tested on). To get around this, we may want to fall back to doing a statfs() on the destination directory and checking f_type against a whitelist of filesystems that we know support extents.
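A minimal sketch of that whitelist idea from the shell, using stat -f rather than a statfs() call inside AXL; the destination path and the filesystem names are placeholders (the names follow `stat -f -c %T` output and may differ by system and coreutils version):

```bash
#!/bin/bash
# Sketch: choose the transfer type from the destination directory's
# filesystem type instead of probing FIEMAP on the directory itself.
dst_dir=/p/gpfs1/$USER                 # example destination directory
fstype=$(stat -f -c %T "$dst_dir")     # human-readable filesystem type

case "$fstype" in
  xfs|ext2/ext3|gpfs) xfer_type=bbapi   ;;  # filesystems known to support extents
  *)                  xfer_type=pthread ;;  # e.g. tmpfs: fall back
esac

echo "would transfer with: $xfer_type"
```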
Sounds good. Related to this, as far as I understand it, supporting extents is a necessary but not sufficient condition for the BBAPI. The BB software uses extents, but even if a file system supports extents, there is no guarantee that it will work with the BBAPI.
IBM says that the BBAPI should work for transfers between a BB-managed file system on the SSD and GPFS, but any other combination only works by chance if at all. No other combinations of transfer pairs have been tested or designed to work.
Some quirks I'm noticing using the BBAPI to transfer:
Can I transfer using BBAPI?
| src | dst | can xfer? |
|---|---|---|
| xfs | ext4 | Yes |
| ext4 | xfs | Yes |
| xfs | gpfs | Yes |
| ext4 | gpfs | No |
| gpfs | ext4 | No |
| ext4 | ext4 | Yes |
| tmpfs | any FS | No |
We may want to update AXL to not only whitelist BBAPI based on the source/destination filesystem types, but also check whether BBAPI can actually perform the transfer, falling back to pthreads if necessary. So in the table above, we'd fall back to pthreads for /tmp <-> /p/gpfs transfers and use BBAPI for the others. A rough sketch of that probe-then-fallback idea is below.
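Here is what that probe could look like from the shell, using the axl_cp test utility that shows up later in this thread. The paths, the 64 KB probe size, and the assumption that axl_cp accepts -X pthread are all illustrative, not a definitive implementation:

```bash
#!/bin/bash
# Try a tiny BBAPI transfer into the destination directory first; if that
# fails, fall back to pthreads for the real file.
src=/tmp/$USER/ckpt/rank_0.ckpt        # example source on node-local storage
dst=/p/gpfs1/$USER/ckpt/rank_0.ckpt    # example destination on GPFS

probe=$(mktemp --tmpdir="$(dirname "$src")" probe.XXXXXX)
dd if=/dev/urandom of="$probe" bs=64K count=1 status=none

if axl_cp -X bbapi "$probe" "$(dirname "$dst")/"; then
    xfer=bbapi
else
    xfer=pthread
fi
rm -f "$probe" "$(dirname "$dst")/$(basename "$probe")"

axl_cp -X "$xfer" "$src" "$dst"
```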
If the user explicitly requests the BBAPI in their configuration and it doesn't work, I think it's fine to just error out. I think that would either be a configuration error on their part or a system software error, either of which they probably would like to know about so that it can get fixed.
Regarding the table in https://github.com/LLNL/scr/issues/163#issuecomment-611726077: /tmp on the system I tested on was actually EXT4, not tmpfs. I've since tested tmpfs (transfers don't work at all with BBAPI), and updated the table.
@adammoody I just opened https://github.com/ECP-VeloC/AXL/pull/64 which disables fallback to pthreads by default, but you can enable it again in cmake with -DENABLE_BBAPI_FALLBACK. It's useful for me for testing, so I'd at least like to make it configurable. I've also updated it to use a FS type whitelist instead of checking if the FS supports extents.
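For reference, rebuilding AXL with the fallback re-enabled would look roughly like this (assuming ENABLE_BBAPI_FALLBACK is a boolean CMake option and an out-of-tree build directory like the one used elsewhere in this thread):

```bash
# Re-enable the pthread fallback, which PR #64 turns off by default.
cd ~/scr/deps/AXL/build
cmake -DENABLE_BBAPI_FALLBACK=ON .. && make
```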
I've been testing using the following script on one of our BBAPI machines:
#!/bin/bash
# Bypass mode is default - disable it to use AXL
export SCR_CACHE_BYPASS=0
ssdmnt=$(mount | grep -Eo /mnt/bb_[a-z0-9]+)
rm -fr /tmp/ssd
mkdir -p /tmp/ssd
gpfsmnt=$(mount | awk '/gpfs/{print $3}')
mkdir -p $gpfsmnt/`whoami`/testing
gpfsmnt=$gpfsmnt/`whoami`/testing
cat << EOF > ~/myscr.conf
SCR_COPY_TYPE=FILE
SCR_CLUSTER_NAME=`hostname`
SCR_FLUSH=1
STORE=/tmp GROUP=NODE COUNT=1 TYPE=pthread
STORE=$ssdmnt GROUP=NODE COUNT=1 TYPE=bbapi
STORE=$gpfsmnt GROUP=WORLD COUNT=1 TYPE=bbapi
CKPT=0 INTERVAL=1 GROUP=NODE STORE=/tmp TYPE=XOR SET_SIZE=4
CKPT=1 INTERVAL=4 GROUP=NODE STORE=$ssdmnt TYPE=XOR SET_SIZE=2 OUTPUT=1
CKPT=2 INTERVAL=8 GROUP=NODE STORE=$gpfsmnt TYPE=XOR SET_SIZE=2 OUTPUT=1
CNTLDIR=/tmp/hutter2 BYTES=1GB
CACHEDIR=/tmp BYTES=1GB
CACHEDIR=/tmp/ssd BYTES=50GB
SCR_CACHE_BASE=/dev/shm
SCR_DEBUG=10
EOF
SCR_CONF_FILE=~/myscr.conf ~/scr/build/examples/test_api
The output from the script shows SCR successfully using AXL in pthreads and BBAPI mode:
AXL 0.3.0: lassen788: Read and copied /mnt/bb_2c90f9ab469844e39364103cb0b0d928/hutter2/scr.defjobid/scr.dataset.44/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.44/rank_0.ckpt sucessfully @ axl_async_wait_bbapi /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:452
AXL 0.3.0: lassen788: axl_pthread_func: Read and copied /tmp/hutter2/scr.defjobid/scr.dataset.45/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.45/rank_0.ckpt, rc 0 @ axl_pthread_func /g/g0/hutter2/scr/deps/AXL/src/axl_pthread.c:209
AXL 0.3.0: lassen788: axl_pthread_func: Read and copied /tmp/hutter2/scr.defjobid/scr.dataset.46/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.46/rank_0.ckpt, rc 0 @ axl_pthread_func /g/g0/hutter2/scr/deps/AXL/src/axl_pthread.c:209
AXL 0.3.0: lassen788: axl_pthread_func: Read and copied /tmp/hutter2/scr.defjobid/scr.dataset.47/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.47/rank_0.ckpt, rc 0 @ axl_pthread_func /g/g0/hutter2/scr/deps/AXL/src/axl_pthread.c:209
AXL 0.3.0: lassen788: Read and copied /p/gpfs1/hutter2/testing/hutter2/scr.defjobid/scr.dataset.48/rank_0.ckpt to /p/gpfs1/hutter2/testing/ckpt.48/rank_0.ckpt sucessfully @ axl_async_wait_bbapi /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:452
So we know BBAPI and pthreads are working in the basic case.
Great! So far, so good.
A tip for getting the BB path: an alternative is to use the $BBPATH environment variable, which will be defined in the environment of your job. Matching on /mnt/bb_ should work if all goes well. However, the BB software is still buggy, and it often leaves behind stray /mnt/bb_ directories. If you end up on a node where the BB failed to clean up, you'll see multiple /mnt/bb_ paths, only one of which is valid for your job.
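A small tweak to the test script along these lines (just a sketch) would prefer $BBPATH and only fall back to scanning the mount table, bailing out if stray mounts make the match ambiguous:

```bash
# Prefer the job's $BBPATH; otherwise grep the mount table, and bail out if
# stray /mnt/bb_ directories from earlier jobs make the match ambiguous.
if [ -n "$BBPATH" ]; then
    mnt=$BBPATH
else
    mnt=$(mount | grep -Eo '/mnt/bb_[a-z0-9]+' | sort -u)
    if [ "$(echo "$mnt" | wc -l)" -ne 1 ]; then
        echo "error: multiple BB mount points found: $mnt" >&2
        exit 1
    fi
fi
```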
While testing last week, I ran across a FILO bug and created a PR: https://github.com/ECP-VeloC/filo/pull/9
So this is a little bizarre:
I noticed in my tests that SCR was reporting that it successfully transferred a 1MB file called "rank_0" from the SSD to GPFS using the BBAPI:
SCR v1.2.0: rank 0 on butte20: Initiating flush of dataset 22
AXL 0.3.0: butte20: Read and copied /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/tmp/hutter2/scr.defjobid/scr.dataset.22/rank_0 to /p/gpfs1/hutter2/rank_0 sucessfully @ axl_async_wait_bbapi /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:452
SCR v1.2.0: rank 0 on butte20: scr_flush_sync: 0.285768 secs, 1.048576e+06 bytes, 3.499343 MB/s, 3.499343 MB/s per proc
SCR v1.2.0: rank 0 on butte20: scr_flush_sync: Flush of dataset 22 succeeded
But the resulting file was 0 bytes.
$ ls -l /p/gpfs1/hutter2/rank_0
-rw------- 1 hutter2 hutter2 0 May 20 09:55 /p/gpfs1/hutter2/rank_0
$ ls -l /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/tmp/hutter2/scr.defjobid/scr.dataset.22/rank_0
-rw------- 1 hutter2 hutter2 1048576 May 20 09:49 /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/tmp/hutter2/scr.defjobid/scr.dataset.22/rank_0
I tried the same test with axl_cp -X bbapi ..., and got the same 0 byte file. I then manually created a 1MB file on the SSD and transferred it using the BBAPI to GPFS, and it worked:
$ dd if=/dev/zero of=/mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb bs=1M count=1
$ ls -l /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb
-rw------- 1 hutter2 hutter2 1048576 May 20 10:03 /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb
$ axl_cp -X bbapi /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb /p/gpfs1/hutter2/zero_1mb
AXL 0.3.0: butte20: Read and copied /mnt/bb_ce4375e353b3b68966779cab5adbcb9c/zero_1mb to /p/gpfs1/hutter2/zero_1mb sucessfully @ axl_async_wait_bbapi /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:452
$ ls -l /p/gpfs1/hutter2/zero_1mb
-rw------- 1 hutter2 hutter2 1048576 May 20 10:04 /p/gpfs1/hutter2/zero_1mb
I see the same failure on two of our IBM systems. I'm still investigating...
Issue 2: when I log in to a machine for the first time and run my test, I always hit this error:
SCR v1.2.0: rank 0: Initiating flush of dataset 1
AXL 0.3.0 ERROR: AXL Error with BBAPI rc: -1 @ bb_check /g/g0/hutter2/scr/deps/AXL/src/axl_async_bbapi.c:131
AXL 0.3.0 ERROR: AXL Error with BBAPI details:
"error": {
"func": "queueTagInfo",
"line": "2903",
"sourcefile": "/u/tgooding/cast_proto1.5/CAST/bb/src/xfer.cc",
"text": "Transfer definition for contribid 436 already exists for LVKey(bb.proxy.44 (192.168.129.182),bef4bbc8-2214-4077-8264-75d8f993c296), TagID((1093216,835125249),2875789842327411), handle 4294969993. Extents have already been enqueued for the transfer definition. Most likely, an incorrect contribid was specified or a different tag should be used for the transfer."
},
If I re-run the test the problem always goes away. Looks like a stale transfer handle issue. I'll try it with axl_cp and see if the same thing happens.
I'm not seeing it with axl_cp. I used axl_cp -X bbapi to copy a new file from SSD->GPFS right after logging in, and didn't get the transfer definition error. It's possible SCR/filo is exercising AXL in a more advanced way that causes the transfer error.
Fun fact: BBAPI appears to be way slower than a vanilla copy. I timed how long it took to copy a 10GB file of random data from SSD to GPFS with axl_cp -X sync, cp, and axl_cp -X bbapi, and just to make sure there was no funny business, I also timed how long it took to md5sum the file afterwards:
| type | copy | md5 |
|---|---|---|
| axl_cp -X sync | 1.8s | 32s |
| cp | 1.8s | 32s |
| axl_cp -X bbapi | 16.5s | 32s |
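The comparison boils down to something along these lines (paths and file name are placeholders, not the exact commands used above):

```bash
src=$BBPATH/random_10g           # 10GB of random data on the SSD
dst=/p/gpfs1/$USER/random_10g    # destination on GPFS

time axl_cp -X sync  "$src" "$dst";  time md5sum "$dst";  rm -f "$dst"
time cp              "$src" "$dst";  time md5sum "$dst";  rm -f "$dst"
time axl_cp -X bbapi "$src" "$dst";  time md5sum "$dst"
```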
BBAPI traffic is throttled in the network, so I’m not surprised it’s slower... but that is quite slow.
It's possible that the copy was faster because the whole file was in the page cache. That would mean the regular copies were just reading the data from memory rather than the SSD. I believe BBAPI just does a direct device-to-device SSD->GPFS transfer, which would bypass the page cache. Unfortunately, I was unable to test with zeroed caches (echo 3 > /proc/sys/vm/drop_caches) since I'm not root.
The 0-byte rank_0 file issue arises because the source is a sparse file.
$ fiemap /mnt/bb_131fa614a608da727b038ed08e6eaad4/tmp/hutter2/scr.defjobid/scr.dataset.5/rank_0
ioctl success, extents = 0
Since BBAPI transfers extents, it makes sense that it would create a zero-byte file for a sparse file that has no extents. Proof:
# create one regular 1M file, and one sparse file
$ dd if=/dev/zero of=$BBPATH/file1 bs=1M count=1
$ truncate -s 1m $BBPATH/file2
# lookup extents
$ fiemap $BBPATH/file1
ioctl success, extents = 1
$ fiemap $BBPATH/file2
ioctl success, extents = 0
# both files appear to be 1MB to the source filesystem
$ ls -l $BBPATH/file1
-rw------- 1 hutter2 hutter2 1048576 May 20 14:00 /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file1
$ ls -l $BBPATH/file2
-rw------- 1 hutter2 hutter2 1048576 May 20 14:00 /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file2
# transfer them using BBAPI
$ ~/scr/deps/AXL/build/test/axl_cp -X bbapi $BBPATH/file1 /p/gpfs1/hutter2/
$ ~/scr/deps/AXL/build/test/axl_cp -X bbapi $BBPATH/file2 /p/gpfs1/hutter2/
# sizes after transfer
$ ls -l /p/gpfs1/hutter2/
total 1024
-rw------- 1 hutter2 hutter2 1048576 May 20 14:01 file1
-rw------- 1 hutter2 hutter2 0 May 20 14:02 file2
It gets really interesting when you create and transfer a partially sparse file:
$ echo "hello world" > /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
$ truncate -s 1M /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
$ ls -l /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
-rw------- 1 hutter2 hutter2 1048576 May 20 14:11 /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
$ fiemap /mnt/bb_410f5f473f8a95cc2ac227b2c7ee1742/file3
ioctl success, extents = 1
$ ~/scr/deps/AXL/build/test/axl_cp -X bbapi $BBPATH/file3 /p/gpfs1/hutter2/
$ ls -l /p/gpfs1/hutter2/file3
-rw------- 1 hutter2 hutter2 65536 May 20 14:12 file3
$ cat /p/gpfs1/hutter2/file3
hello world
I then did another test where I created a sparse section between two extents:
echo "hello world" > file4
truncate -s 1M file4
echo "the end" >> file4
When I transferred that file using the BBAPI, it was the correct size and had the correct data.
TL;DR: don't end your files with sparse sections, or they're not going to get transferred correctly with BBAPI.
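Based on the file4 result above, one possible workaround sketch (not something AXL does today) is to force a real extent at the tail of a file that ends in a hole before handing it to BBAPI, for example by rewriting its last byte in place. This uses the same fiemap helper as above, and it only leaves the contents unchanged when the tail really is a hole (the byte being rewritten is already zero):

```bash
# Rewrite the last byte of the file so the tail gets a real extent,
# mirroring the file4 test where data at both ends transferred correctly.
f=$BBPATH/file2
size=$(stat -c %s "$f")
dd if=/dev/zero of="$f" bs=1 count=1 seek=$((size - 1)) conv=notrunc status=none
fiemap "$f"   # should now report at least one extent covering the tail
```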
Nice find, @tonyhutter ! Yeah, we should file that as a bug with IBM. It should be keeping the correct file size, even if sparse.
Can you create an issue describing the problem here: https://github.com/ibm/cast/issues
Done: https://github.com/IBM/CAST/issues/918
Another BBAPI observation:
I used test_ckpt to create a 9GB checkpoint, and then killed it off when it was flushing the checkpoint from SSD to GPFS (using BBAPI). The process was killed, but I noticed the file kept transferring until it reached the size it should have been:
$ ls -l /p/gpfs1/hutter2/
-rw------- 1 hutter2 hutter2 8053063680 May 20 17:15 rank_0
$ ls -l /p/gpfs1/hutter2/
-rw------- 1 hutter2 hutter2 8321499136 May 20 17:15 rank_0
$ ls -l /p/gpfs1/hutter2/
-rw------- 1 hutter2 hutter2 8589934592 May 20 17:15 rank_0
$ ls -l /p/gpfs1/hutter2/
-rw------- 1 hutter2 hutter2 9395240960 May 20 17:15 rank_0
This makes sense, as the BBAPI daemon is going to keep transferring the file independently of the calling process. I also tried killing the process with a SIGSEGV (segfault) to simulate a job crashing, and with a SIGKILL, and saw the same thing: the file kept transferring until completion.
We could add an on_exit() call to AXL to have it cancel all existing transfers if the process is killed. That would solve the issue for all non-SIGKILL terminations, which would be the common case.
I think that would be a good option. I can envision cases where AXL users would actually still want their transfer to complete even though their process has exited, so we'd probably want this to be configurable. Perhaps by individual transfer?
Someone has also asked whether AXL can delete their source files for them after the transfer. That might be another nice feature.
> Fun fact: BBAPI appears to be way slower than a vanilla copy. I timed how long it took to copy a 10GB file of random data from SSD to GPFS
That is expected. BB transfers are intentionally throttled via InfiniBand QoS at ~0.6GBps per node to minimize network impacts to MPI and demand I/O. So 10GB should take around 16-17 seconds when transferring in the SSD->GPFS direction. The transfers in the GPFS->SSD direction are not governed and you should see near native SSD write speeds.
@tgooding out of curiosity, can the throttling be adjusted or disabled?
"Adjusting" can be done. Its a cluster wide parameter and would require changing the InfiniBand switch settings - not something you'd want to do often.
"Disabling" is easier. I/O on port 4420 is throttled, so you just need to change to use the other defined NVMe over Fabrics port (4421).
I think you can edit the configuration file /etc/ibm/bb.cfg and change "readport" to 2. Then restart bbProxy.
Alternatively, you can pass in a custom config template via bbactivate --configtempl=<myaltconfig>. The original/default template is at /opt/ibm/bb/scripts/bb.cfg
@tgooding thank you! :+1:
@adammoody regarding "should we cancel/not-cancel existing transfers on AXL restart", I forgot about this thread from around a year ago:
https://github.com/ECP-VeloC/AXL/issues/57
I'll put my comments in there since it already has a lot of good discussion.
Regarding:
> - Need to be able to cancel transfer of files on restart. Requires tracking set of transfer handles used (with redundancy to account for failures). Ensure everything was cancelled successfully in SCR_Init during restart. Ensure final transfers are cancelled or complete in the case of scavenge/post-run transfers.
See: https://github.com/ECP-VeloC/AXL/issues/57#issuecomment-634226157 .
TL;DR: sync/pthread transfers get killed with the job automatically, and I don't see how we could cancel BBAPI transfers.
@tonyhutter: if you haven't already, look at using BB_GetTransferList() at SCR startup time to get the active transfer handles.
@tonyhutter , we also have a working test case in which a later job step cancels the transfer started by the previous job step. It's in the LLNL bitbucket, so I'll point you to that separately.