
Disable lseek in POSIX_Xfer?

Open BinDong314 opened this issue 3 years ago • 3 comments

Hi IOR team, thanks in advance. I hope to get some help understanding the lseek in POSIX_Xfer. The lseek in question is at this line of the code: https://github.com/hpc/ior/blob/main/src/aiori-POSIX.c#L695

Specifically, I ran the following command:

    srun -n 64 ior -a POSIX -t 1k -b 128m -s 1 -r -v -k -R -w

First, I ran the above command with the original IOR code, and it works fine. Second, I ran the same command after commenting out only the lseek64 call in IOR (L695 and L696). Here, "commenting out" means deleting these two lines and recompiling IOR.

Because the command asks for a data check (-R), the second run reports warnings and a FAILED comparison (see the output at the end).

According to the documentation of read(): "On files that support seeking, the read operation commences at the file offset, and the file offset is incremented by the number of bytes read." The same applies to write().

So, in this case, the FAILED comparison should not happen, mainly because each MPI rank writes its data sequentially; lseek does not seem necessary to me. (Or is lseek needed only for the first read/write?)

https://man7.org/linux/man-pages/man2/read.2.html

Best, Bin

    ior-3.3.0 $ more ior-diff-osts-read_64601820.out
    /global/cscratch1/sd/dbin/IOR-test-1-1M-osts
    IOR-3.3.0: MPI Coordinated Test of Parallel I/O
    Began               : Thu Dec 1 11:17:57 2022
    Command line        : /global/project/projectdirs/m2621/dbin/soft/ior-3.3.0/src/ior -a POSIX -t 1k -b 128m -s 1 -r -v -k -R -w
    Machine             : Linux nid12885
    Start time skew across all tasks: 0.00 sec
    TestID              : 0
    StartTime           : Thu Dec 1 11:17:57 2022
    Path                : /global/cscratch1/sd/dbin/IOR-test-1-1M-osts
    FS                  : 27503.0 TiB   Used FS: 76.0%   Inodes: 5955.2 Mi   Used Inodes: 26.0%
    Participating tasks: 64

    Options:
    api                 : POSIX
    apiVersion          :
    test filename       : testFile
    access              : single-shared-file
    type                : independent
    segments            : 1
    ordering in a file  : sequential
    ordering inter file : no tasks offsets
    nodes               : 2
    tasks               : 64
    clients per node    : 32
    repetitions         : 1
    xfersize            : 1024 bytes
    blocksize           : 128 MiB
    aggregate filesize  : 8 GiB

    Results:

    access  bw(MiB/s)  IOPS    Latency(s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter

    Commencing write performance test: Thu Dec 1 11:17:57 2022
    WARNING: Expected aggregate file size       = 8589934592.
    WARNING: Stat() of aggregate file size      = 134217728.
    WARNING: Using actual aggregate bytes moved = 8589934592.
    write   141.64     145040  0.000441    131072      1.00       0.001418  57.84     0.000287  57.84     0
    [52] FAILED comparison of buffer containing 8-byte ints:
    [52]   File name = testFile
    [52]   In transfer 0, 64 errors between buffer indices 0 and 126.
    [52]   File byte offset = 6979321856:
    [52]     Expected:  0x000000346388fde5 0000000000000008 000000346388fde5 0000000000000018
    [52]     Actual:    0x000000006388fde5 0000000000000008 000000006388fde5 0000000000000018

BinDong314, Dec 01 '22 19:12

The reason is that you use a shared file. If you use -F (individual files), it works as you expect. When you run with multiple processes, there is one initial seek so that each process finds the block it then writes sequentially; otherwise all processes overwrite each other's data.

JulianKunkel, Dec 01 '22 20:12

Hi @JulianKunkel, thanks for the response. It looks like lseek64 only needs to be called once, at the beginning of the sequential read.

So I added a static variable (around line 695) so that lseek64 is called only once:

https://github.com/hpc/ior/blob/main/src/aiori-POSIX.c#L695

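    /* seek to offset only on the first transfer; the static flag is
     * initialized once and stays set for all later calls in this process */
    static int lseek64_called = 0;
    if (!lseek64_called) {
        if (lseek64(fd, param->offset, SEEK_SET) == -1) {
            ERRF("lseek64(%d, %lld, SEEK_SET) failed", fd, param->offset);
        }
        lseek64_called = 1;
    }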

Then I ran the following command, which only reads the data:

    ior-3.3.0 $ srun -n 64 ior -a posix -t 1k -b 128m -s 1 -r -v -k

Does this sound right to you?

If it makes sense, we could find a cleaner way to add this small change.

Best, Bin

BinDong314, Dec 03 '22 17:12

I don't think any optimization of lseek() is necessary: in every benchmark I have looked at, the overhead of the call is negligible. I'd rather replace read()/write() with pread()/pwrite() to get rid of lseek() altogether.

JulianKunkel, Dec 04 '22 10:12

I will close this issue now; let me know if any open questions remain.

JulianKunkel, Dec 19 '22 18:12